* crush multipick anomaly
@ 2017-01-26 3:05 Sage Weil
2017-01-26 11:13 ` Loic Dachary
2017-02-13 10:36 ` Loic Dachary
0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-01-26 3:05 UTC (permalink / raw)
To: ceph-devel
This is a longstanding bug,
http://tracker.ceph.com/issues/15653
that causes low-weighted devices to get more data than they should. Loic's
recent activity resurrected discussion on the original PR
https://github.com/ceph/ceph/pull/10218
but since it's closed and almost nobody will see it I'm moving the
discussion here.
The main news is that I have a simple adjustment for the weights that
works (almost perfectly) for the 2nd round of placements. The solution is
pretty simple, although as with most probabilities it tends to make my
brain hurt.
The idea is that, on the second round, the original weight for the small
OSD (call it P(pick small)) isn't what we should use. Instead, we want
P(pick small | first pick not small). Since P(a|b) (the probability of a
given b) is P(a && b) / P(b),
P(pick small | first pick not small)
= P(pick small && first pick not small) / P(first pick not small)
The last term is easy to calculate,
P(first pick not small) = (total_weight - small_weight) / total_weight
and the && term is the distribution we're trying to produce. For example,
if small has 1/10 the weight, then we should see 1/10th of the PGs have
their second replica be the small OSD. So
P(pick small && first pick not small) = small_weight / total_weight
Putting those together,
P(pick small | first pick not small)
= P(pick small && first pick not small) / P(first pick not small)
= (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
= small_weight / (total_weight - small_weight)
That is, on the second round we should adjust the weights as above so
that we get the right distribution of second choices. It turns out it
works to adjust *all* weights like this to get the conditional probability
that they weren't already chosen.
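To sanity-check this, here is a standalone Monte Carlo sketch (mine, not the straw2 code; the trial count and RNG seed are arbitrary) that draws two picks without replacement from the [99 99 99 99 4] bucket, once with raw weights and once with the w / (W - w) adjustment:

```python
import random

# Two picks without replacement from a [99, 99, 99, 99, 4] bucket.
# With raw weights the small device is over-represented among second
# picks; reweighting every device to w / (W - w) makes the second-pick
# distribution land close to the intended w / W.
weights = [99, 99, 99, 99, 4]
W = sum(weights)
trials = 200_000
rng = random.Random(42)

def second_pick_fraction(adjust):
    """Fraction of trials whose *second* pick is the small device (index 4)."""
    hits = 0
    for _ in range(trials):
        first = rng.choices(range(len(weights)), weights=weights)[0]
        w2 = [w / (W - w) if adjust else float(w) for w in weights]
        w2[first] = 0.0                       # can't pick the same device twice
        second = rng.choices(range(len(weights)), weights=w2)[0]
        hits += (second == 4)
    return hits / trials

target = weights[4] / W                       # 4/400 = 0.01
raw = second_pick_fraction(adjust=False)      # skews high, near 0.013
adjusted = second_pick_fraction(adjust=True)  # close to the 0.01 target
print(target, raw, adjusted)
```

On my reading of the derivation, `raw` comes out around 0.013 while `adjusted` lands within a fraction of a percent of 0.010, mirroring the crushtool numbers for device 6.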
I have a branch that hacks this into straw2 and it appears to work
properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
current code, you get
$ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
rule 0 (data), x = 0..40000000, numrep = 2..2
rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
device 0: 19765965 [9899364,9866601]
device 1: 19768033 [9899444,9868589]
device 2: 19769938 [9901770,9868168]
device 3: 19766918 [9898851,9868067]
device 6: 929148 [400572,528576]
which is very close for the first replica (primary), but way off for the
second. With my hacky change,
rule 0 (data), x = 0..40000000, numrep = 2..2
rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
device 0: 19797315 [9899364,9897951]
device 1: 19799199 [9899444,9899755]
device 2: 19801016 [9901770,9899246]
device 3: 19797906 [9898851,9899055]
device 6: 804566 [400572,403994]
which is quite close, but still skewing slightly high (by a bit less than
1%).
Next steps:
1- generalize this for >2 replicas
2- figure out why it skews high
3- make this work for multi-level hierarchical descent
sage
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-01-26 3:05 crush multipick anomaly Sage Weil
@ 2017-01-26 11:13 ` Loic Dachary
2017-01-26 11:51 ` kefu chai
2017-02-03 14:37 ` Loic Dachary
2017-02-13 10:36 ` Loic Dachary
1 sibling, 2 replies; 70+ messages in thread
From: Loic Dachary @ 2017-01-26 11:13 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Hi Sage,
Still trying to understand what you did :-) I have one question below.
On 01/26/2017 04:05 AM, Sage Weil wrote:
> This is a longstanding bug,
>
> http://tracker.ceph.com/issues/15653
>
> that causes low-weighted devices to get more data than they should. Loic's
> recent activity resurrected discussion on the original PR
>
> https://github.com/ceph/ceph/pull/10218
>
> but since it's closed and almost nobody will see it I'm moving the
> discussion here.
>
> The main news is that I have a simple adjustment for the weights that
> works (almost perfectly) for the 2nd round of placements. The solution is
> pretty simple, although as with most probabilities it tends to make my
> brain hurt.
>
> The idea is that, on the second round, the original weight for the small
> OSD (call it P(pick small)) isn't what we should use. Instead, we want
> P(pick small | first pick not small). Since P(a|b) (the probability of a
> given b) is P(a && b) / P(b),
For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>
> P(pick small | first pick not small)
> = P(pick small && first pick not small) / P(first pick not small)
>
> The last term is easy to calculate,
>
> P(first pick not small) = (total_weight - small_weight) / total_weight
>
> and the && term is the distribution we're trying to produce.
https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
> For example,
> if small has 1/10 the weight, then we should see 1/10th of the PGs have
> their second replica be the small OSD. So
>
> P(pick small && first pick not small) = small_weight / total_weight
>
> Putting those together,
>
> P(pick small | first pick not small)
> = P(pick small && first pick not small) / P(first pick not small)
> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> = small_weight / (total_weight - small_weight)
>
> That is, on the second round we should adjust the weights as above so
> that we get the right distribution of second choices. It turns out it
> works to adjust *all* weights like this to get the conditional probability
> that they weren't already chosen.
>
> I have a branch that hacks this into straw2 and it appears to work
This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-01-26 11:13 ` Loic Dachary
@ 2017-01-26 11:51 ` kefu chai
2017-02-03 14:37 ` Loic Dachary
1 sibling, 0 replies; 70+ messages in thread
From: kefu chai @ 2017-01-26 11:51 UTC (permalink / raw)
To: Loic Dachary; +Cc: Sage Weil, ceph-devel
On Thu, Jan 26, 2017 at 7:13 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi Sage,
>
> Still trying to understand what you did :-) I have one question below.
>
> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> This is a longstanding bug,
>>
>> http://tracker.ceph.com/issues/15653
>>
>> that causes low-weighted devices to get more data than they should. Loic's
>> recent activity resurrected discussion on the original PR
>>
>> https://github.com/ceph/ceph/pull/10218
>>
>> but since it's closed and almost nobody will see it I'm moving the
>> discussion here.
>>
>> The main news is that I have a simple adjustment for the weights that
>> works (almost perfectly) for the 2nd round of placements. The solution is
>> pretty simple, although as with most probabilities it tends to make my
>> brain hurt.
>>
>> The idea is that, on the second round, the original weight for the small
>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>> given b) is P(a && b) / P(b),
>
> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>
>>
>> P(pick small | first pick not small)
>> = P(pick small && first pick not small) / P(first pick not small)
>>
>> The last term is easy to calculate,
>>
>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>
>> and the && term is the distribution we're trying to produce.
>
> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
A joint event of A and B means the two events occur together. Here, A
and B are two events, so the notation P(A && B), or P(A∩B), stands for
the probability that event A and event B happen together. In our case,
"a" denotes the event "the second pick is small" and "b" denotes "the
first pick is not small", so P(a∩b) is the probability that the first
pick is not small **and** the second pick is small. Maybe you can also
reference
https://en.wikipedia.org/wiki/Joint_probability_distribution
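To make that concrete with the [99 99 99 99 4] bucket from Sage's test, here is a small sketch of mine using exact fractions, just to illustrate the definition:

```python
from fractions import Fraction

# Kolmogorov definition P(a|b) = P(a && b) / P(b) applied to the bucket
# [99, 99, 99, 99, 4].  Event a: "second pick is the small device";
# event b: "first pick is not the small device".
small = Fraction(4)
total = Fraction(4 * 99 + 4)            # 400

p_b = (total - small) / total           # P(first pick not small) = 396/400
p_a_and_b = small / total               # desired joint distribution: 4/400
p_a_given_b = p_a_and_b / p_b           # = 4/396

# Matches the closed form small_weight / (total_weight - small_weight).
assert p_a_given_b == small / (total - small)
print(p_a_given_b)                      # prints 1/99
```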
--
Regards
Kefu Chai
* Re: crush multipick anomaly
2017-01-26 11:13 ` Loic Dachary
2017-01-26 11:51 ` kefu chai
@ 2017-02-03 14:37 ` Loic Dachary
2017-02-03 14:47 ` Sage Weil
1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-03 14:37 UTC (permalink / raw)
To: Sage Weil, ceph-devel
On 01/26/2017 12:13 PM, Loic Dachary wrote:
> Hi Sage,
>
> Still trying to understand what you did :-) I have one question below.
>
> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> This is a longstanding bug,
>>
>> http://tracker.ceph.com/issues/15653
>>
>> that causes low-weighted devices to get more data than they should. Loic's
>> recent activity resurrected discussion on the original PR
>>
>> https://github.com/ceph/ceph/pull/10218
>>
>> but since it's closed and almost nobody will see it I'm moving the
>> discussion here.
>>
>> The main news is that I have a simple adjustment for the weights that
>> works (almost perfectly) for the 2nd round of placements. The solution is
>> pretty simple, although as with most probabilities it tends to make my
>> brain hurt.
>>
>> The idea is that, on the second round, the original weight for the small
>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>> given b) is P(a && b) / P(b),
>
> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>
>>
>> P(pick small | first pick not small)
>> = P(pick small && first pick not small) / P(first pick not small)
>>
>> The last term is easy to calculate,
>>
>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>
>> and the && term is the distribution we're trying to produce.
>
> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>
>> For example,
>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> their second replica be the small OSD. So
>>
>> P(pick small && first pick not small) = small_weight / total_weight
>>
>> Putting those together,
>>
>> P(pick small | first pick not small)
>> = P(pick small && first pick not small) / P(first pick not small)
>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>> = small_weight / (total_weight - small_weight)
>>
>> That is, on the second round we should adjust the weights as above so
>> that we get the right distribution of second choices. It turns out it
>> works to adjust *all* weights like this to get the conditional probability
>> that they weren't already chosen.
>>
>> I have a branch that hacks this into straw2 and it appears to work
>
> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
In
https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
double neww = oldw / (bucketw - oldw) * bucketw;
I don't get why we need "* bucketw" at the end ?
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-03 14:37 ` Loic Dachary
@ 2017-02-03 14:47 ` Sage Weil
2017-02-03 15:08 ` Loic Dachary
2017-02-03 15:26 ` Dan van der Ster
0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-03 14:47 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
On Fri, 3 Feb 2017, Loic Dachary wrote:
> On 01/26/2017 12:13 PM, Loic Dachary wrote:
> > Hi Sage,
> >
> > Still trying to understand what you did :-) I have one question below.
> >
> > On 01/26/2017 04:05 AM, Sage Weil wrote:
> >> This is a longstanding bug,
> >>
> >> http://tracker.ceph.com/issues/15653
> >>
> >> that causes low-weighted devices to get more data than they should. Loic's
> >> recent activity resurrected discussion on the original PR
> >>
> >> https://github.com/ceph/ceph/pull/10218
> >>
> >> but since it's closed and almost nobody will see it I'm moving the
> >> discussion here.
> >>
> >> The main news is that I have a simple adjustment for the weights that
> >> works (almost perfectly) for the 2nd round of placements. The solution is
> >> pretty simple, although as with most probabilities it tends to make my
> >> brain hurt.
> >>
> >> The idea is that, on the second round, the original weight for the small
> >> OSD (call it P(pick small)) isn't what we should use. Instead, we want
> >> P(pick small | first pick not small). Since P(a|b) (the probability of a
> >> given b) is P(a && b) / P(b),
> >
> > For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
> >
> >>
> >> P(pick small | first pick not small)
> >> = P(pick small && first pick not small) / P(first pick not small)
> >>
> >> The last term is easy to calculate,
> >>
> >> P(first pick not small) = (total_weight - small_weight) / total_weight
> >>
> >> and the && term is the distribution we're trying to produce.
> >
> > https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
> >
> >> For example,
> >> if small has 1/10 the weight, then we should see 1/10th of the PGs have
> >> their second replica be the small OSD. So
> >>
> >> P(pick small && first pick not small) = small_weight / total_weight
> >>
> >> Putting those together,
> >>
> >> P(pick small | first pick not small)
> >> = P(pick small && first pick not small) / P(first pick not small)
> >> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >> = small_weight / (total_weight - small_weight)
> >>
> >> That is, on the second round we should adjust the weights as above so
> >> that we get the right distribution of second choices. It turns out it
> >> works to adjust *all* weights like this to get the conditional probability
> >> that they weren't already chosen.
> >>
> >> I have a branch that hacks this into straw2 and it appears to work
> >
> > This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>
> In
>
> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>
> double neww = oldw / (bucketw - oldw) * bucketw;
>
> I don't get why we need "* bucketw" at the end ?
It's just to keep the values within a reasonable range so that we don't
lose precision by dropping down into small integers.
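A small illustration (plain Python of mine, standing in for the weight arithmetic in the commit) of why the trailing `* bucketw` is safe: straw2 only compares devices against each other, so multiplying every adjusted weight by the same constant changes no ratio between devices; it just keeps the values away from tiny fractions.

```python
# Rescaling all adjusted weights by a common factor (bucketw) changes no
# ratio between devices, so the selection distribution is unchanged; it
# only keeps the values in a comfortable numeric range.
weights = [99.0, 99.0, 99.0, 99.0, 4.0]
bucketw = sum(weights)

plain    = [w / (bucketw - w) for w in weights]
rescaled = [w / (bucketw - w) * bucketw for w in weights]

for p, r in zip(plain, rescaled):
    assert abs(r / p - bucketw) < 1e-9      # uniform factor, nothing else
print(rescaled[0] / rescaled[4], plain[0] / plain[4])   # identical ratios
```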
I futzed around with this some more last week trying to get the third
replica to work and ended up doubting that this piece is correct. The
ratio between the big and small OSDs in my [99 99 99 99 4] example varies
slightly from what I would expect from first principles and what I get out
of this derivation by about 1%, which would explain the bias I was seeing.
I'm hoping we can find someone with a strong stats/probability background
and loads of free time who can tackle this...
sage
* Re: crush multipick anomaly
2017-02-03 14:47 ` Sage Weil
@ 2017-02-03 15:08 ` Loic Dachary
2017-02-03 18:54 ` Loic Dachary
2017-02-03 15:26 ` Dan van der Ster
1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-03 15:08 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 02/03/2017 03:47 PM, Sage Weil wrote:
> On Fri, 3 Feb 2017, Loic Dachary wrote:
>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> Still trying to understand what you did :-) I have one question below.
>>>
>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>> This is a longstanding bug,
>>>>
>>>> http://tracker.ceph.com/issues/15653
>>>>
>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>> recent activity resurrected discussion on the original PR
>>>>
>>>> https://github.com/ceph/ceph/pull/10218
>>>>
>>>> but since it's closed and almost nobody will see it I'm moving the
>>>> discussion here.
>>>>
>>>> The main news is that I have a simple adjustment for the weights that
>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>> pretty simple, although as with most probabilities it tends to make my
>>>> brain hurt.
>>>>
>>>> The idea is that, on the second round, the original weight for the small
>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>> given b) is P(a && b) / P(b),
>>>
>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>
>>>>
>>>> P(pick small | first pick not small)
>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>
>>>> The last term is easy to calculate,
>>>>
>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>
>>>> and the && term is the distribution we're trying to produce.
>>>
>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>
>>>> For example,
>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>> their second replica be the small OSD. So
>>>>
>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>
>>>> Putting those together,
>>>>
>>>> P(pick small | first pick not small)
>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>> = small_weight / (total_weight - small_weight)
>>>>
>>>> That is, on the second round we should adjust the weights as above so
>>>> that we get the right distribution of second choices. It turns out it
>>>> works to adjust *all* weights like this to get the conditional probability
>>>> that they weren't already chosen.
>>>>
>>>> I have a branch that hacks this into straw2 and it appears to work
>>>
>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>
>> In
>>
>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>
>> double neww = oldw / (bucketw - oldw) * bucketw;
>>
>> I don't get why we need "* bucketw" at the end ?
>
> It's just to keep the values within a reasonable range so that we don't
> lose precision by dropping down into small integers.
>
> I futzed around with this some more last week trying to get the third
> replica to work and ended up doubting that this piece is correct. The
> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
> slightly from what I would expect from first principles and what I get out
> of this derivation by about 1%, which would explain the bias I was seeing.
>
> I'm hoping we can find someone with a strong stats/probability background
> and loads of free time who can tackle this...
>
It would help to formulate the problem as a self-contained puzzle to present to a mathematician. I tried to do that last week but failed. I'll give it another shot and submit a draft, hoping that even a bad start could lead to something better ;-)
> sage
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-03 14:47 ` Sage Weil
2017-02-03 15:08 ` Loic Dachary
@ 2017-02-03 15:26 ` Dan van der Ster
2017-02-03 17:37 ` Dan van der Ster
1 sibling, 1 reply; 70+ messages in thread
From: Dan van der Ster @ 2017-02-03 15:26 UTC (permalink / raw)
To: Sage Weil; +Cc: Loic Dachary, ceph-devel
On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 3 Feb 2017, Loic Dachary wrote:
>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>> > Hi Sage,
>> >
>> > Still trying to understand what you did :-) I have one question below.
>> >
>> > On 01/26/2017 04:05 AM, Sage Weil wrote:
>> >> This is a longstanding bug,
>> >>
>> >> http://tracker.ceph.com/issues/15653
>> >>
>> >> that causes low-weighted devices to get more data than they should. Loic's
>> >> recent activity resurrected discussion on the original PR
>> >>
>> >> https://github.com/ceph/ceph/pull/10218
>> >>
>> >> but since it's closed and almost nobody will see it I'm moving the
>> >> discussion here.
>> >>
>> >> The main news is that I have a simple adjustment for the weights that
>> >> works (almost perfectly) for the 2nd round of placements. The solution is
>> >> pretty simple, although as with most probabilities it tends to make my
>> >> brain hurt.
>> >>
>> >> The idea is that, on the second round, the original weight for the small
>> >> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>> >> P(pick small | first pick not small). Since P(a|b) (the probability of a
>> >> given b) is P(a && b) / P(b),
>> >
>> > For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>> >
>> >>
>> >> P(pick small | first pick not small)
>> >> = P(pick small && first pick not small) / P(first pick not small)
>> >>
>> >> The last term is easy to calculate,
>> >>
>> >> P(first pick not small) = (total_weight - small_weight) / total_weight
>> >>
>> >> and the && term is the distribution we're trying to produce.
>> >
>> > https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>> >
>> >> For example,
>> >> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> >> their second replica be the small OSD. So
>> >>
>> >> P(pick small && first pick not small) = small_weight / total_weight
>> >>
>> >> Putting those together,
>> >>
>> >> P(pick small | first pick not small)
>> >> = P(pick small && first pick not small) / P(first pick not small)
>> >> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>> >> = small_weight / (total_weight - small_weight)
>> >>
>> >> That is, on the second round we should adjust the weights as above so
>> >> that we get the right distribution of second choices. It turns out it
>> >> works to adjust *all* weights like this to get the conditional probability
>> >> that they weren't already chosen.
>> >>
>> >> I have a branch that hacks this into straw2 and it appears to work
>> >
>> > This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>
>> In
>>
>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>
>> double neww = oldw / (bucketw - oldw) * bucketw;
>>
>> I don't get why we need "* bucketw" at the end ?
>
> It's just to keep the values within a reasonable range so that we don't
> lose precision by dropping down into small integers.
>
> I futzed around with this some more last week trying to get the third
> replica to work and ended up doubting that this piece is correct. The
> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
> slightly from what I would expect from first principles and what I get out
> of this derivation by about 1%, which would explain the bias I was seeing.
>
> I'm hoping we can find someone with a strong stats/probability background
> and loads of free time who can tackle this...
>
I'm *not* that person, but I gave it a go last weekend and realized a
few things:
1. We should add the additional constraint that for all PGs assigned
to an OSD, 1/N of them must be primary replicas, 1/N must be
secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
a 3 replica pool, the "small" OSD should still have the property that
1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
2. I believe this is a case of the balls-into-bins problem -- we have
colored balls and weighted bins. I didn't find a definition of the
problem where the goal is to allow users to specify weights which must
be respected after N rounds.
3. I wrote some quick python to simulate different reweighting
algorithms. The solution is definitely not obvious - I often thought
I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
changing the OSD weights to e.g. 3, 3, 1, 1 completely broke things.
I can clean up and share that python if it can help.
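Dan's first constraint can be checked directly. A quick sketch (my own, loosely in the spirit of his gist; names are invented) simulates naive draw-until-distinct selection and counts, per OSD, how many placements land in each replica position; the small OSD ends up heavily over-represented in the tertiary slot:

```python
import random

def place(weights, n_rep, rng):
    """Naive CRUSH-like selection: draw by weight, retry on collision."""
    chosen = []
    while len(chosen) < n_rep:
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        if pick not in chosen:
            chosen.append(pick)
    return chosen

def position_counts(weights, n_rep=3, pgs=100_000, seed=1):
    """counts[osd][pos]: placements of each OSD per replica position."""
    rng = random.Random(seed)
    counts = [[0] * n_rep for _ in weights]
    for _ in range(pgs):
        for pos, osd in enumerate(place(weights, n_rep, rng)):
            counts[osd][pos] += 1
    return counts

counts = position_counts([3, 3, 3, 1])
# the small OSD (index 3) violates the 1/N-per-position constraint:
# it collects far more tertiary placements than primary ones
```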
My gut feeling is that because CRUSH trees and rulesets can be
arbitrarily complex, the most pragmatic & reliable way to solve this
problem is to balance the PGs with a reweight-by-pg loop at crush
compilation time. This is what admins should do now -- we should just
automate it.
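The reweight-by-pg idea can be prototyped as a feedback loop (again my own sketch, not Dan's code): simulate placement, compare each OSD's observed share of placements to its target share, scale the input weights by the ratio, and repeat:

```python
import random

def simulate_totals(weights, n_rep=3, pgs=20_000, seed=7):
    """Total placements per OSD under draw-until-distinct selection."""
    rng = random.Random(seed)
    totals = [0] * len(weights)
    for _ in range(pgs):
        chosen = []
        while len(chosen) < n_rep:
            pick = rng.choices(range(len(weights)), weights=weights)[0]
            if pick not in chosen:
                chosen.append(pick)
        for osd in chosen:
            totals[osd] += 1
    return totals

def reweight_by_pg(target_weights, n_rep=3, rounds=10):
    """Feedback loop: scale each input weight by target/observed share."""
    weights = list(target_weights)
    total = sum(target_weights)
    target_share = [w / total for w in target_weights]
    for _ in range(rounds):
        totals = simulate_totals(weights, n_rep)
        placed = sum(totals)
        for i, t in enumerate(totals):
            weights[i] *= target_share[i] / (t / placed)
    return weights

tuned = reweight_by_pg([3, 3, 3, 1])
totals = simulate_totals(tuned)
shares = [t / sum(totals) for t in totals]
# shares now sit close to the target 0.3, 0.3, 0.3, 0.1
```

The loop converges because the observed share is monotonic in the input weight; it sidesteps the closed-form problem entirely, which is exactly Dan's point about pragmatism.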
Cheers, Dan
P.S. -- maybe these guys can help: http://math.stackexchange.com/
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-03 15:26 ` Dan van der Ster
@ 2017-02-03 17:37 ` Dan van der Ster
2017-02-06 8:31 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Dan van der Ster @ 2017-02-03 17:37 UTC (permalink / raw)
To: Sage Weil; +Cc: Loic Dachary, ceph-devel
Anyway, here's my simple simulation. It might be helpful for testing
ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
Below is the output using the P(pick small | first pick not small)
observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
seems to *almost* work, but only when we have just one small OSD.
See the end of the script for other various ideas.
-- Dan
> python mpa.py
OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
Expected PGs per OSD: {0: 90000, 1: 90000, 2: 90000, 3: 30000}
Simulating with existing CRUSH
Observed: {0: 85944, 1: 85810, 2: 85984, 3: 42262}
Observed for Nth replica: [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
{0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
26882, 3: 19455}]
Now trying your new algorithm
Observed: {0: 89423, 1: 89443, 2: 89476, 3: 31658}
Observed for Nth replica: [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
{0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
29875, 3: 11394}]
On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>> > Hi Sage,
>>> >
>>> > Still trying to understand what you did :-) I have one question below.
>>> >
>>> > On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> >> This is a longstanding bug,
>>> >>
>>> >> http://tracker.ceph.com/issues/15653
>>> >>
>>> >> that causes low-weighted devices to get more data than they should. Loic's
>>> >> recent activity resurrected discussion on the original PR
>>> >>
>>> >> https://github.com/ceph/ceph/pull/10218
>>> >>
>>> >> but since it's closed and almost nobody will see it I'm moving the
>>> >> discussion here.
>>> >>
>>> >> The main news is that I have a simple adjustment for the weights that
>>> >> works (almost perfectly) for the 2nd round of placements. The solution is
>>> >> pretty simple, although as with most probabilities it tends to make my
>>> >> brain hurt.
>>> >>
>>> >> The idea is that, on the second round, the original weight for the small
>>> >> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>> >> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>> >> given b) is P(a && b) / P(b),
>>> >
>>> > For the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>> >
>>> >>
>>> >> P(pick small | first pick not small)
>>> >> = P(pick small && first pick not small) / P(first pick not small)
>>> >>
>>> >> The last term is easy to calculate,
>>> >>
>>> >> P(first pick not small) = (total_weight - small_weight) / total_weight
>>> >>
>>> >> and the && term is the distribution we're trying to produce.
>>> >
>>> > https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>> >
>>> >> For example,
>>> >> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>> >> their second replica be the small OSD. So
>>> >>
>>> >> P(pick small && first pick not small) = small_weight / total_weight
>>> >>
>>> >> Putting those together,
>>> >>
>>> >> P(pick small | first pick not small)
>>> >> = P(pick small && first pick not small) / P(first pick not small)
>>> >> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>> >> = small_weight / (total_weight - small_weight)
>>> >>
>>> >> That is, on the second round, we should adjust the weights by the above so
>>> >> that we get the right distribution of second choices. It turns out it
>>> >> works to adjust *all* weights like this to get the conditional probability
>>> >> that they weren't already chosen.
>>> >>
>>> >> I have a branch that hacks this into straw2 and it appears to work
>>> >
>>> > This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>
>>> In
>>>
>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>
>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>
>>> I don't get why we need "* bucketw" at the end ?
>>
>> It's just to keep the values within a reasonable range so that we don't
>> lose precision by dropping down into small integers.
>>
>> I futzed around with this some more last week trying to get the third
>> replica to work and ended up doubting that this piece is correct. The
>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>> slightly from what I would expect from first principles and what I get out
>> of this derivation by about 1%, which would explain the bias I was seeing.
>>
>> I'm hoping we can find someone with a strong stats/probability background
>> and loads of free time who can tackle this...
>>
>
> I'm *not* that person, but I gave it a go last weekend and realized a
> few things:
>
> 1. We should add the additional constraint that for all PGs assigned
> to an OSD, 1/N of them must be primary replicas, 1/N must be
> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
> a 3 replica pool, the "small" OSD should still have the property that
> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>
> 2. I believe this is a case of the balls-into-bins problem -- we have
> colored balls and weighted bins. I didn't find a definition of the
> problem where the goal is to allow users to specify weights which must
> be respected after N rounds.
>
> 3. I wrote some quick python to simulate different reweighting
> algorithms. The solution is definitely not obvious - I often thought
> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
> changing the OSD weights to e.g. 3, 3, 1, 1 completely broke things.
> I can clean up and share that python if it can help.
>
> My gut feeling is that because CRUSH trees and rulesets can be
> arbitrarily complex, the most pragmatic & reliable way to solve this
> problem is to balance the PGs with a reweight-by-pg loop at crush
> compilation time. This is what admins should do now -- we should just
> automate it.
>
> Cheers, Dan
>
> P.S. -- maybe these guys can help: http://math.stackexchange.com/
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-03 15:08 ` Loic Dachary
@ 2017-02-03 18:54 ` Loic Dachary
2017-02-06 3:08 ` Jaze Lee
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-03 18:54 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 02/03/2017 04:08 PM, Loic Dachary wrote:
>
>
> On 02/03/2017 03:47 PM, Sage Weil wrote:
>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> Still trying to understand what you did :-) I have one question below.
>>>>
>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>> This is a longstanding bug,
>>>>>
>>>>> http://tracker.ceph.com/issues/15653
>>>>>
>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>> recent activity resurrected discussion on the original PR
>>>>>
>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>
>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>> discussion here.
>>>>>
>>>>> The main news is that I have a simple adjustment for the weights that
>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>> brain hurt.
>>>>>
>>>>> The idea is that, on the second round, the original weight for the small
>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>> given b) is P(a && b) / P(b),
>>>>
>>>> For the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>
>>>>>
>>>>> P(pick small | first pick not small)
>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>
>>>>> The last term is easy to calculate,
>>>>>
>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>
>>>>> and the && term is the distribution we're trying to produce.
>>>>
>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>
>>>>> For example,
>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>> their second replica be the small OSD. So
>>>>>
>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>
>>>>> Putting those together,
>>>>>
>>>>> P(pick small | first pick not small)
>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>> = small_weight / (total_weight - small_weight)
>>>>>
>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>> that we get the right distribution of second choices. It turns out it
>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>> that they weren't already chosen.
>>>>>
>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>
>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>
>>> In
>>>
>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>
>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>
>>> I don't get why we need "* bucketw" at the end ?
>>
>> It's just to keep the values within a reasonable range so that we don't
>> lose precision by dropping down into small integers.
>>
>> I futzed around with this some more last week trying to get the third
>> replica to work and ended up doubting that this piece is correct. The
>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>> slightly from what I would expect from first principles and what I get out
>> of this derivation by about 1%, which would explain the bias I was seeing.
>>
>> I'm hoping we can find someone with a strong stats/probability background
>> and loads of free time who can tackle this...
>>
>
> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
We have D bins and each bin BIN can hold BIN(B) balls. All balls have the
same size. There are exactly X balls of each color. Each ball must
be placed in a bin that does not already contain a ball of the same
color.
What distribution guarantees that, for all X, the bins are filled in
the same proportion ?
Details
=======
* One placement: all balls are the same color and we place each of them
in a bin with a probability of:
P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
so that bins are equally filled regardless of their capacity.
* Two placements: for each ball there is exactly one other ball of the
same color. A ball is placed as in experiment 1 and the chosen bin
is set aside. The other ball of the same color is placed as in
experiment 1 with the remaining bins. The probability for a ball
to be placed in a given BIN is:
P(BIN) + P(all bins but BIN | BIN)
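The two-placement procedure can be sketched directly (my own Python; names are invented). It places the first ball by capacity, sets that bin aside, places the paired ball among the remaining bins, and reports how full each bin ends up:

```python
import random

# capacities from the example below: four big bins and one small one
CAPACITY = {"a": 10_000_000, "b": 10_000_000, "c": 10_000_000,
            "d": 10_000_000, "e": 1_000_000}

def place_pair(rng):
    """Two balls of one color: the second may not reuse the first bin."""
    bins = list(CAPACITY)
    first = rng.choices(bins, weights=[CAPACITY[b] for b in bins])[0]
    rest = [b for b in bins if b != first]
    second = rng.choices(rest, weights=[CAPACITY[b] for b in rest])[0]
    return first, second

rng = random.Random(0)
balls = {b: 0 for b in CAPACITY}
for _ in range(100_000):
    for b in place_pair(rng):
        balls[b] += 1
percent_full = {b: 100 * balls[b] / CAPACITY[b] for b in CAPACITY}
# bin e ends up proportionally fuller than a..d, which is the anomaly
```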
Examples
========
For instance we have 5 bins, a, b, c, d, e and they can hold:
a = 10 million balls
b = 10 million balls
c = 10 million balls
d = 10 million balls
e = 1 million balls
In the first experiment we place each ball in
a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
same for b, c, d
e with a probability of 1 / 41
after 100,000 placements, the bins have
a = 243456
b = 243624
c = 244486
d = 243881
e = 24553
they are
a = 2.43 % full
b = 2.43 % full
c = 2.44 % full
d = 2.43 % full
e = 0.24 % full
In the second experiment
>> sage
>>
>>
>>>
>>>>
>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>> current code, you get
>>>>>
>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>> device 0: 19765965 [9899364,9866601]
>>>>> device 1: 19768033 [9899444,9868589]
>>>>> device 2: 19769938 [9901770,9868168]
>>>>> device 3: 19766918 [9898851,9868067]
>>>>> device 6: 929148 [400572,528576]
>>>>>
>>>>> which is very close for the first replica (primary), but way off for the
>>>>> second. With my hacky change,
>>>>>
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>> device 0: 19797315 [9899364,9897951]
>>>>> device 1: 19799199 [9899444,9899755]
>>>>> device 2: 19801016 [9901770,9899246]
>>>>> device 3: 19797906 [9898851,9899055]
>>>>> device 6: 804566 [400572,403994]
>>>>>
>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>> 1%).
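For scale (my arithmetic, not from the thread): the small device's target is 4/400 of all 2 × 40000001 placements, about 800000, so the two runs above are roughly 16% and 0.6% over target respectively:

```python
weights = [99, 99, 99, 99, 4]
pgs, n_rep = 40_000_001, 2
total = sum(weights)

# target placement counts per device (primary + second replica)
expected = [w / total * pgs * n_rep for w in weights]

# Sage's totals for device 6, before and after the weight adjustment
before, after = 929148, 804566
skew_before = before / expected[-1] - 1   # about +0.16
skew_after = after / expected[-1] - 1     # about +0.006
```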
>>>>>
>>>>> Next steps:
>>>>>
>>>>> 1- generalize this for >2 replicas
>>>>> 2- figure out why it skews high
>>>>> 3- make this work for multi-level hierarchical descent
>>>>>
>>>>> sage
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-03 18:54 ` Loic Dachary
@ 2017-02-06 3:08 ` Jaze Lee
2017-02-06 8:18 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Jaze Lee @ 2017-02-06 3:08 UTC (permalink / raw)
To: Loic Dachary; +Cc: Sage Weil, ceph-devel
It is more complicated than I expected...
I viewed http://tracker.ceph.com/issues/15653 and learned that if the
replica count is bigger than the number of hosts we choose from,
we may hit the problem.
That is
if we have
host: a b c d
host: e f g h
host: i j k l
we choose only one OSD from each host for three replicas, and the
distribution is as we expect, right?
The problem described in http://tracker.ceph.com/issues/15653, may happen
when
1)
host: a b c d e f g
and we choose all three replicas from this host. But this rarely happens
in production, right?
Maybe I do not understand the problem correctly?
2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>
>
> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>
>>
>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>> For the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce.
>>>>>
>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>
>>>>>> For example,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD. So
>>>>>>
>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>
>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>
>>>> In
>>>>
>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>
>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>
>>>> I don't get why we need "* bucketw" at the end ?
>>>
>>> It's just to keep the values within a reasonable range so that we don't
>>> lose precision by dropping down into small integers.
>>>
>>> I futzed around with this some more last week trying to get the third
>>> replica to work and ended up doubting that this piece is correct. The
>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>> slightly from what I would expect from first principles and what I get out
>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>
>>> I'm hoping we can find someone with a strong stats/probability background
>>> and loads of free time who can tackle this...
>>>
>>
>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>
> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>
> We have D bins and each bin BIN can hold BIN(B) balls. All balls have the
> same size. There are exactly X balls of each color. Each ball must
> be placed in a bin that does not already contain a ball of the same
> color.
>
> What distribution guarantees that, for all X, the bins are filled in
> the same proportion ?
>
> Details
> =======
>
> * One placement: all balls are the same color and we place each of them
> in a bin with a probability of:
>
> P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>
> so that bins are equally filled regardless of their capacity.
>
> * Two placements: for each ball there is exactly one other ball of the
> same color. A ball is placed as in experiment 1 and the chosen bin
> is set aside. The other ball of the same color is placed as in
> experiment 1 with the remaining bins. The probability for a ball
> to be placed in a given BIN is:
>
> P(BIN) + P(all bins but BIN | BIN)
>
> Examples
> ========
>
> For instance we have 5 bins, a, b, c, d, e and they can hold:
>
> a = 10 million balls
> b = 10 million balls
> c = 10 million balls
> d = 10 million balls
> e = 1 million balls
>
> In the first experiment we place each ball in
>
> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
> same for b, c, d
> e with a probability of 1 / 41
>
> after 100,000 placements, the bins have
>
> a = 243456
> b = 243624
> c = 244486
> d = 243881
> e = 24553
>
> they are
>
> a = 2.43 % full
> b = 2.43 % full
> c = 2.44 % full
> d = 2.43 % full
> e = 0.24 % full
>
> In the second experiment
>
>
>>> sage
>>>
>>>
>>>>
>>>>>
>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>> current code, you get
>>>>>>
>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>> device 6: 929148 [400572,528576]
>>>>>>
>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>> second. With my hacky change,
>>>>>>
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>> device 6: 804566 [400572,403994]
>>>>>>
>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>> 1%).
>>>>>>
>>>>>> Next steps:
>>>>>>
>>>>>> 1- generalize this for >2 replicas
>>>>>> 2- figure out why it skews high
>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
--
谦谦君子
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-06 3:08 ` Jaze Lee
@ 2017-02-06 8:18 ` Loic Dachary
2017-02-06 14:11 ` Jaze Lee
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-06 8:18 UTC (permalink / raw)
To: Jaze Lee; +Cc: ceph-devel
Hi,
On 02/06/2017 04:08 AM, Jaze Lee wrote:
> It is more complicated than I expected...
> I viewed http://tracker.ceph.com/issues/15653 and learned that if the
> replica count is bigger than the number of hosts we choose from,
> we may hit the problem.
>
> That is
> if we have
> host: a b c d
> host: e f g h
> host: i j k l
>
> we choose only one OSD from each host for three replicas, and the
> distribution is as we expect, right?
>
>
> The problem described in http://tracker.ceph.com/issues/15653, may happen
> when
> 1)
> host: a b c d e f g
>
> and we choose all three replicas from this host. But this rarely happens
> in production, right?
>
>
> Maybe I do not understand the problem correctly?
The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script
https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
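For the single-host case this can even be computed exactly (my own sketch, not Dan's script). Draw-until-distinct selection is equivalent to sequential weighted sampling without replacement, so each disk's inclusion probability is a sum over ordered pick sequences; with one small disk on the host, that disk gets noticeably more than its weight share:

```python
from itertools import permutations

def inclusion_probability(weights, device, n_rep):
    """Exact P(device is among n_rep distinct weighted picks)."""
    total = sum(weights)
    others = [i for i in range(len(weights)) if i != device]
    p_excluded = 0.0
    # sum P(sequence) over every ordered way to pick n_rep other devices
    for seq in permutations(others, n_rep):
        p, remaining = 1.0, total
        for i in seq:
            p *= weights[i] / remaining
            remaining -= weights[i]
        p_excluded += p
    return 1.0 - p_excluded

weights = [3, 3, 3, 3, 1]   # one host: four big disks, one small
p = inclusion_probability(weights, device=4, n_rep=3)
share = p / 3               # fraction of all placements on the small disk
# weight share is 1/13 ~ 0.077, but share comes out near 0.096:
# roughly 25% too much data on the small disk
```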
Cheers
>
>
>
>
>
>
>
>
>
>
> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>
>>
>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>
>>>
>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>
>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>> This is a longstanding bug,
>>>>>>>
>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>
>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>
>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>
>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>> discussion here.
>>>>>>>
>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>> brain hurt.
>>>>>>>
>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>> For the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>
>>>>>>>
>>>>>>> P(pick small | first pick not small)
>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>
>>>>>>> The last term is easy to calculate,
>>>>>>>
>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>
>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>
>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>>
>>>>>>> For example,
>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>> their second replica be the small OSD. So
>>>>>>>
>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>
>>>>>>> Putting those together,
>>>>>>>
>>>>>>> P(pick small | first pick not small)
>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>
>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>> that they weren't already chosen.
>>>>>>>
>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>
>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>
>>>>> In
>>>>>
>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>
>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>
>>>>> I don't get why we need "* bucketw" at the end ?
>>>>
>>>> It's just to keep the values within a reasonable range so that we don't
>>>> lose precision by dropping down into small integers.
>>>>
>>>> I futzed around with this some more last week trying to get the third
>>>> replica to work and ended up doubting that this piece is correct. The
>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>> slightly from what I would expect from first principles and what I get out
>>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>>
>>>> I'm hoping we can find someone with a strong stats/probability background
>>>> and loads of free time who can tackle this...
>>>>
>>>
>>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>
>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>>
>> We have D bins and each bin BIN can hold BIN(B) balls. All balls have the
>> same size. There are exactly X balls of each color. Each ball must
>> be placed in a bin that does not already contain a ball of the same
>> color.
>>
>> What distribution guarantees that, for all X, the bins are filled in
>> the same proportion?
>>
>> Details
>> =======
>>
>> * One placement: all balls are the same color and we place each of them
>> in a bin with a probability of:
>>
>> P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>
>> so that bins are equally filled regardless of their capacity.
>>
>> * Two placements: for each ball there is exactly one other ball of the
>> same color. A ball is placed as in experiment 1 and the chosen bin
>> is set aside. The other ball of the same color is placed as in
>> experiment 1 with the remaining bins. The probability for a ball
>> to be placed in a given BIN is:
>>
>> P(BIN) + P(all bins but BIN | BIN)
>>
>> Examples
>> ========
>>
>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>
>> a = 10 million balls
>> b = 10 million balls
>> c = 10 million balls
>> d = 10 million balls
>> e = 1 million balls
>>
>> In the first experiment we place each ball in
>>
>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>> same for b, c, d
>> e with a probability of 1 / 41
>>
>> after 1,000,000 placements, the bins have
>>
>> a = 243456
>> b = 243624
>> c = 244486
>> d = 243881
>> e = 24553
>>
>> they are
>>
>> a = 2.43 % full
>> b = 2.43 % full
>> c = 2.44 % full
>> d = 2.43 % full
>> e = 0.24 % full
>>
>> In the second experiment
>>
>>
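The one-placement experiment above is easy to reproduce. The following sketch (illustrative, not Loic's original script) draws 1,000,000 balls with P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D]) and reports the fill ratios:

```python
import random
from collections import Counter

# Bin capacities, in millions of balls, from the example above.
capacity = {'a': 10, 'b': 10, 'c': 10, 'd': 10, 'e': 1}

random.seed(0)
# Each ball goes to a bin with probability proportional to its capacity.
picks = random.choices(list(capacity), weights=list(capacity.values()),
                       k=1_000_000)
counts = Counter(picks)

for name, cap in capacity.items():
    fill = counts[name] / (cap * 1_000_000) * 100
    print(f"{name} = {counts[name]} balls, {fill:.2f} % full")
```

The fill percentages come out near 2.44 % for a..d and 0.24 % for e, matching the numbers quoted above.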
>>>> sage
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>> current code, you get
>>>>>>>
>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>
>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>> second. With my hacky change,
>>>>>>>
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>
>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>> 1%).
>>>>>>>
>>>>>>> Next steps:
>>>>>>>
>>>>>>> 1- generalize this for >2 replicas
>>>>>>> 2- figure out why it skews high
>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-03 17:37 ` Dan van der Ster
@ 2017-02-06 8:31 ` Loic Dachary
2017-02-06 9:13 ` Dan van der Ster
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-06 8:31 UTC (permalink / raw)
To: Dan van der Ster; +Cc: ceph-devel, Szymon Datko, Tomasz Kuzemko
Hi Dan,
Your script turns out to be a nice self contained problem statement :-) Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the way Szymon described how to calculate P(E|A) using a probability tree (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
Cheers
On 02/03/2017 06:37 PM, Dan van der Ster wrote:
> Anyway, here's my simple simulation. It might be helpful for testing
> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>
> Below is the output using the P(pick small | first pick not small)
> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
> seems to *almost* work, but only when we have just one small OSD.
>
> See the end of the script for other various ideas.
>
> -- Dan
>
>> python mpa.py
> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
>
> Expected PGs per OSD: {0: 90000, 1: 90000, 2: 90000, 3: 30000}
>
> Simulating with existing CRUSH
>
> Observed: {0: 85944, 1: 85810, 2: 85984, 3: 42262}
> Observed for Nth replica: [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
> 26882, 3: 19455}]
>
> Now trying your new algorithm
>
> Observed: {0: 89423, 1: 89443, 2: 89476, 3: 31658}
> Observed for Nth replica: [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
> 29875, 3: 11394}]
>
>
> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce.
>>>>>
>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>
>>>>>> For example,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD. So
>>>>>>
>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>
>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>
>>>> In
>>>>
>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>
>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>
>>>> I don't get why we need "* bucketw" at the end ?
>>>
>>> It's just to keep the values within a reasonable range so that we don't
>>> lose precision by dropping down into small integers.
>>>
>>> I futzed around with this some more last week trying to get the third
>>> replica to work and ended up doubting that this piece is correct. The
>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>> slightly from what I would expect from first principles and what I get out
>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>
>>> I'm hoping we can find someone with a strong stats/probability background
>>> and loads of free time who can tackle this...
>>>
>>
>> I'm *not* that person, but I gave it a go last weekend and realized a
>> few things:
>>
>> 1. We should add the additional constraint that for all PGs assigned
>> to an OSD, 1/N of them must be primary replicas, 1/N must be
>> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
>> a 3 replica pool, the "small" OSD should still have the property that
>> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>>
>> 2. I believe this is a case of the balls-into-bins problem -- we have
>> colored balls and weighted bins. I didn't find a definition of the
>> problem where the goal is to allow users to specify weights which must
>> be respected after N rounds.
>>
>> 3. I wrote some quick python to simulate different reweighting
>> algorithms. The solution is definitely not obvious - I often thought
>> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
>> changing the OSDs weights to e.g. 3, 3, 1, 1 completely broke things.
>> I can clean up and share that python if it can help.
>>
>> My gut feeling is that because CRUSH trees and rulesets can be
>> arbitrarily complex, the most pragmatic & reliable way to solve this
>> problem is to balance the PGs with a reweight-by-pg loop at crush
>> compilation time. This is what admins should do now -- we should just
>> automate it.
>>
>> Cheers, Dan
>>
>> P.S. -- maybe these guys can help: http://math.stackexchange.com/
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-06 8:31 ` Loic Dachary
@ 2017-02-06 9:13 ` Dan van der Ster
2017-02-06 16:53 ` Dan van der Ster
0 siblings, 1 reply; 70+ messages in thread
From: Dan van der Ster @ 2017-02-06 9:13 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel, Szymon Datko, Tomasz Kuzemko
Hi Loic,
Here's my current understanding of the problem. (Below I work with the
example having four OSDs with weights 3, 3, 3, 1, respectively).
I'm elaborating on the observation that for every replication "round",
the PG ratios for each and every OSD must be equal to the "target" or
goal weight of that OSD. So, for an OSD that should get 10% of PGs,
that OSD gets 10% in round 1, 10% in round 2, etc... But we need to
multiply each of these ratios by the probability that this OSD is
still available in Round r.
Hence I believe we have this loop invariant:
P(OSD.x still available in Round r) * (Weight of OSD.x in Round r)
/ (Total sum of all weights in Round r) == (Original "target" Weight
of OSD.x) / (Total sum of all target weights)
I simplify all these terms:
P(OSD.x still available for Round r) = P_x_r
Weight of OSD.x in Round r = W_x_r
Total sum of all weights in Round r = T_r
Original "target" Weight of OSD.x = W_x
Total sum of all target weights = T
So rewriting the equation, we have:
P_x_r * W_x_r / T_r == W_x / T
We then calculate the needed weight of OSD.x in Round r. W_x_r is what
we're trying to solve for!!
W_x_r = W_x / T * T_r / P_x_r
The first term W_x / T is a constant and easy to compute. (For my
example small OSD, W_x / T = 0.1)
P_x_r is also -- I believe -- simple to compute. P_x_r gets smaller
for each round and is a function of what happened in the previous
round:
Round 1: P_x_1 = 1.0
Round 2: P_x_2 = P_x_1 * (1 - W_x_1 / T_1)
Round 3: P_x_3 = P_x_2 * (1 - W_x_2 / T_2)
...
But T_r is a challenge -- T_r is the sum of W_x_r for all x in round
r. Hence, the problem is that we don't know T_r until *after* we
compute all W_x_r's for that round. I tried various ways to estimate
T_r but didn't make any progress.
Do you think this formulation is correct? Any clever ideas where to go next?
Cheers, Dan
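Dan's recursion can be explored numerically. The sketch below (an editor's illustration, not code from the thread) sidesteps the unknown T_r by assuming that only the ratios W_x_r / T_r matter to the picker, so each round's weights can be renormalized so that T_r = 1; whether that assumption is valid is exactly the open question:

```python
# Targets for the example OSD set with weights 3, 3, 3, 1.
target = [3.0, 3.0, 3.0, 1.0]
T = sum(target)
frac = [w / T for w in target]        # W_x / T, constant per OSD

rounds = []
P = [1.0] * len(target)               # P_x_1 = 1.0 for every OSD
w = frac[:]                           # round-1 weights, normalized so T_1 = 1
for r in range(3):
    rounds.append((list(w), list(P)))
    # P_x_(r+1) = P_x_r * (1 - W_x_r / T_r); here T_r = 1 by construction.
    P = [p * (1 - wx) for p, wx in zip(P, w)]
    # W_x_(r+1) = (W_x / T) * T_(r+1) / P_x_(r+1); renormalize so T_(r+1) = 1.
    w = [f / p for f, p in zip(frac, P)]
    s = sum(w)
    w = [x / s for x in w]

for r, (wr, Pr) in enumerate(rounds, start=1):
    print(f"round {r}: w = {[round(x, 4) for x in wr]}, "
          f"P = {[round(p, 4) for p in Pr]}")
```

With the 3, 3, 3, 1 targets this gives the small OSD a round-2 weight share of about 0.0795 instead of its target 0.1, which shows how much the availability term P_x_r bends the weights.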
On Mon, Feb 6, 2017 at 9:31 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi Dan,
>
> Your script turns out to be a nice self contained problem statement :-) Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the way Szymon described how to calculate P(E|A) using a probability tree (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
>
> Cheers
>
> On 02/03/2017 06:37 PM, Dan van der Ster wrote:
>> Anyway, here's my simple simulation. It might be helpful for testing
>> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>
>> Below is the output using the P(pick small | first pick not small)
>> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
>> seems to *almost* work, but only when we have just one small OSD.
>>
>> See the end of the script for other various ideas.
>>
>> -- Dan
>>
>>> python mpa.py
>> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
>>
>> Expected PGs per OSD: {0: 90000, 1: 90000, 2: 90000, 3: 30000}
>>
>> Simulating with existing CRUSH
>>
>> Observed: {0: 85944, 1: 85810, 2: 85984, 3: 42262}
>> Observed for Nth replica: [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
>> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
>> 26882, 3: 19455}]
>>
>> Now trying your new algorithm
>>
>> Observed: {0: 89423, 1: 89443, 2: 89476, 3: 31658}
>> Observed for Nth replica: [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
>> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
>> 29875, 3: 11394}]
>>
>>
>> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
>>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>
>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>> This is a longstanding bug,
>>>>>>>
>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>
>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>
>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>
>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>> discussion here.
>>>>>>>
>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>> brain hurt.
>>>>>>>
>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>
>>>>>>>
>>>>>>> P(pick small | first pick not small)
>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>
>>>>>>> The last term is easy to calculate,
>>>>>>>
>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>
>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>
>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>>
>>>>>>> For example,
>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>> their second replica be the small OSD. So
>>>>>>>
>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>
>>>>>>> Putting those together,
>>>>>>>
>>>>>>> P(pick small | first pick not small)
>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>
>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>> that they weren't already chosen.
>>>>>>>
>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>
>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>
>>>>> In
>>>>>
>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>
>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>
>>>>> I don't get why we need "* bucketw" at the end ?
>>>>
>>>> It's just to keep the values within a reasonable range so that we don't
>>>> lose precision by dropping down into small integers.
>>>>
>>>> I futzed around with this some more last week trying to get the third
>>>> replica to work and ended up doubting that this piece is correct. The
>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>> slightly from what I would expect from first principles and what I get out
>>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>>
>>>> I'm hoping we can find someone with a strong stats/probability background
>>>> and loads of free time who can tackle this...
>>>>
>>>
>>> I'm *not* that person, but I gave it a go last weekend and realized a
>>> few things:
>>>
>>> 1. We should add the additional constraint that for all PGs assigned
>>> to an OSD, 1/N of them must be primary replicas, 1/N must be
>>> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
>>> a 3 replica pool, the "small" OSD should still have the property that
>>> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>>>
>>> 2. I believe this is a case of the balls-into-bins problem -- we have
>>> colored balls and weighted bins. I didn't find a definition of the
>>> problem where the goal is to allow users to specify weights which must
>>> be respected after N rounds.
>>>
>>> 3. I wrote some quick python to simulate different reweighting
>>> algorithms. The solution is definitely not obvious - I often thought
>>> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
>>> changing the OSDs weights to e.g. 3, 3, 1, 1 completely broke things.
>>> I can clean up and share that python if it can help.
>>>
>>> My gut feeling is that because CRUSH trees and rulesets can be
>>> arbitrarily complex, the most pragmatic & reliable way to solve this
>>> problem is to balance the PGs with a reweight-by-pg loop at crush
>>> compilation time. This is what admins should do now -- we should just
>>> automate it.
>>>
>>> Cheers, Dan
>>>
>>> P.S. -- maybe these guys can help: http://math.stackexchange.com/
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
* Re: crush multipick anomaly
2017-02-06 8:18 ` Loic Dachary
@ 2017-02-06 14:11 ` Jaze Lee
2017-02-06 17:07 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Jaze Lee @ 2017-02-06 14:11 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
2017-02-06 16:18 GMT+08:00 Loic Dachary <loic@dachary.org>:
> Hi,
>
> On 02/06/2017 04:08 AM, Jaze Lee wrote:
>> It is more complicated than I expected...
>> I viewed http://tracker.ceph.com/issues/15653, and know that if the
>> replica number is
>> bigger than the number of hosts we choose from, we may hit the problem.
>>
>> That is
>> if we have
>> host: a b c d
>> host: e f g h
>> host: i j k l
>>
>> we only choose one from each host for three replicas, and the distribution
>> is as we expect, right?
>>
>>
>> The problem described in http://tracker.ceph.com/issues/15653 may happen
>> when
>> 1)
>> host: a b c d e f g
>>
>> and we choose all three replicas from this host. But this rarely happens
>> in production, right?
>>
>>
>> Maybe I do not understand the problem correctly?
>
> The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script
Yes, I mean why would we choose three from one host? In production the host
number is ALWAYS
greater than the replica number...
root
rack-0
host A
host B
rack-1
host C
host D
rack -2
host E
host F
When choosing OSDs for pg 1.1, it will always choose one from rack-0, one
from rack-1, and one from rack-2.
Any pg will have one OSD chosen from each of rack-0, rack-1 and rack-2.
The problem happens when we want to choose more than one OSD from
a bucket for a pg, right?
>
> https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>
> Cheers
>
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>>
>>>
>>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>>>
>>>>>>>> For example,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD. So
>>>>>>>>
>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>
>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>
>>>>>> In
>>>>>>
>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>
>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>
>>>>>> I don't get why we need "* bucketw" at the end ?
>>>>>
>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>> lose precision by dropping down into small integers.
>>>>>
>>>>> I futzed around with this some more last week trying to get the third
>>>>> replica to work and ended up doubting that this piece is correct. The
>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>> slightly from what I would expect from first principles and what I get out
>>>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>>>
>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>> and loads of free time who can tackle this...
>>>>>
>>>>
>>>> It would help to formulate the problem into a self-contained puzzle to present to a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>>
>>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>>>
>>> We have D bins and each bin BINi can hold BINi(B) balls. All balls have the
>>> same size. There are exactly X balls of each color. Each ball must
>>> be placed in a bin that does not already contain a ball of the same
>>> color.
>>>
>>> What distribution guarantees that, for all X, the bins are filled in
>>> the same proportion?
>>>
>>> Details
>>> =======
>>>
>>> * One placement: all balls are the same color and we place each of them
>>> in a bin with a probability of:
>>>
>>> P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>>
>>> so that bins are equally filled regardless of their capacity.
>>>
>>> * Two placements: for each ball there is exactly one other ball of the
>>> same color. A ball is placed as in experiment 1 and the chosen bin
>>> is set aside. The other ball of the same color is placed as in
>>> experiment 1 with the remaining bins. The probability for a ball
>>> to be placed in a given BIN is:
>>>
>>> P(BIN) + P(all bins but BIN | BIN)
>>>
>>> Examples
>>> ========
>>>
>>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>>
>>> a = 10 million balls
>>> b = 10 million balls
>>> c = 10 million balls
>>> d = 10 million balls
>>> e = 1 million balls
>>>
>>> In the first experiment we place each ball in
>>>
>>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>>> same for b, c, d
>>> e with a probability of 1 / 41
>>>
>>> after 1,000,000 placements, the bins have
>>>
>>> a = 243456
>>> b = 243624
>>> c = 244486
>>> d = 243881
>>> e = 24553
>>>
>>> they are
>>>
>>> a = 2.43 % full
>>> b = 2.43 % full
>>> c = 2.44 % full
>>> d = 2.43 % full
>>> e = 0.24 % full
>>>
>>> In the second experiment
>>>
>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>> current code, you get
>>>>>>>>
>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>
>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>> second. With my hacky change,
>>>>>>>>
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>
>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>> 1%).
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>
>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>> 2- figure out why it skews high
>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
--
谦谦君子
* Re: crush multipick anomaly
2017-02-06 9:13 ` Dan van der Ster
@ 2017-02-06 16:53 ` Dan van der Ster
0 siblings, 0 replies; 70+ messages in thread
From: Dan van der Ster @ 2017-02-06 16:53 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel, Szymon Datko, Tomasz Kuzemko
On Mon, Feb 6, 2017 at 10:13 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Loic,
>
> Here's my current understanding of the problem. (Below I work with the
> example having four OSDs with weights 3, 3, 3, 1, respectively).
>
> I'm elaborating on the observation that for every replication "round",
> the PG ratios for each and every OSD must be equal to the "target" or
> goal weight of that OSD. So, for an OSD that should get 10% of PGs,
> that OSD gets 10% in round 1, 10% in round 2, etc... But we need to
> multiply each of these ratios by the probability that this OSD is
> still available in Round r.
>
> Hence I believe we have this loop invariant:
>
> P(OSD.x still available in Round r) * (Weight of OSD.x in Round r)
> / (Total sum of all weights in Round r) == (Original "target" Weight
> of OSD.x) / (Total sum of all target weights)
>
> I simplify all these terms:
> P(OSD.x still available for Round r) = P_x_r
> Weight of OSD.x in Round r = W_x_r
> Total sum of all weights in Round r = T_r
> Original "target" Weight of OSD.x = W_x
> Total sum of all target weights = T
>
> So rewriting the equation, we have:
>
> P_x_r * W_x_r / T_r == W_x / T
>
> We then calculate the needed weight of OSD.x in Round r. W_x_r is what
> we're trying to solve for!!
>
> W_x_r = W_x / T * T_r / P_x_r
>
> The first term W_x / T is a constant and easy to compute. (For my
> example small OSD, W_x / T = 0.1)
>
> P_x_r is also -- I believe -- simple to compute. P_x_r gets smaller
> for each round and is a function of what happened in the previous
> round:
>
> Round 1: P_x_1 = 1.0
> Round 2: P_x_2 = P_x_1 * (1 - W_x_1 / T_1)
> Round 3: P_x_3 = P_x_2 * (1 - W_x_2 / T_2)
> ...
>
> But T_r is a challenge -- T_r is the sum of W_x_r for all x in round
> r. Hence, the problem is that we don't know T_r until *after* we
> compute all W_x_r's for that round. I tried various ways to estimate
> T_r but didn't make any progress.
>
> Do you think this formulation is correct? Any clever ideas where to go next?
>
Something is wrong, because the system of equations that this gives is
unsolvable.
In round 2 for the 3,3,3,1 OSD set, assuming OSD.0 (weight 3) was chosen
in the first round, we have:
P_1_2 = (1-3/10) = 0.7
P_2_2 = (1-3/10) = 0.7
P_3_2 = (1-1/10) = 0.9
And we know:
W_1 / T = 3/10 = 0.3
W_2 / T = 3/10 = 0.3
W_3 / T = 1/10 = 0.1
So we can describe the whole round:
W_1_2 = W_1 / T * T_2 / P_1_2 = 0.3 * T_2 / 0.7 = 0.4286 T_2
W_2_2 = W_2 / T * T_2 / P_2_2 = 0.3 * T_2 / 0.7 = 0.4286 T_2
W_3_2 = W_3 / T * T_2 / P_3_2 = 0.1 * T_2 / 0.9 = 0.1111 T_2
W_1_2 + W_2_2 + W_3_2 = T_2
Putting this all into a solver gives 0.9683 * T_2 = T_2, which is nonsense.
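A few lines of Python (an editor's sketch, not code from the thread) make the contradiction concrete: since T_2 = sum of W_x_2, the coefficients (W_x / T) / P_x_2 would have to sum to exactly 1 for any positive T_2 to exist, and they don't:

```python
# Round 2 of the 3,3,3,1 example, with OSD.0 (weight 3) taken in round 1.
# The invariant W_x_2 = (W_x / T) * T_2 / P_x_2 plus T_2 = sum(W_x_2)
# forces sum((W_x / T) / P_x_2) == 1; check that sum numerically.
targets = {1: 3, 2: 3, 3: 1}      # OSDs still available in round 2
T = 10                            # total target weight (3 + 3 + 3 + 1)

coeffs = {}
for x, w in targets.items():
    P_x_2 = 1 - w / T             # P(OSD.x still available in round 2)
    coeffs[x] = (w / T) / P_x_2

print({x: round(c, 4) for x, c in coeffs.items()})  # {1: 0.4286, 2: 0.4286, 3: 0.1111}
print(round(sum(coeffs.values()), 4))               # 0.9683, not 1: no T_2 exists
```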
-- Dan
> Cheers, Dan
>
>
>
>
> On Mon, Feb 6, 2017 at 9:31 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Dan,
>>
>> Your script turns out to be a nice self-contained problem statement :-) Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the way Szymon described how to calculate P(E|A) using a probability tree (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
>>
>> Cheers
>>
>> On 02/03/2017 06:37 PM, Dan van der Ster wrote:
>>> Anyway, here's my simple simulation. It might be helpful for testing
>>> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>>
>>> Below is the output using the P(pick small | first pick not small)
>>> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
>>> seems to *almost* work, but only when we have just one small OSD.
>>>
>>> See the end of the script for other various ideas.
>>>
>>> -- Dan
>>>
>>>> python mpa.py
>>> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
>>>
>>> Expected PGs per OSD: {0: 90000, 1: 90000, 2: 90000, 3: 30000}
>>>
>>> Simulating with existing CRUSH
>>>
>>> Observed: {0: 85944, 1: 85810, 2: 85984, 3: 42262}
>>> Observed for Nth replica: [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
>>> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
>>> 26882, 3: 19455}]
>>>
>>> Now trying your new algorithm
>>>
>>> Observed: {0: 89423, 1: 89443, 2: 89476, 3: 31658}
>>> Observed for Nth replica: [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
>>> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
>>> 29875, 3: 11394}]
>>>
>>>
>>> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
>>>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>>>
>>>>>>>> For example,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD. So
>>>>>>>>
>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>> that they weren't already chosen.
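As an editor's sketch (hypothetical standalone code, not the wip-crush-multipick branch), the adjustment quoted above can be simulated directly: draw the first replica with the raw weights, then draw the second with every remaining OSD reweighted to w / (total - w):

```python
import random

random.seed(1)
weights = {0: 3, 1: 3, 2: 3, 3: 1}       # OSD id -> weight; OSD.3 is "small"
total = sum(weights.values())

def pick(w):
    """Weighted random choice over a dict of id -> weight."""
    ids = list(w)
    return random.choices(ids, weights=[w[i] for i in ids])[0]

trials = 200_000
second = {i: 0 for i in weights}
for _ in range(trials):
    first = pick(weights)
    # conditional weights for the second round: w / (total - w)
    cond = {i: w / (total - w) for i, w in weights.items() if i != first}
    second[pick(cond)] += 1

for i in sorted(second):
    print(i, round(second[i] / trials, 3))
```

With the raw weights the small OSD would take about 12.9% of the second replicas (0.9 * 1/7); with the conditional weights its share drops to roughly 10.3%, close to the 10% target, with the small residual skew discussed later in the thread.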
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>
>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>
>>>>>> In
>>>>>>
>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>
>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>
>>>>>> I don't get why we need "* bucketw" at the end ?
>>>>>
>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>> lose precision by dropping down into small integers.
>>>>>
>>>>> I futzed around with this some more last week trying to get the third
>>>>> replica to work and ended up doubting that this piece is correct. The
>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>> slightly from what I would expect from first principles and what I get out
>>>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>>>
>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>> and loads of free time who can tackle this...
>>>>>
>>>>
>>>> I'm *not* that person, but I gave it a go last weekend and realized a
>>>> few things:
>>>>
>>>> 1. We should add the additional constraint that for all PGs assigned
>>>> to an OSD, 1/N of them must be primary replicas, 1/N must be
>>>> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
>>>> a 3 replica pool, the "small" OSD should still have the property that
>>>> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>>>>
>>>> 2. I believe this is a case of the balls-into-bins problem -- we have
>>>> colored balls and weighted bins. I didn't find a definition of the
>>>> problem where the goal is to allow users to specify weights which must
>>>> be respected after N rounds.
>>>>
>>>> 3. I wrote some quick python to simulate different reweighting
>>>> algorithms. The solution is definitely not obvious - I often thought
>>>> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
>>>> changing the OSDs weights to e.g. 3, 3, 1, 1 completely broke things.
>>>> I can clean up and share that Python if it can help.
>>>>
>>>> My gut feeling is that because CRUSH trees and rulesets can be
>>>> arbitrarily complex, the most pragmatic & reliable way to solve this
>>>> problem is to balance the PGs with a reweight-by-pg loop at crush
>>>> compilation time. This is what admins should do now -- we should just
>>>> automate it.
>>>>
>>>> Cheers, Dan
>>>>
>>>> P.S. -- maybe these guys can help: http://math.stackexchange.com/
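The reweight-by-pg idea can be sketched as a feedback loop (an editor's illustration of the approach, hypothetical code rather than anything crushtool actually does): simulate the placement, compare each device's observed share against its target share, and multiplicatively nudge the internal weights until they match:

```python
import random

random.seed(7)
target = {0: 3, 1: 3, 2: 3, 3: 1}        # desired relative weights
T = sum(target.values())
PGS, REPLICAS = 20_000, 2

def simulate(weights):
    """Place PGS * REPLICAS replicas, never reusing an OSD within a PG."""
    counts = {i: 0 for i in weights}
    for _ in range(PGS):
        avail = dict(weights)
        for _ in range(REPLICAS):
            ids = list(avail)
            chosen = random.choices(ids, weights=[avail[i] for i in ids])[0]
            counts[chosen] += 1
            del avail[chosen]
    return counts

adjusted = dict(target)
for _ in range(10):
    counts = simulate(adjusted)
    for i in adjusted:
        observed = counts[i] / (PGS * REPLICAS)
        adjusted[i] *= (target[i] / T) / observed   # nudge toward target share

final = simulate(adjusted)
print({i: round(c / (PGS * REPLICAS), 3) for i, c in final.items()})
```

After a handful of iterations the small OSD's overall share settles near its 10% target instead of the inflated ~11.4% the unadjusted weights give.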
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-06 14:11 ` Jaze Lee
@ 2017-02-06 17:07 ` Loic Dachary
0 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-06 17:07 UTC (permalink / raw)
To: Jaze Lee; +Cc: ceph-devel
On 02/06/2017 03:11 PM, Jaze Lee wrote:
> 2017-02-06 16:18 GMT+08:00 Loic Dachary <loic@dachary.org>:
>> Hi,
>>
>> On 02/06/2017 04:08 AM, Jaze Lee wrote:
>>> It is more complicated than I expected...
>>> I viewed http://tracker.ceph.com/issues/15653 and understand that if the
>>> replica count is bigger than the number of hosts we choose from, we may
>>> hit the problem.
>>>
>>> That is
>>> if we have
>>> host: a b c d
>>> host: e f g h
>>> host: i j k l
>>>
>>> we only choose one OSD from each host for three replicas, and the
>>> distribution is as expected. Right?
>>>
>>>
>>> The problem described in http://tracker.ceph.com/issues/15653 may happen
>>> when
>>> 1)
>>> host: a b c d e f g
>>>
>>> and we choose all three replicas from this host. But this rarely happens
>>> in production. Right?
>>>
>>>
>>> Maybe I do not understand the problem correctly?
>>
>> The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script
>
> Yes, I mean: why would we choose three from one host?
Because it should work even in this specific case. And also because the problem shows up in every situation, not just this specific one.
Cheers
> In production the host count is always greater than the replica count.
>
> root
> rack-0
> host A
> host B
> rack-1
> host C
> host D
> rack -2
> host E
> host F
>
> When choosing OSDs for pg 1.1, it will always choose one from rack-0, one
> from rack-1, and one from rack-2; every PG has one OSD chosen from each
> rack.
>
> The problem happens when we want to choose more than one OSD from
> a bucket for a PG, right?
>
>
>
>>
>> https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>
>> Cheers
>>
>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>>>
>>>>
>>>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>>
>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>> This is a longstanding bug,
>>>>>>>>>
>>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>>
>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>
>>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>>
>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>> discussion here.
>>>>>>>>>
>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>> brain hurt.
>>>>>>>>>
>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>
>>>>>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>>
>>>>>>>>>
>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>
>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>
>>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>
>>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>>>>
>>>>>>>>> For example,
>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>> their second replica be the small OSD. So
>>>>>>>>>
>>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>
>>>>>>>>> Putting those together,
>>>>>>>>>
>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>>
>>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>>> that they weren't already chosen.
>>>>>>>>>
>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>
>>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>>
>>>>>>> In
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>>
>>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>>
>>>>>>> I don't get why we need "* bucketw" at the end ?
>>>>>>
>>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>>> lose precision by dropping down into small integers.
>>>>>>
>>>>>> I futzed around with this some more last week trying to get the third
>>>>>> replica to work and ended up doubting that this piece is correct. The
>>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>>> slightly from what I would expect from first principles and what I get out
>>>>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>>>>
>>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>>> and loads of free time who can tackle this...
>>>>>>
>>>>>
>>>>> It would help to formulate the problem as a self-contained puzzle to present to a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>>>
>>>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following, a bin is a device, a ball is a replica, and the color is the object id.
>>>>
>>>> We have D bins and each bin BIN can hold BIN(B) balls. All balls have the
>>>> same size. There are exactly X balls of each color. Each ball must
>>>> be placed in a bin that does not already contain a ball of the same
>>>> color.
>>>>
>>>> What distribution guarantees that, for all X, the bins are filled in
>>>> the same proportion ?
>>>>
>>>> Details
>>>> =======
>>>>
>>>> * One placement: no two balls share a color and we place each of them
>>>> in a bin with a probability of:
>>>>
>>>> P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>>>
>>>> so that all bins are filled to the same fraction of their capacity.
>>>>
>>>> * Two placements: for each ball there is exactly one other ball of the
>>>> same color. A ball is placed as in experiment 1 and the chosen bin
>>>> is set aside. The other ball of the same color is placed as in
>>>> experiment 1 with the remaining bins. The probability for a ball
>>>> to be placed in a given BIN is:
>>>>
>>>> P(BIN) + P(all bins but BIN | BIN)
>>>>
>>>> Examples
>>>> ========
>>>>
>>>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>>>
>>>> a = 10 million balls
>>>> b = 10 million balls
>>>> c = 10 million balls
>>>> d = 10 million balls
>>>> e = 1 million balls
>>>>
>>>> In the first experiment we place each ball in
>>>>
>>>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>>>> same for b, c, d
>>>> e with a probability of 1 / 41
>>>>
>>>> after 1,000,000 placements, the bins have
>>>>
>>>> a = 243456
>>>> b = 243624
>>>> c = 244486
>>>> d = 243881
>>>> e = 24553
>>>>
>>>> they are
>>>>
>>>> a = 2.43 % full
>>>> b = 2.43 % full
>>>> c = 2.44 % full
>>>> d = 2.43 % full
>>>> e = 2.46 % full
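The first experiment is easy to reproduce (an editor's sketch, not from the thread; note that e holds 1 million balls, so its ~24.4k balls fill it to ~2.44%, the same fraction as the big bins, which is the point of weighting by capacity):

```python
from collections import Counter
import random

random.seed(42)
capacity = {'a': 10_000_000, 'b': 10_000_000, 'c': 10_000_000,
            'd': 10_000_000, 'e': 1_000_000}
bins = list(capacity)
w = [capacity[b] for b in bins]

# one million placements, each bin drawn proportionally to its capacity
placed = Counter(random.choices(bins, weights=w, k=1_000_000))

for b in bins:
    print(b, placed[b], f"{100 * placed[b] / capacity[b]:.2f}% full")
```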
>>>>
>>>> In the second experiment
>>>>
>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>>> current code, you get
>>>>>>>>>
>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>>
>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>> second. With my hacky change,
>>>>>>>>>
>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>>
>>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>>> 1%).
>>>>>>>>>
>>>>>>>>> Next steps:
>>>>>>>>>
>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>> 2- figure out why it skews high
>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-01-26 3:05 crush multipick anomaly Sage Weil
2017-01-26 11:13 ` Loic Dachary
@ 2017-02-13 10:36 ` Loic Dachary
2017-02-13 14:21 ` Sage Weil
2017-02-13 14:53 ` Gregory Farnum
1 sibling, 2 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 10:36 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Hi,
Dan van der Ster reached out to colleagues and friends, and Pedro López-Adeva Fernández-Layos came up with a well-written analysis of the problem and a tentative solution, which he described at https://github.com/plafl/notebooks/blob/master/replication.ipynb
Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken?
Cheers
On 01/26/2017 04:05 AM, Sage Weil wrote:
> This is a longstanding bug,
>
> http://tracker.ceph.com/issues/15653
>
> that causes low-weighted devices to get more data than they should. Loic's
> recent activity resurrected discussion on the original PR
>
> https://github.com/ceph/ceph/pull/10218
>
> but since it's closed and almost nobody will see it I'm moving the
> discussion here.
>
> The main news is that I have a simple adjustment for the weights that
> works (almost perfectly) for the 2nd round of placements. The solution is
> pretty simple, although as with most probabilities it tends to make my
> brain hurt.
>
> The idea is that, on the second round, the original weight for the small
> OSD (call it P(pick small)) isn't what we should use. Instead, we want
> P(pick small | first pick not small). Since P(a|b) (the probability of a
> given b) is P(a && b) / P(b),
>
> P(pick small | first pick not small)
> = P(pick small && first pick not small) / P(first pick not small)
>
> The last term is easy to calculate,
>
> P(first pick not small) = (total_weight - small_weight) / total_weight
>
> and the && term is the distribution we're trying to produce. For example,
> if small has 1/10 the weight, then we should see 1/10th of the PGs have
> their second replica be the small OSD. So
>
> P(pick small && first pick not small) = small_weight / total_weight
>
> Putting those together,
>
> P(pick small | first pick not small)
> = P(pick small && first pick not small) / P(first pick not small)
> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> = small_weight / (total_weight - small_weight)
>
> That is, on the second round, we should adjust the weights by the above so
> that we get the right distribution of second choices. It turns out it
> works to adjust *all* weights like this to get the conditional probability
> that they weren't already chosen.
>
> I have a branch that hacks this into straw2 and it appears to work
> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
> current code, you get
>
> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> rule 0 (data), x = 0..40000000, numrep = 2..2
> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> device 0: 19765965 [9899364,9866601]
> device 1: 19768033 [9899444,9868589]
> device 2: 19769938 [9901770,9868168]
> device 3: 19766918 [9898851,9868067]
> device 6: 929148 [400572,528576]
>
> which is very close for the first replica (primary), but way off for the
> second. With my hacky change,
>
> rule 0 (data), x = 0..40000000, numrep = 2..2
> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> device 0: 19797315 [9899364,9897951]
> device 1: 19799199 [9899444,9899755]
> device 2: 19801016 [9901770,9899246]
> device 3: 19797906 [9898851,9899055]
> device 6: 804566 [400572,403994]
>
> which is quite close, but still skewing slightly high (by a bit less than
> 1%).
>
> Next steps:
>
> 1- generalize this for >2 replicas
> 2- figure out why it skews high
> 3- make this work for multi-level hierarchical descent
>
> sage
>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 10:36 ` Loic Dachary
@ 2017-02-13 14:21 ` Sage Weil
2017-02-13 18:50 ` Loic Dachary
2017-02-16 22:04 ` Pedro López-Adeva
2017-02-13 14:53 ` Gregory Farnum
1 sibling, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-13 14:21 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
On Mon, 13 Feb 2017, Loic Dachary wrote:
> Hi,
>
> Dan van der Ster reached out to colleagues and friends and Pedro
> López-Adeva Fernández-Layos came up with a well written analysis of the
> problem and a tentative solution which he described at :
> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>
> Unless I'm reading the document incorrectly (very possible ;) it also
> means that the probability of each disk needs to take into account the
> weight of all disks. Which means that whenever a disk is added / removed
> or its weight is changed, this has an impact on the probability of all
> disks in the cluster and objects are likely to move everywhere. Am I
> mistaken ?
Maybe (I haven't looked closely at the above yet). But for comparison, in
the normal straw2 case, adding or removing a disk also changes the
probabilities for everything else (e.g., removing one out of 10 identical
disks changes the probability from 1/10 to 1/9). The key property that
straw2 *does* provide is that as long as the relative probabilities
between two unmodified disks do not change, then straw2 will avoid
moving any objects between them (i.e., all data movement is to or from
the disk that is reweighted).
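That property falls straight out of how straw2 draws: each item's draw depends only on the input and that item's own weight, so removing an item cannot change which of the *other* items would have won. A minimal sketch (an editor's illustration using a stand-in hash, not Ceph's fixed-point integer implementation):

```python
import hashlib, math

def straw2_pick(x, items):
    """Pick the item whose hash-derived draw ln(u)/w is largest."""
    best, best_draw = None, None
    for item, weight in items.items():
        h = int.from_bytes(hashlib.sha256(f"{x}:{item}".encode()).digest()[:8], "big")
        u = (h + 1) / 2**64                  # deterministic uniform in (0, 1]
        draw = math.log(u) / weight          # higher weight pulls draw toward 0
        if best_draw is None or draw > best_draw:
            best, best_draw = item, draw
    return best

before = {x: straw2_pick(x, {0: 10, 1: 10, 2: 10, 3: 10}) for x in range(10_000)}
after = {x: straw2_pick(x, {0: 10, 1: 10, 2: 10}) for x in range(10_000)}

moved = [x for x in before if before[x] != after[x]]
# Every object that moved was previously on the removed item 3;
# placements on the unmodified items 0, 1, 2 are untouched.
print(len(moved), all(before[x] == 3 for x in moved))
```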
sage
>
> Cheers
>
> On 01/26/2017 04:05 AM, Sage Weil wrote:
> > This is a longstanding bug,
> >
> > http://tracker.ceph.com/issues/15653
> >
> > that causes low-weighted devices to get more data than they should. Loic's
> > recent activity resurrected discussion on the original PR
> >
> > https://github.com/ceph/ceph/pull/10218
> >
> > but since it's closed and almost nobody will see it I'm moving the
> > discussion here.
> >
> > The main news is that I have a simple adjustment for the weights that
> > works (almost perfectly) for the 2nd round of placements. The solution is
> > pretty simple, although as with most probabilities it tends to make my
> > brain hurt.
> >
> > The idea is that, on the second round, the original weight for the small
> > OSD (call it P(pick small)) isn't what we should use. Instead, we want
> > P(pick small | first pick not small). Since P(a|b) (the probability of a
> > given b) is P(a && b) / P(b),
> >
> > P(pick small | first pick not small)
> > = P(pick small && first pick not small) / P(first pick not small)
> >
> > The last term is easy to calculate,
> >
> > P(first pick not small) = (total_weight - small_weight) / total_weight
> >
> > and the && term is the distribution we're trying to produce. For example,
> > if small has 1/10 the weight, then we should see 1/10th of the PGs have
> > their second replica be the small OSD. So
> >
> > P(pick small && first pick not small) = small_weight / total_weight
> >
> > Putting those together,
> >
> > P(pick small | first pick not small)
> > = P(pick small && first pick not small) / P(first pick not small)
> > = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> > = small_weight / (total_weight - small_weight)
> >
> > That is, on the second round, we should adjust the weights by the above so
> > that we get the right distribution of second choices. It turns out it
> > works to adjust *all* weights like this to get the conditional probability
> > that they weren't already chosen.
> >
> > I have a branch that hacks this into straw2 and it appears to work
> > properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
> > current code, you get
> >
> > $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> > rule 0 (data), x = 0..40000000, numrep = 2..2
> > rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> > device 0: 19765965 [9899364,9866601]
> > device 1: 19768033 [9899444,9868589]
> > device 2: 19769938 [9901770,9868168]
> > device 3: 19766918 [9898851,9868067]
> > device 6: 929148 [400572,528576]
> >
> > which is very close for the first replica (primary), but way off for the
> > second. With my hacky change,
> >
> > rule 0 (data), x = 0..40000000, numrep = 2..2
> > rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> > device 0: 19797315 [9899364,9897951]
> > device 1: 19799199 [9899444,9899755]
> > device 2: 19801016 [9901770,9899246]
> > device 3: 19797906 [9898851,9899055]
> > device 6: 804566 [400572,403994]
> >
> > which is quite close, but still skewing slightly high (by a bit less than
> > 1%).
> >
> > Next steps:
> >
> > 1- generalize this for >2 replicas
> > 2- figure out why it skews high
> > 3- make this work for multi-level hierarchical descent
> >
> > sage
> >
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 10:36 ` Loic Dachary
2017-02-13 14:21 ` Sage Weil
@ 2017-02-13 14:53 ` Gregory Farnum
2017-02-20 8:47 ` Loic Dachary
1 sibling, 1 reply; 70+ messages in thread
From: Gregory Farnum @ 2017-02-13 14:53 UTC (permalink / raw)
To: Loic Dachary; +Cc: Sage Weil, ceph-devel
On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi,
>
> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>
> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken?
Keep in mind that in the math presented, "all disks" for our purposes
really means "all items within a CRUSH bucket" (at least, best I can
tell). So if you reweight a disk, you have to recalculate weights
within its bucket and within each parent bucket, but each bucket has a
bounded size N so the calculation should remain feasible. I didn't
step through the more complicated math at the end but it made
intuitive sense as far as I went.
-Greg
>
> Cheers
>
> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> This is a longstanding bug,
>>
>> http://tracker.ceph.com/issues/15653
>>
>> that causes low-weighted devices to get more data than they should. Loic's
>> recent activity resurrected discussion on the original PR
>>
>> https://github.com/ceph/ceph/pull/10218
>>
>> but since it's closed and almost nobody will see it I'm moving the
>> discussion here.
>>
>> The main news is that I have a simple adjustment for the weights that
>> works (almost perfectly) for the 2nd round of placements. The solution is
>> pretty simple, although as with most probabilities it tends to make my
>> brain hurt.
>>
>> The idea is that, on the second round, the original weight for the small
>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>> given b) is P(a && b) / P(b),
>>
>> P(pick small | first pick not small)
>> = P(pick small && first pick not small) / P(first pick not small)
>>
>> The last term is easy to calculate,
>>
>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>
>> and the && term is the distribution we're trying to produce. For example,
>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> their second replica be the small OSD. So
>>
>> P(pick small && first pick not small) = small_weight / total_weight
>>
>> Putting those together,
>>
>> P(pick small | first pick not small)
>> = P(pick small && first pick not small) / P(first pick not small)
>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>> = small_weight / (total_weight - small_weight)
>>
>> That is, on the second round, we should adjust the weights by the above so
>> that we get the right distribution of second choices. It turns out it
>> works to adjust *all* weights like this to get the conditional probability
>> that they weren't already chosen.
>>
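[Editor's note: the adjusted weights can be sanity-checked numerically. The sketch below is an idealized Python model of two sequential weighted picks without replacement, not the straw2 implementation itself, using the [99 99 99 99 4] bucket from the crushtool test further down.]

```python
# Idealized model of a 2-replica placement: first pick weighted by w,
# second pick weighted over the remaining devices, with or without the
# w / (W - w) conditional adjustment.  Not the straw2 code itself.
weights = [99.0, 99.0, 99.0, 99.0, 4.0]   # the test bucket below
W = sum(weights)

def second_pick_share(adjust):
    # P(second replica = j), summing over every first choice i != j.
    w2 = [w / (W - w) for w in weights] if adjust else weights
    p = [0.0] * len(weights)
    for i, wi in enumerate(weights):
        rest = sum(w2[j] for j in range(len(weights)) if j != i)
        for j, wj in enumerate(w2):
            if j != i:
                p[j] += (wi / W) * (wj / rest)
    return p

small = len(weights) - 1
target = weights[small] / W                     # 0.01
plain = second_pick_share(adjust=False)[small]  # ~0.0132: ~32% too high
fixed = second_pick_share(adjust=True)[small]   # ~0.0100: close, a hair high
```

In this model the unadjusted weights give the small device about 1.32% of the second replicas instead of 1%, and the adjusted weights bring it to about 1.003%, still fractionally high, which is consistent with the small residual skew reported below.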
>> I have a branch that hacks this into straw2 and it appears to work
>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>> current code, you get
>>
>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>> device 0: 19765965 [9899364,9866601]
>> device 1: 19768033 [9899444,9868589]
>> device 2: 19769938 [9901770,9868168]
>> device 3: 19766918 [9898851,9868067]
>> device 6: 929148 [400572,528576]
>>
>> which is very close for the first replica (primary), but way off for the
>> second. With my hacky change,
>>
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>> device 0: 19797315 [9899364,9897951]
>> device 1: 19799199 [9899444,9899755]
>> device 2: 19801016 [9901770,9899246]
>> device 3: 19797906 [9898851,9899055]
>> device 6: 804566 [400572,403994]
>>
>> which is quite close, but still skewing slightly high (by a bit less than
>> 1%).
>>
>> Next steps:
>>
>> 1- generalize this for >2 replicas
>> 2- figure out why it skews high
>> 3- make this work for multi-level hierarchical descent
>>
>> sage
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 14:21 ` Sage Weil
@ 2017-02-13 18:50 ` Loic Dachary
2017-02-13 19:16 ` Sage Weil
2017-02-16 22:04 ` Pedro López-Adeva
1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 18:50 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
[-- Attachment #1: Type: text/plain, Size: 7029 bytes --]
Hi Sage,
I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
00 01 02 03 04 05 06 07 08 09 10
00: 0 14 17 14 19 23 13 22 21 20 1800
01: 12 0 11 13 19 19 15 10 16 17 1841
02: 17 27 0 17 15 15 13 19 18 11 1813
03: 14 17 15 0 23 11 20 15 23 17 1792
04: 14 18 16 25 0 27 13 8 15 16 1771
05: 19 16 22 25 13 0 9 19 21 21 1813
06: 18 15 21 17 10 18 0 10 18 11 1873
07: 13 17 22 13 16 17 14 0 25 12 1719
08: 23 20 16 17 19 18 11 12 0 18 1830
09: 14 20 15 17 12 16 17 11 13 0 1828
10: 0 0 0 0 0 0 0 0 0 0 0
before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
Each line shows how many objects moved from a given disk to the others after disk 10 was added. Most objects go to the new disk and around 1% go to each of the other disks. The before and after lines show how many objects are mapped to each disk. They all have the same weight and it's using replica 2 and straw2. Does that look right?
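[Editor's note: the bookkeeping behind this table boils down to a per-replica positional comparison of the two mappings. A condensed sketch in Python with illustrative toy data; the full C version is attached below as compare.c.]

```python
# Tally how many replicas moved from device src to device dst between two
# mappings, each shaped [replica][object] -> device id.
def movement_matrix(before, after, devices):
    moves = [[0] * devices for _ in range(devices)]
    for obj in range(len(before[0])):
        for r in range(len(before)):
            src, dst = before[r][obj], after[r][obj]
            if src != dst:
                moves[src][dst] += 1
    return moves

# Toy data: 2 replicas, 4 objects; objects 1 and 2 each lose one replica
# to the newly added device 3.
before = [[0, 1, 2, 0],
          [1, 2, 0, 1]]
after  = [[0, 3, 2, 0],
          [1, 2, 3, 1]]
m = movement_matrix(before, after, devices=4)   # m[1][3] == 1, m[0][3] == 1
```

Note that, like the C program with its same_set check commented out, a positional comparison counts a swap of replica positions as two moves.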
Cheers
On 02/13/2017 03:21 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends and Pedro
>> López-Adeva Fernández-Layos came up with a well written analysis of the
>> problem and a tentative solution which he described at :
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also
>> means that the probability of each disk needs to take into account the
>> weight of all disks. Which means that whenever a disk is added / removed
>> or its weight is changed, this has an impact on the probability of all
>> disks in the cluster and objects are likely to move everywhere. Am I
>> mistaken ?
>
> Maybe (I haven't looked closely at the above yet). But for comparison, in
> the normal straw2 case, adding or removing a disk also changes the
> probabilities for everything else (e.g., removing one out of 10 identical
> disks changes the probability from 1/10 to 1/9). The key property that
> straw2 *is* able to handle is that as long as the relative probabilities
> between two unmodified disks do not change, then straw2 will avoid
> moving any objects between them (i.e., all data movement is to or from
> the disk that is reweighted).
>
> sage
>
>
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> This is a longstanding bug,
>>>
>>> http://tracker.ceph.com/issues/15653
>>>
>>> that causes low-weighted devices to get more data than they should. Loic's
>>> recent activity resurrected discussion on the original PR
>>>
>>> https://github.com/ceph/ceph/pull/10218
>>>
>>> but since it's closed and almost nobody will see it I'm moving the
>>> discussion here.
>>>
>>> The main news is that I have a simple adjustment for the weights that
>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>> pretty simple, although as with most probabilities it tends to make my
>>> brain hurt.
>>>
>>> The idea is that, on the second round, the original weight for the small
>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>> given b) is P(a && b) / P(b),
>>>
>>> P(pick small | first pick not small)
>>> = P(pick small && first pick not small) / P(first pick not small)
>>>
>>> The last term is easy to calculate,
>>>
>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>
>>> and the && term is the distribution we're trying to produce. For example,
>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>> their second replica be the small OSD. So
>>>
>>> P(pick small && first pick not small) = small_weight / total_weight
>>>
>>> Putting those together,
>>>
>>> P(pick small | first pick not small)
>>> = P(pick small && first pick not small) / P(first pick not small)
>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>> = small_weight / (total_weight - small_weight)
>>>
>>> That is, on the second round, we should adjust the weights by the above so
>>> that we get the right distribution of second choices. It turns out it
>>> works to adjust *all* weights like this to get the conditional probability
>>> that they weren't already chosen.
>>>
>>> I have a branch that hacks this into straw2 and it appears to work
>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>> current code, you get
>>>
>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>> device 0: 19765965 [9899364,9866601]
>>> device 1: 19768033 [9899444,9868589]
>>> device 2: 19769938 [9901770,9868168]
>>> device 3: 19766918 [9898851,9868067]
>>> device 6: 929148 [400572,528576]
>>>
>>> which is very close for the first replica (primary), but way off for the
>>> second. With my hacky change,
>>>
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>> device 0: 19797315 [9899364,9897951]
>>> device 1: 19799199 [9899444,9899755]
>>> device 2: 19801016 [9901770,9899246]
>>> device 3: 19797906 [9898851,9899055]
>>> device 6: 804566 [400572,403994]
>>>
>>> which is quite close, but still skewing slightly high (by a bit less than
>>> 1%).
>>>
>>> Next steps:
>>>
>>> 1- generalize this for >2 replicas
>>> 2- figure out why it skews high
>>> 3- make this work for multi-level hierarchical descent
>>>
>>> sage
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
--
Loïc Dachary, Artisan Logiciel Libre
[-- Attachment #2: compare.c --]
[-- Type: text/x-csrc, Size: 4795 bytes --]
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "mapper.h"
#include "builder.h"
#include "crush.h"
#include "hash.h"

#define NUMBER_OF_OBJECTS 100000

void map_with_crush(int replication_count, int hosts_count, int object_map[][NUMBER_OF_OBJECTS]) {
	struct crush_map *m = crush_create();
	m->choose_local_tries = 0;
	m->choose_local_fallback_tries = 0;
	m->choose_total_tries = 50;
	m->chooseleaf_descend_once = 1;
	m->chooseleaf_vary_r = 1;
	m->chooseleaf_stable = 1;
	m->allowed_bucket_algs =
		(1 << CRUSH_BUCKET_UNIFORM) |
		(1 << CRUSH_BUCKET_LIST) |
		(1 << CRUSH_BUCKET_STRAW2);
	int root_type = 1;
	int host_type = 2;
	int bucketno = 0;
	int hosts[hosts_count];
	int weights[hosts_count];
	int disk = 0;
	/* one host per disk, each with the same 0x10000 weight */
	for (int host = 0; host < hosts_count; host++) {
		struct crush_bucket *b;
		b = crush_make_bucket(m, CRUSH_BUCKET_STRAW2, CRUSH_HASH_DEFAULT, host_type,
				      0, NULL, NULL);
		assert(b != NULL);
		assert(crush_bucket_add_item(m, b, disk, 0x10000) == 0);
		assert(crush_add_bucket(m, 0, b, &bucketno) == 0);
		hosts[host] = bucketno;
		weights[host] = 0x10000;
		disk++;
	}
	struct crush_bucket *root;
	int bucket_root;
	root = crush_make_bucket(m, CRUSH_BUCKET_STRAW2, CRUSH_HASH_DEFAULT, root_type,
				 hosts_count, hosts, weights);
	assert(root != NULL);
	assert(crush_add_bucket(m, 0, root, &bucket_root) == 0);
	assert(crush_reweight_bucket(m, root) == 0);
	struct crush_rule *r;
	int minsize = 1;
	int maxsize = 5;
	int number_of_steps = 3;
	r = crush_make_rule(number_of_steps, 0, 0, minsize, maxsize);
	assert(r != NULL);
	crush_rule_set_step(r, 0, CRUSH_RULE_TAKE, bucket_root, 0);
	crush_rule_set_step(r, 1, CRUSH_RULE_CHOOSELEAF_FIRSTN, replication_count, host_type);
	crush_rule_set_step(r, 2, CRUSH_RULE_EMIT, 0, 0);
	int ruleno = crush_add_rule(m, r, -1);
	assert(ruleno >= 0);
	crush_finalize(m);
	{
		int result[replication_count];
		__u32 weights[hosts_count];
		for (int i = 0; i < hosts_count; i++)
			weights[i] = 0x10000;
		int cwin_size = crush_work_size(m, replication_count);
		char cwin[cwin_size];
		crush_init_workspace(m, cwin);
		for (int x = 0; x < NUMBER_OF_OBJECTS; x++) {
			memset(result, '\0', sizeof(int) * replication_count);
			assert(crush_do_rule(m, ruleno, x, result, replication_count,
					     weights, hosts_count, cwin) == replication_count);
			for (int i = 0; i < replication_count; i++)
				object_map[i][x] = result[i];
		}
	}
	crush_destroy(m);
}

/* 1 if the replica *set* of the object is the same in both mappings,
 * ignoring replica order */
int same_set(int object, int replication_count, int before[][NUMBER_OF_OBJECTS], int after[][NUMBER_OF_OBJECTS]) {
	for (int r = 0; r < replication_count; r++) {
		int found = 0;
		for (int s = 0; s < replication_count; s++)
			if (before[r][object] == after[s][object]) {
				found = 1;
				break;
			}
		if (!found)
			return 0;
	}
	return 1;
}

void with_crush(int replication_count, int hosts_count) {
	int before[replication_count][NUMBER_OF_OBJECTS];
	map_with_crush(replication_count, hosts_count, &before[0]);
	int after[replication_count][NUMBER_OF_OBJECTS];
	map_with_crush(replication_count, hosts_count + 1, &after[0]);
	int movement[hosts_count + 1][hosts_count + 1];
	memset(movement, '\0', sizeof(movement));
	int count_before[hosts_count + 1];
	memset(count_before, '\0', sizeof(count_before));
	int count_after[hosts_count + 1];
	memset(count_after, '\0', sizeof(count_after));
	for (int object = 0; object < NUMBER_OF_OBJECTS; object++) {
		/* uncomment to skip objects whose replica set is unchanged
		 * (i.e. only the replica order moved):
		 * if (same_set(object, replication_count, &before[0], &after[0]))
		 *	continue;
		 */
		for (int replica = 0; replica < replication_count; replica++) {
			count_before[before[replica][object]]++;
			count_after[after[replica][object]]++;
			if (before[replica][object] == after[replica][object])
				continue;
			movement[before[replica][object]][after[replica][object]]++;
		}
	}
	printf("    ");
	for (int host = 0; host < hosts_count + 1; host++)
		printf(" %02d ", host);
	printf("\n");
	for (int from = 0; from < hosts_count + 1; from++) {
		printf("%02d: ", from);
		for (int to = 0; to < hosts_count + 1; to++)
			printf("%6d ", movement[from][to]);
		printf("\n");
	}
	printf("before: ");
	for (int host = 0; host < hosts_count + 1; host++)
		printf("%6d ", count_before[host]);
	printf("\n");
	printf("after:  ");
	for (int host = 0; host < hosts_count + 1; host++)
		printf("%6d ", count_after[host]);
	printf("\n");
}

int main(int argc, char *argv[]) {
	if (argc != 3) {
		fprintf(stderr, "usage: %s <replication_count> <hosts_count>\n", argv[0]);
		return 1;
	}
	int replication_count = atoi(argv[1]);
	int hosts_count = atoi(argv[2]);
	with_crush(replication_count, hosts_count);
	return 0;
}

/*
 * Local Variables:
 * compile-command: "gcc -g -o compare compare.c $(pkg-config --cflags --libs libcrush) && ./compare 2 10"
 * End:
 */
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 18:50 ` Loic Dachary
@ 2017-02-13 19:16 ` Sage Weil
2017-02-13 20:18 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-02-13 19:16 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 8071 bytes --]
On Mon, 13 Feb 2017, Loic Dachary wrote:
> Hi Sage,
>
> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>
> 00 01 02 03 04 05 06 07 08 09 10
> 00: 0 14 17 14 19 23 13 22 21 20 1800
> 01: 12 0 11 13 19 19 15 10 16 17 1841
> 02: 17 27 0 17 15 15 13 19 18 11 1813
> 03: 14 17 15 0 23 11 20 15 23 17 1792
> 04: 14 18 16 25 0 27 13 8 15 16 1771
> 05: 19 16 22 25 13 0 9 19 21 21 1813
> 06: 18 15 21 17 10 18 0 10 18 11 1873
> 07: 13 17 22 13 16 17 14 0 25 12 1719
> 08: 23 20 16 17 19 18 11 12 0 18 1830
> 09: 14 20 15 17 12 16 17 11 13 0 1828
> 10: 0 0 0 0 0 0 0 0 0 0 0
>
> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>
>
> Each line shows how many objects moved from a given disk to the others
> after disk 10 was added. Most objects go to the new disk and around 1%
> go to each of the other disks. The before and after lines show how many objects
> are mapped to each disk. They all have the same weight and it's using
> replica 2 and straw2. Does that look right?
Hmm, that doesn't look right. This is what the CRUSH.straw2_reweight unit
test is there to validate: that data only moves to or from the device whose
weight changed.
It also follows from the straw2 algorithm itself: each possible choice
gets a 'straw' length derived only from that item's weight (and other
fixed factors, like the item id and the bucket id), and we select the max
across all items. Two devices whose weights didn't change will have the
same straw lengths, and the max between them will not change. It's only
possible that the changed item's straw length changed and wasn't max and
now is (got longer) or was max and now isn't (got shorter).
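[Editor's note: the straw-length argument can be demonstrated with a toy model. This is a Python sketch, not the actual straw2 code: the per-(x, item) hash is stood in for by a seeded PRNG, and ln(u)/w plays the role of the scaled straw draw.]

```python
import math
import random

def straw2_choose(item_weights, x):
    # One straw per item, derived only from (x, item) and the item's
    # weight; the largest draw wins.  ln(u)/w is the exponential trick
    # that gives item i the win with probability w_i / sum(w).
    best_item, best_draw = None, -math.inf
    for item, w in sorted(item_weights.items()):
        u = random.Random(f"{x}:{item}").random()  # stand-in for the hash
        draw = math.log(u) / w   # negative; larger w pulls it toward 0
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item

# Add an 11th equal-weight device, as in the experiment above (replica 1).
before = {i: 1.0 for i in range(10)}
after = {**before, 10: 1.0}
pairs = [(straw2_choose(before, x), straw2_choose(after, x))
         for x in range(20000)]
moved = [(b, a) for b, a in pairs if b != a]
# The ten unchanged devices keep identical straws in both maps, so the
# max among them cannot change: every move is onto the new device.
assert all(a == 10 for _, a in moved)
```

The same reasoning covers reweighting an existing item: only its straw changes, so it can only newly win or newly lose. The multipick anomaly lives in the second and later rounds, which this single-pick model does not cover.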
sage
>
> Cheers
>
> On 02/13/2017 03:21 PM, Sage Weil wrote:
> > On Mon, 13 Feb 2017, Loic Dachary wrote:
> >> Hi,
> >>
> >> Dan van der Ster reached out to colleagues and friends and Pedro
> >> López-Adeva Fernández-Layos came up with a well written analysis of the
> >> problem and a tentative solution which he described at :
> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>
> >> Unless I'm reading the document incorrectly (very possible ;) it also
> >> means that the probability of each disk needs to take into account the
> >> weight of all disks. Which means that whenever a disk is added / removed
> >> or its weight is changed, this has an impact on the probability of all
> >> disks in the cluster and objects are likely to move everywhere. Am I
> >> mistaken ?
> >
> > Maybe (I haven't looked closely at the above yet). But for comparison, in
> > the normal straw2 case, adding or removing a disk also changes the
> > probabilities for everything else (e.g., removing one out of 10 identical
> > disks changes the probability from 1/10 to 1/9). The key property that
> > straw2 *is* able to handle is that as long as the relative probabilities
> > between two unmodified disks do not change, then straw2 will avoid
> > moving any objects between them (i.e., all data movement is to or from
> > the disk that is reweighted).
> >
> > sage
> >
> >
> >>
> >> Cheers
> >>
> >> On 01/26/2017 04:05 AM, Sage Weil wrote:
> >>> This is a longstanding bug,
> >>>
> >>> http://tracker.ceph.com/issues/15653
> >>>
> >>> that causes low-weighted devices to get more data than they should. Loic's
> >>> recent activity resurrected discussion on the original PR
> >>>
> >>> https://github.com/ceph/ceph/pull/10218
> >>>
> >>> but since it's closed and almost nobody will see it I'm moving the
> >>> discussion here.
> >>>
> >>> The main news is that I have a simple adjustment for the weights that
> >>> works (almost perfectly) for the 2nd round of placements. The solution is
> >>> pretty simple, although as with most probabilities it tends to make my
> >>> brain hurt.
> >>>
> >>> The idea is that, on the second round, the original weight for the small
> >>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
> >>> P(pick small | first pick not small). Since P(a|b) (the probability of a
> >>> given b) is P(a && b) / P(b),
> >>>
> >>> P(pick small | first pick not small)
> >>> = P(pick small && first pick not small) / P(first pick not small)
> >>>
> >>> The last term is easy to calculate,
> >>>
> >>> P(first pick not small) = (total_weight - small_weight) / total_weight
> >>>
> >>> and the && term is the distribution we're trying to produce. For example,
> >>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
> >>> their second replica be the small OSD. So
> >>>
> >>> P(pick small && first pick not small) = small_weight / total_weight
> >>>
> >>> Putting those together,
> >>>
> >>> P(pick small | first pick not small)
> >>> = P(pick small && first pick not small) / P(first pick not small)
> >>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >>> = small_weight / (total_weight - small_weight)
> >>>
> >>> That is, on the second round, we should adjust the weights by the above so
> >>> that we get the right distribution of second choices. It turns out it
> >>> works to adjust *all* weights like this to get the conditional probability
> >>> that they weren't already chosen.
> >>>
> >>> I have a branch that hacks this into straw2 and it appears to work
> >>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
> >>> current code, you get
> >>>
> >>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> >>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> >>> device 0: 19765965 [9899364,9866601]
> >>> device 1: 19768033 [9899444,9868589]
> >>> device 2: 19769938 [9901770,9868168]
> >>> device 3: 19766918 [9898851,9868067]
> >>> device 6: 929148 [400572,528576]
> >>>
> >>> which is very close for the first replica (primary), but way off for the
> >>> second. With my hacky change,
> >>>
> >>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> >>> device 0: 19797315 [9899364,9897951]
> >>> device 1: 19799199 [9899444,9899755]
> >>> device 2: 19801016 [9901770,9899246]
> >>> device 3: 19797906 [9898851,9899055]
> >>> device 6: 804566 [400572,403994]
> >>>
> >>> which is quite close, but still skewing slightly high (by a bit less than
> >>> 1%).
> >>>
> >>> Next steps:
> >>>
> >>> 1- generalize this for >2 replicas
> >>> 2- figure out why it skews high
> >>> 3- make this work for multi-level hierarchical descent
> >>>
> >>> sage
> >>>
> >>>
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 19:16 ` Sage Weil
@ 2017-02-13 20:18 ` Loic Dachary
2017-02-13 21:01 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 20:18 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 02/13/2017 08:16 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi Sage,
>>
>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>>
>> 00 01 02 03 04 05 06 07 08 09 10
>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>
>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>
>>
>> Each line shows how many objects moved from a given disk to the others
>> after disk 10 was added. Most objects go to the new disk and around 1%
>> go to each of the other disks. The before and after lines show how many objects
>> are mapped to each disk. They all have the same weight and it's using
>> replica 2 and straw2. Does that look right?
>
> Hmm, that doesn't look right. This is what the CRUSH.straw2_reweight unit
>> test is there to validate: that data only moves to or from the device whose
> weight changed.
In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops over all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get, though...
> It also follows from the straw2 algorithm itself: each possible choice
> gets a 'straw' length derived only from that item's weight (and other
> fixed factors, like the item id and the bucket id), and we select the max
> across all items. Two devices whose weights didn't change will have the
> same straw lengths, and the max between them will not change. It's only
> possible that the changed item's straw length changed and wasn't max and
> now is (got longer) or was max and now isn't (got shorter).
That's a crystal clear explanation, cool :-)
Cheers
> sage
>
>
>>
>> Cheers
>>
>> On 02/13/2017 03:21 PM, Sage Weil wrote:
>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>> problem and a tentative solution which he described at :
>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>> means that the probability of each disk needs to take into account the
>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>> or its weight is changed, this has an impact on the probability of all
>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>> mistaken ?
>>>
>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>> the normal straw2 case, adding or removing a disk also changes the
>>> probabilities for everything else (e.g., removing one out of 10 identical
>>> disks changes the probability from 1/10 to 1/9). The key property that
>>> straw2 *is* able to handle is that as long as the relative probabilities
>>> between two unmodified disks do not change, then straw2 will avoid
>>> moving any objects between them (i.e., all data movement is to or from
>>> the disk that is reweighted).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Cheers
>>>>
>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>> This is a longstanding bug,
>>>>>
>>>>> http://tracker.ceph.com/issues/15653
>>>>>
>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>> recent activity resurrected discussion on the original PR
>>>>>
>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>
>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>> discussion here.
>>>>>
>>>>> The main news is that I have a simple adjustment for the weights that
>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>> brain hurt.
>>>>>
>>>>> The idea is that, on the second round, the original weight for the small
>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>> P(pick small | first pick not small)
>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>
>>>>> The last term is easy to calculate,
>>>>>
>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>
>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>> their second replica be the small OSD. So
>>>>>
>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>
>>>>> Putting those together,
>>>>>
>>>>> P(pick small | first pick not small)
>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>> = small_weight / (total_weight - small_weight)
>>>>>
>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>> that we get the right distribution of second choices. It turns out it
>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>> that they weren't already chosen.
>>>>>
>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>> current code, you get
>>>>>
>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>> device 0: 19765965 [9899364,9866601]
>>>>> device 1: 19768033 [9899444,9868589]
>>>>> device 2: 19769938 [9901770,9868168]
>>>>> device 3: 19766918 [9898851,9868067]
>>>>> device 6: 929148 [400572,528576]
>>>>>
>>>>> which is very close for the first replica (primary), but way off for the
>>>>> second. With my hacky change,
>>>>>
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>> device 0: 19797315 [9899364,9897951]
>>>>> device 1: 19799199 [9899444,9899755]
>>>>> device 2: 19801016 [9901770,9899246]
>>>>> device 3: 19797906 [9898851,9899055]
>>>>> device 6: 804566 [400572,403994]
>>>>>
>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>> 1%).
>>>>>
>>>>> Next steps:
>>>>>
>>>>> 1- generalize this for >2 replicas
>>>>> 2- figure out why it skews high
>>>>> 3- make this work for multi-level hierarchical descent
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 20:18 ` Loic Dachary
@ 2017-02-13 21:01 ` Loic Dachary
2017-02-13 21:15 ` Sage Weil
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 21:01 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
I get the expected behavior for replica 1 (which is what CRUSH.straw2_reweight does). The movement between buckets observed below is for replica 2.
00 01 02 03 04 05 06 07 08 09 10
00: 0 0 0 0 0 0 0 0 0 0 927
01: 0 0 0 0 0 0 0 0 0 0 904
02: 0 0 0 0 0 0 0 0 0 0 928
03: 0 0 0 0 0 0 0 0 0 0 886
04: 0 0 0 0 0 0 0 0 0 0 927
05: 0 0 0 0 0 0 0 0 0 0 927
06: 0 0 0 0 0 0 0 0 0 0 930
07: 0 0 0 0 0 0 0 0 0 0 842
08: 0 0 0 0 0 0 0 0 0 0 943
09: 0 0 0 0 0 0 0 0 0 0 904
10: 0 0 0 0 0 0 0 0 0 0 0
before: 10149 10066 9893 9955 10030 10025 9895 10013 10008 9966 0
after: 9222 9162 8965 9069 9103 9098 8965 9171 9065 9062 9118
On 02/13/2017 09:18 PM, Loic Dachary wrote:
>
>
> On 02/13/2017 08:16 PM, Sage Weil wrote:
>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10
>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>
>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>
>>>
>>> Each line shows how many objects moved from a given disk to the others
>>> after disk 10 was added. Most objects go to the new disk and around 1%
>>> go to each of the other disks. The before and after lines show how many objects
>>> are mapped to each disk. They all have the same weight and it's using
>>> replica 2 and straw2. Does that look right?
>>
>> Hmm, that doesn't look right. This is what the CRUSH.straw2_reweight unit
>> test is there to validate: that data only moves to or from the device whose
>> weight changed.
>
> In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops on all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get though...
>
>> It also follows from the straw2 algorithm itself: each possible choice
>> gets a 'straw' length derived only from that item's weight (and other
>> fixed factors, like the item id and the bucket id), and we select the max
>> across all items. Two devices whose weights didn't change will have the
>> same straw lengths, and the max between them will not change. It's only
>> possible that the changed item's straw length changed and wasn't max and
>> now is (got longer) or was max and now isn't (got shorter).
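That argument can be sketched in a few lines of Python. This is a toy model of straw2, not the real C implementation (a SHA-256 draw stands in for the mixed integer hash, and weights are plain floats rather than fixed point), but it exhibits the same stability property:

```python
import hashlib
import math

def hash_to_unit(x, item, bucket_id):
    """Deterministic draw in (0, 1] per (input, item, bucket); a stand-in
    for the mixed integer hash the real implementation uses."""
    h = hashlib.sha256(f"{x}-{item}-{bucket_id}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / 2.0 ** 64

def straw2_choose(x, items, weights, bucket_id=0):
    """Each item gets straw = ln(u)/w.  ln(u) <= 0, so a larger weight
    pulls the straw toward 0, and taking the max selects item i with
    probability w_i / sum(w)."""
    best, best_straw = None, None
    for item, w in zip(items, weights):
        if w <= 0:
            continue
        s = math.log(hash_to_unit(x, item, bucket_id)) / w
        if best_straw is None or s > best_straw:
            best, best_straw = item, s
    return best

# Doubling item 4's weight changes only item 4's straw, so every input
# that maps differently must have moved *to* item 4:
items = list(range(5))
before = [straw2_choose(x, items, [1, 1, 1, 1, 1]) for x in range(10000)]
after = [straw2_choose(x, items, [1, 1, 1, 1, 2]) for x in range(10000)]
assert all(a == 4 for b, a in zip(before, after) if b != a)
```

The final assertion is exactly the property described above: inputs mapped to unmodified items never move between them.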
>
> That's a crystal clear explanation, cool :-)
>
> Cheers
>
>> sage
>>
>>
>>>
>>> Cheers
>>>
>>> On 02/13/2017 03:21 PM, Sage Weil wrote:
>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>> Hi,
>>>>>
>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>> problem and a tentative solution which he described at :
>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>
>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>> means that the probability of each disk needs to take into account the
>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>> or its weight is changed, this has an impact on the probability of all
>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>> mistaken ?
>>>>
>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>> the normal straw2 case, adding or removing a disk also changes the
>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>> between two unmodified disks do not change, then straw2 will avoid
>>>> moving any objects between them (i.e., all data movement is to or from
>>>> the disk that is reweighted).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD. So
>>>>>>
>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>> that they weren't already chosen.
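The derivation can be checked numerically with a quick simulation (a sketch only: the weights mirror the [99 99 99 99 4] test bucket, and the sampling scheme, first pick by raw weight, then second pick among the rest by w/(total-w), is an idealization rather than the actual straw2 retry loop):

```python
import random

def second_pick_fraction(weights, trials=200_000, seed=1):
    """Two-round simulation: the first pick uses the raw weights; the
    second pick excludes the first and uses w_i / (total - w_i).
    Returns the share of second replicas landing on each device."""
    rng = random.Random(seed)
    total = sum(weights)
    adjusted = [w / (total - w) for w in weights]
    idx = list(range(len(weights)))
    hits = [0] * len(weights)
    for _ in range(trials):
        first = rng.choices(idx, weights=weights)[0]
        rest = [i for i in idx if i != first]
        second = rng.choices(rest, weights=[adjusted[i] for i in rest])[0]
        hits[second] += 1
    return [h / trials for h in hits]

# With [99, 99, 99, 99, 4] the small device should get ~4/400 = 1% of
# the second replicas; the simulation lands close to that, a touch high.
fractions = second_pick_fraction([99, 99, 99, 99, 4])
assert abs(fractions[4] - 0.01) < 0.003
```

The simulated share for the small device comes out very near, but slightly above, the 1% target, consistent with the residual skew reported below for the real straw2 hack.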
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>> current code, you get
>>>>>>
>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>> device 6: 929148 [400572,528576]
>>>>>>
>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>> second. With my hacky change,
>>>>>>
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>> device 6: 804566 [400572,403994]
>>>>>>
>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>> 1%).
>>>>>>
>>>>>> Next steps:
>>>>>>
>>>>>> 1- generalize this for >2 replicas
>>>>>> 2- figure out why it skews high
>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>
>>>>>> sage
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 21:01 ` Loic Dachary
@ 2017-02-13 21:15 ` Sage Weil
2017-02-13 21:19 ` Gregory Farnum
2017-02-13 21:43 ` Loic Dachary
0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-13 21:15 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
On Mon, 13 Feb 2017, Loic Dachary wrote:
> I get the expected behavior for replica 1 (which is what
> CRUSH.straw2_reweight does). The movement between buckets observed
> below is for replica 2.
Oh, right, now I remember. The movement for the second replica is
unavoidable (as far as I can see). For the second replica, sometimes we
end up picking a dup (the same thing we got for the first
replica) and trying again; any change in the behavior of the first choice
may mean that we have more or less "second tries." Although any given try
will behave as we like (only moving to or from the reweighted item),
adding new tries will pick uniformly. In your example below, I think all
of the second replicas that moved to osds 0-9 were objects that originally
picked a dup for the second try and, once 10 was added, did not--because
the first replica was now on the new osd 10.
sage
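The dup-retry effect described above can be reproduced with a toy model (equal weights, a SHA-256 draw standing in for the CRUSH hash; this is a sketch of the mechanism, not the mapper code):

```python
import hashlib
import math

def straw(x, item, r):
    """Deterministic stand-in for CRUSH's per-item hash draw at round r."""
    h = hashlib.sha256(f"{x}/{item}/{r}".encode()).digest()
    u = (int.from_bytes(h[:8], "big") + 1) / 2.0 ** 64
    return math.log(u)  # equal weights, so no division by weight needed

def choose(x, items, r):
    return max(items, key=lambda i: straw(x, i, r))

def place_two(x, items):
    """First replica from round 0; the second replica retries on a dup."""
    first = choose(x, items, 0)
    r = 1
    second = choose(x, items, r)
    while second == first:
        r += 1
        second = choose(x, items, r)
    return first, second

old, new = list(range(10)), list(range(11))
moved = []
for x in range(20000):
    s_old = place_two(x, old)[1]
    f_new, s_new = place_two(x, new)
    if s_new != s_old and s_new != 10:  # second replica moved between old devices
        moved.append((x, f_new))
assert moved                             # such movement does happen, and...
assert all(f == 10 for _, f in moved)    # ...only when the first pick moved to 10
```

The second assertion matches the explanation above: every second replica that moved between pre-existing devices belongs to an input whose first replica moved to the new device, changing the outcome of the dup retries.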
> 00 01 02 03 04 05 06 07 08 09 10
> 00: 0 0 0 0 0 0 0 0 0 0 927
> 01: 0 0 0 0 0 0 0 0 0 0 904
> 02: 0 0 0 0 0 0 0 0 0 0 928
> 03: 0 0 0 0 0 0 0 0 0 0 886
> 04: 0 0 0 0 0 0 0 0 0 0 927
> 05: 0 0 0 0 0 0 0 0 0 0 927
> 06: 0 0 0 0 0 0 0 0 0 0 930
> 07: 0 0 0 0 0 0 0 0 0 0 842
> 08: 0 0 0 0 0 0 0 0 0 0 943
> 09: 0 0 0 0 0 0 0 0 0 0 904
> 10: 0 0 0 0 0 0 0 0 0 0 0
> before: 10149 10066 9893 9955 10030 10025 9895 10013 10008 9966 0
> after: 9222 9162 8965 9069 9103 9098 8965 9171 9065 9062 9118
>
>
> On 02/13/2017 09:18 PM, Loic Dachary wrote:
> >
> >
> > On 02/13/2017 08:16 PM, Sage Weil wrote:
> >> On Mon, 13 Feb 2017, Loic Dachary wrote:
> >>> Hi Sage,
> >>>
> >>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
> >>>
> >>> 00 01 02 03 04 05 06 07 08 09 10
> >>> 00: 0 14 17 14 19 23 13 22 21 20 1800
> >>> 01: 12 0 11 13 19 19 15 10 16 17 1841
> >>> 02: 17 27 0 17 15 15 13 19 18 11 1813
> >>> 03: 14 17 15 0 23 11 20 15 23 17 1792
> >>> 04: 14 18 16 25 0 27 13 8 15 16 1771
> >>> 05: 19 16 22 25 13 0 9 19 21 21 1813
> >>> 06: 18 15 21 17 10 18 0 10 18 11 1873
> >>> 07: 13 17 22 13 16 17 14 0 25 12 1719
> >>> 08: 23 20 16 17 19 18 11 12 0 18 1830
> >>> 09: 14 20 15 17 12 16 17 11 13 0 1828
> >>> 10: 0 0 0 0 0 0 0 0 0 0 0
> >>>
> >>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
> >>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
> >>>
> >>>
> >>> Each line shows how many objects moved from a given disk to the others
> >>> after disk 10 was added. Most objects go to the new disk and around 1%
> >>> go to each of the other disks. The before and after lines show how many objects
> >>> are mapped to each disk. They all have the same weight and it's using
> >>> replica 2 and straw2. Does that look right ?
> >>
> >> Hmm, that doesn't look right. This is what the CRUSH.straw2_reweight unit
> >> test is there to validate: that data only moves to or from the device whose
> >> weight changed.
> >
> > In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops on all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get though...
> >
> >> It also follows from the straw2 algorithm itself: each possible choice
> >> gets a 'straw' length derived only from that item's weight (and other
> >> fixed factors, like the item id and the bucket id), and we select the max
> >> across all items. Two devices whose weights didn't change will have the
> >> same straw lengths, and the max between them will not change. It's only
> >> possible that the changed item's straw length changed and wasn't max and
> >> now is (got longer) or was max and now isn't (got shorter).
> >
> > That's a crystal clear explanation, cool :-)
> >
> > Cheers
> >
> >> sage
> >>
> >>
> >>>
> >>> Cheers
> >>>
> >>> On 02/13/2017 03:21 PM, Sage Weil wrote:
> >>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Dan van der Ster reached out to colleagues and friends and Pedro
> >>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
> >>>>> problem and a tentative solution which he described at :
> >>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>>>>
> >>>>> Unless I'm reading the document incorrectly (very possible ;) it also
> >>>>> means that the probability of each disk needs to take into account the
> >>>>> weight of all disks. Which means that whenever a disk is added / removed
> >>>>> or its weight is changed, this has an impact on the probability of all
> >>>>> disks in the cluster and objects are likely to move everywhere. Am I
> >>>>> mistaken ?
> >>>>
> >>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
> >>>> the normal straw2 case, adding or removing a disk also changes the
> >>>> probabilities for everything else (e.g., removing one out of 10 identical
> >>>> disks changes the probability from 1/10 to 1/9). The key property that
> >>>> straw2 *is* able to handle is that as long as the relative probabilities
> >>>> between two unmodified disks do not change, then straw2 will avoid
> >>>> moving any objects between them (i.e., all data movement is to or from
> >>>> the disk that is reweighted).
> >>>>
> >>>> sage
> >>>>
> >>>>
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
> >>>>>> This is a longstanding bug,
> >>>>>>
> >>>>>> http://tracker.ceph.com/issues/15653
> >>>>>>
> >>>>>> that causes low-weighted devices to get more data than they should. Loic's
> >>>>>> recent activity resurrected discussion on the original PR
> >>>>>>
> >>>>>> https://github.com/ceph/ceph/pull/10218
> >>>>>>
> >>>>>> but since it's closed and almost nobody will see it I'm moving the
> >>>>>> discussion here.
> >>>>>>
> >>>>>> The main news is that I have a simple adjustment for the weights that
> >>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
> >>>>>> pretty simple, although as with most probabilities it tends to make my
> >>>>>> brain hurt.
> >>>>>>
> >>>>>> The idea is that, on the second round, the original weight for the small
> >>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
> >>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
> >>>>>> given b) is P(a && b) / P(b),
> >>>>>>
> >>>>>> P(pick small | first pick not small)
> >>>>>> = P(pick small && first pick not small) / P(first pick not small)
> >>>>>>
> >>>>>> The last term is easy to calculate,
> >>>>>>
> >>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
> >>>>>>
> >>>>>> and the && term is the distribution we're trying to produce. For example,
> >>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
> >>>>>> their second replica be the small OSD. So
> >>>>>>
> >>>>>> P(pick small && first pick not small) = small_weight / total_weight
> >>>>>>
> >>>>>> Putting those together,
> >>>>>>
> >>>>>> P(pick small | first pick not small)
> >>>>>> = P(pick small && first pick not small) / P(first pick not small)
> >>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >>>>>> = small_weight / (total_weight - small_weight)
> >>>>>>
> >>>>>> That is, on the second round, we should adjust the weights by the above so
> >>>>>> that we get the right distribution of second choices. It turns out it
> >>>>>> works to adjust *all* weights like this to get the conditional probability
> >>>>>> that they weren't already chosen.
> >>>>>>
> >>>>>> I have a branch that hacks this into straw2 and it appears to work
> >>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
> >>>>>> current code, you get
> >>>>>>
> >>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> >>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> >>>>>> device 0: 19765965 [9899364,9866601]
> >>>>>> device 1: 19768033 [9899444,9868589]
> >>>>>> device 2: 19769938 [9901770,9868168]
> >>>>>> device 3: 19766918 [9898851,9868067]
> >>>>>> device 6: 929148 [400572,528576]
> >>>>>>
> >>>>>> which is very close for the first replica (primary), but way off for the
> >>>>>> second. With my hacky change,
> >>>>>>
> >>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
> >>>>>> device 0: 19797315 [9899364,9897951]
> >>>>>> device 1: 19799199 [9899444,9899755]
> >>>>>> device 2: 19801016 [9901770,9899246]
> >>>>>> device 3: 19797906 [9898851,9899055]
> >>>>>> device 6: 804566 [400572,403994]
> >>>>>>
> >>>>>> which is quite close, but still skewing slightly high (by a bit less than
> >>>>>> 1%).
> >>>>>>
> >>>>>> Next steps:
> >>>>>>
> >>>>>> 1- generalize this for >2 replicas
> >>>>>> 2- figure out why it skews high
> >>>>>> 3- make this work for multi-level hierarchical descent
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>>
> >>>
> >>> --
> >>> Loïc Dachary, Artisan Logiciel Libre
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
* Re: crush multipick anomaly
2017-02-13 21:15 ` Sage Weil
@ 2017-02-13 21:19 ` Gregory Farnum
2017-02-13 21:26 ` Sage Weil
2017-02-13 21:43 ` Loic Dachary
1 sibling, 1 reply; 70+ messages in thread
From: Gregory Farnum @ 2017-02-13 21:19 UTC (permalink / raw)
To: Sage Weil; +Cc: Loic Dachary, ceph-devel
On Mon, Feb 13, 2017 at 1:15 PM, Sage Weil <sweil@redhat.com> wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> I get the expected behavior for replica 1 (which is what
>> CRUSH.straw2_reweight does). The movement between buckets observed
>> below is for replica 2.
>
> Oh, right, now I remember. The movement for the second replica is
> unavoidable (as far as I can see). For the second replica, sometimes we
> end up picking a dup (the same thing we got for the first
> replica) and trying again; any change in the behavior of the first choice
> may mean that we have more or less "second tries." Although any given try
> will behave as we like (only moving to or from the reweighted item),
> adding new tries will pick uniformly. In your example below, I think all
> of the second replicas that moved to osds 0-9 were objects that originally
> picked a dup for the second try and, once 10 was added, did not--because
> the first replica was now on the new osd 10.
Just to be clear, that's within a bucket, right?
Because obviously changing bucket weights in the CRUSH hierarchy will
move new data to them, not all of which ends up on the new disk.
-Greg
* Re: crush multipick anomaly
2017-02-13 21:19 ` Gregory Farnum
@ 2017-02-13 21:26 ` Sage Weil
0 siblings, 0 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-13 21:26 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Loic Dachary, ceph-devel
On Mon, 13 Feb 2017, Gregory Farnum wrote:
> On Mon, Feb 13, 2017 at 1:15 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Mon, 13 Feb 2017, Loic Dachary wrote:
> >> I get the expected behavior for replica 1 (which is what
> >> CRUSH.straw2_reweight does). The movement between buckets observed
> >> below is for replica 2.
> >
> > Oh, right, now I remember. The movement for the second replica is
> > unavoidable (as far as I can see). For the second replica, sometimes we
> > end up picking a dup (the same thing we got for the first
> > replica) and trying again; any change in the behavior of the first choice
> > may mean that we have more or less "second tries." Although any given try
> > will behave as we like (only moving to or from the reweighted item),
> > adding new tries will pick uniformly. In your example below, I think all
> > of the second replicas that moved to osds 0-9 were objects that originally
> > picked a dup for the second try and, once 10 was added, did not--because
> > the first replica was now on the new osd 10.
>
> Just to be clear, that's within a bucket, right?
Right, within a (straw2) bucket.
> Because obviously changing bucket weights in the CRUSH hierarchy will
> move new data to them, not all of which ends up on the new disk.
Yep!
sage
* Re: crush multipick anomaly
2017-02-13 21:15 ` Sage Weil
2017-02-13 21:19 ` Gregory Farnum
@ 2017-02-13 21:43 ` Loic Dachary
1 sibling, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 21:43 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On 02/13/2017 10:15 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> I get the expected behavior for replica 1 (which is what
>> CRUSH.straw2_reweight does). The movement between buckets observed
>> below is for replica 2.
>
> Oh, right, now I remember. The movement for the second replica is
> unavoidable (as far as I can see). For the second replica, sometimes we
> end up picking a dup (the same thing we got for the first
> replica) and trying again; any change in the behavior of the first choice
> may mean that we have more or less "second tries." Although any given try
> will behave as we like (only moving to or from the reweighted item),
> adding new tries will pick uniformly. In your example below, I think all
> of the second replicas that moved to osds 0-9 were objects that originally
> picked a dup for the second try and, once 10 was added, did not--because
> the first replica was now on the new osd 10.
So this is another manifestation of the multipick anomaly ?
> sage
>
>> 00 01 02 03 04 05 06 07 08 09 10
>> 00: 0 0 0 0 0 0 0 0 0 0 927
>> 01: 0 0 0 0 0 0 0 0 0 0 904
>> 02: 0 0 0 0 0 0 0 0 0 0 928
>> 03: 0 0 0 0 0 0 0 0 0 0 886
>> 04: 0 0 0 0 0 0 0 0 0 0 927
>> 05: 0 0 0 0 0 0 0 0 0 0 927
>> 06: 0 0 0 0 0 0 0 0 0 0 930
>> 07: 0 0 0 0 0 0 0 0 0 0 842
>> 08: 0 0 0 0 0 0 0 0 0 0 943
>> 09: 0 0 0 0 0 0 0 0 0 0 904
>> 10: 0 0 0 0 0 0 0 0 0 0 0
>> before: 10149 10066 9893 9955 10030 10025 9895 10013 10008 9966 0
>> after: 9222 9162 8965 9069 9103 9098 8965 9171 9065 9062 9118
>>
>>
>> On 02/13/2017 09:18 PM, Loic Dachary wrote:
>>>
>>>
>>> On 02/13/2017 08:16 PM, Sage Weil wrote:
>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>>>
>>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>>
>>>>>
>>>>> Each line shows how many objects moved from a given disk to the others
>>>>> after disk 10 was added. Most objects go to the new disk and around 1%
>>>>> go to each of the other disks. The before and after lines show how many objects
>>>>> are mapped to each disk. They all have the same weight and it's using
>>>>> replica 2 and straw2. Does that look right ?
>>>>
>>>> Hmm, that doesn't look right. This is what the CRUSH.straw2_reweight unit
> >>>> test is there to validate: that data only moves to or from the device whose
>>>> weight changed.
>>>
>>> In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops on all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get though...
>>>
>>>> It also follows from the straw2 algorithm itself: each possible choice
>>>> gets a 'straw' length derived only from that item's weight (and other
>>>> fixed factors, like the item id and the bucket id), and we select the max
>>>> across all items. Two devices whose weights didn't change will have the
>>>> same straw lengths, and the max between them will not change. It's only
>>>> possible that the changed item's straw length changed and wasn't max and
>>>> now is (got longer) or was max and now isn't (got shorter).
>>>
>>> That's a crystal clear explanation, cool :-)
>>>
>>> Cheers
>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 02/13/2017 03:21 PM, Sage Weil wrote:
>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>> problem and a tentative solution which he described at :
>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>
>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
> >>>>>>> means that the probability of each disk needs to take into account the
>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>> mistaken ?
>>>>>>
>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
> >>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>> the disk that is reweighted).
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
> >>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD. So
>>>>>>>>
>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>
> >>>>>>>> That is, on the second round, we should adjust the weights by the above so
> >>>>>>>> that we get the right distribution of second choices. It turns out it
> >>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>> current code, you get
>>>>>>>>
>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>
>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>> second. With my hacky change,
>>>>>>>>
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>
> >>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>> 1%).
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>
>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>> 2- figure out why it skews high
>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-13 14:21 ` Sage Weil
2017-02-13 18:50 ` Loic Dachary
@ 2017-02-16 22:04 ` Pedro López-Adeva
2017-02-22 7:52 ` Loic Dachary
1 sibling, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-16 22:04 UTC (permalink / raw)
To: Sage Weil; +Cc: Loic Dachary, ceph-devel
I have updated the algorithm to handle an arbitrary number of replicas
and arbitrary constraints.
Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
(Note: GitHub's rendering of the notebook and the PDF is quite
deficient, I recommend downloading/cloning)
In the following by policy I mean the concrete set of probabilities of
selecting the first replica, the second replica, etc...
In practical terms there are several problems:
- It's not practical for a high number of disks or replicas.
Possible solution: approximate summation over all possible disk
selections with a Monte Carlo method.
The algorithm would be: we start with a candidate solution, we run a
simulation and based on the results
we update the probabilities. Repeat until we are happy with the result.
Other solution: cluster similar disks together.
- Since it's a non-linear optimization problem I'm not sure right now
about its convergence properties.
Does it converge to a global optimum? How fast does it converge?
Possible solution: the algorithm always converges, but it can converge
to a locally optimum policy. I see
no escape except by carefully designing the policy. All solutions to
the problem are going to be non-linear
since we must condition current probabilities on previous disk selections.
- Although it can handle arbitrary constraints it does so by rejecting
disk selections that violate at least one constraint.
This means that for bad policies it can spend all the time rejecting
invalid disk selection candidates.
Possible solution: the policy cannot be designed independently of the
constraints. I don't know which constraints are typical use cases, but
having a look should be the first step. The constraints must be an
input to the policy.
I hope it's of some use. Quite frankly I'm not a ceph user, I just
found the problem an interesting puzzle.
Anyway I will try to have a look at the CRUSH paper this weekend.
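Roughly, the Monte Carlo loop I have in mind could look like this (an illustrative sketch only, not the notebook's code; the multiplicative update rule is a guess, and the [99 99 99 99 4] weights are just Sage's test bucket):

```python
import random

def simulate(weights, n_samples=50_000, n_rep=2):
    """Estimate per-disk placement frequencies: draw n_rep distinct
    disks per object, rejecting disks already chosen for that object."""
    counts = [0] * len(weights)
    for _ in range(n_samples):
        chosen = []
        while len(chosen) < n_rep:
            i = random.choices(range(len(weights)), weights=weights)[0]
            if i not in chosen:
                chosen.append(i)
                counts[i] += 1
    total = n_samples * n_rep
    return [c / total for c in counts]

def update(weights, target, lr=0.5):
    """One Monte Carlo step: simulate, then nudge each sampling weight
    toward its target share of placements."""
    freq = simulate(weights)
    return [max(w * (1 + lr * (t - f) / t), 1e-9)
            for w, t, f in zip(weights, target, freq)]

# Sage's test bucket: four big disks and one small one
raw = [99, 99, 99, 99, 4]
target = [w / sum(raw) for w in raw]
weights = list(target)
for _ in range(5):              # repeat until happy with the result
    weights = update(weights, target)
```

After a few rounds the small disk's sampling weight drops below its nominal 1% share, compensating for the extra second-replica picks the rejection loop gives it.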
2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends and Pedro
>> López-Adeva Fernández-Layos came up with a well written analysis of the
>> problem and a tentative solution which he described at :
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also
>> means that the probability of each disk needs to take in account the
>> weight of all disks. Which means that whenever a disk is added / removed
>> or its weight is changed, this has an impact on the probability of all
>> disks in the cluster and objects are likely to move everywhere. Am I
>> mistaken ?
>
> Maybe (I haven't looked closely at the above yet). But for comparison, in
> the normal straw2 case, adding or removing a disk also changes the
> probabilities for everything else (e.g., removing one out of 10 identical
> disks changes the probability from 1/10 to 1/9). The key property that
> straw2 *is* able to handle is that as long as the relative probabilities
> between two unmodified disks does not change, then straw2 will avoid
> moving any objects between them (i.e., all data movement is to or from
> the disk that is reweighted).
>
> sage
>
>
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> > This is a longstanding bug,
>> >
>> > http://tracker.ceph.com/issues/15653
>> >
>> > that causes low-weighted devices to get more data than they should. Loic's
>> > recent activity resurrected discussion on the original PR
>> >
>> > https://github.com/ceph/ceph/pull/10218
>> >
>> > but since it's closed and almost nobody will see it I'm moving the
>> > discussion here.
>> >
>> > The main news is that I have a simple adjustment for the weights that
>> > works (almost perfectly) for the 2nd round of placements. The solution is
>> > pretty simple, although as with most probabilities it tends to make my
>> > brain hurt.
>> >
>> > The idea is that, on the second round, the original weight for the small
>> > OSD (call it P(pick small)) isn't what we should use. Instead, we want
>> > P(pick small | first pick not small). Since P(a|b) (the probability of a
>> > given b) is P(a && b) / P(b),
>> >
>> > P(pick small | first pick not small)
>> > = P(pick small && first pick not small) / P(first pick not small)
>> >
>> > The last term is easy to calculate,
>> >
>> > P(first pick not small) = (total_weight - small_weight) / total_weight
>> >
>> > and the && term is the distribution we're trying to produce. For example,
>> > if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> > their second replica be the small OSD. So
>> >
>> > P(pick small && first pick not small) = small_weight / total_weight
>> >
>> > Putting those together,
>> >
>> > P(pick small | first pick not small)
>> > = P(pick small && first pick not small) / P(first pick not small)
>> > = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>> > = small_weight / (total_weight - small_weight)
>> >
>> > That is, on the second round, we should adjust the weights by the above so
>> > that we get the right distribution of second choices. It turns out it
>> > works to adjust *all* weights like this to get the conditional probability
>> > that they weren't already chosen.
>> >
>> > I have a branch that hacks this into straw2 and it appears to work
>> > properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>> > current code, you get
>> >
>> > $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>> > rule 0 (data), x = 0..40000000, numrep = 2..2
>> > rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>> > device 0: 19765965 [9899364,9866601]
>> > device 1: 19768033 [9899444,9868589]
>> > device 2: 19769938 [9901770,9868168]
>> > device 3: 19766918 [9898851,9868067]
>> > device 6: 929148 [400572,528576]
>> >
>> > which is very close for the first replica (primary), but way off for the
>> > second. With my hacky change,
>> >
>> > rule 0 (data), x = 0..40000000, numrep = 2..2
>> > rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>> > device 0: 19797315 [9899364,9897951]
>> > device 1: 19799199 [9899444,9899755]
>> > device 2: 19801016 [9901770,9899246]
>> > device 3: 19797906 [9898851,9899055]
>> > device 6: 804566 [400572,403994]
>> >
>> > which is quite close, but still skewing slightly high (by a bit less than
>> > 1%).
>> >
>> > Next steps:
>> >
>> > 1- generalize this for >2 replicas
>> > 2- figure out why it skews high
>> > 3- make this work for multi-level hierarchical descent
>> >
>> > sage
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>> >
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-13 14:53 ` Gregory Farnum
@ 2017-02-20 8:47 ` Loic Dachary
2017-02-20 17:32 ` Gregory Farnum
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-20 8:47 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel
On 02/13/2017 03:53 PM, Gregory Farnum wrote:
> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take in account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?
>
> Keep in mind that in the math presented, "all disks" for our purposes
> really means "all items within a CRUSH bucket" (at least, best I can
> tell). So if you reweight a disk, you have to recalculate weights
> within its bucket and within each parent bucket, but each bucket has a
> bounded size N so the calculation should remain feasible. I didn't
> step through the more complicated math at the end but it made
> intuitive sense as far as I went.
When crush chooses the second replica it ensures it does not land on the same host, rack etc. depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf versions) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host was selected, the probabilities of each disk should not need to be modified because there will never be a rejection at that level (i.e. no conditional probability).
If the failure domain is the host I think the crush map should be something like:
root:
host1:
disk1
disk2
host2:
disk3
disk4
host3:
disk5
disk6
Introducing racks such as in:
root:
rack0:
host1:
disk1
disk2
host2:
disk3
disk4
rack1:
host3:
disk5
disk6
Is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder. Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to.
Does that make sense?
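As a sanity check of the w / (total - w) adjustment quoted below, a small simulation over the failure-domain buckets is enough (a sketch only; it reuses the [99 99 99 99 4] test weights and models the rejection by simply zeroing the first pick):

```python
import random

def pick_two(weights, adjust=False):
    """Pick two distinct buckets. The second round optionally reweights
    every bucket by w / (total - w), the conditional probability that
    it is picked given that the first pick was some other bucket."""
    idx = range(len(weights))
    first = random.choices(idx, weights=weights)[0]
    total = sum(weights)
    w2 = [w / (total - w) for w in weights] if adjust else list(weights)
    w2[first] = 0.0                      # rejection: no duplicate pick
    second = random.choices(idx, weights=w2)[0]
    return first, second

def second_replica_share(weights, adjust, n=100_000):
    """Fraction of second replicas landing on each bucket."""
    counts = [0] * len(weights)
    for _ in range(n):
        counts[pick_two(weights, adjust)[1]] += 1
    return [c / n for c in counts]

weights = [99, 99, 99, 99, 4]            # Sage's test bucket
plain = second_replica_share(weights, adjust=False)
fixed = second_replica_share(weights, adjust=True)
# plain[4] lands well above the ideal 1% share; fixed[4] sits close to it
```

With the plain weights the small bucket gets roughly 30% more second replicas than its weight warrants; with the adjusted weights it lands within noise of 1%, matching the numbers Sage reported.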
> -Greg
>
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> This is a longstanding bug,
>>>
>>> http://tracker.ceph.com/issues/15653
>>>
>>> that causes low-weighted devices to get more data than they should. Loic's
>>> recent activity resurrected discussion on the original PR
>>>
>>> https://github.com/ceph/ceph/pull/10218
>>>
>>> but since it's closed and almost nobody will see it I'm moving the
>>> discussion here.
>>>
>>> The main news is that I have a simple adjustment for the weights that
>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>> pretty simple, although as with most probabilities it tends to make my
>>> brain hurt.
>>>
>>> The idea is that, on the second round, the original weight for the small
>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>> given b) is P(a && b) / P(b),
>>>
>>> P(pick small | first pick not small)
>>> = P(pick small && first pick not small) / P(first pick not small)
>>>
>>> The last term is easy to calculate,
>>>
>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>
>>> and the && term is the distribution we're trying to produce. For example,
>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>> their second replica be the small OSD. So
>>>
>>> P(pick small && first pick not small) = small_weight / total_weight
>>>
>>> Putting those together,
>>>
>>> P(pick small | first pick not small)
>>> = P(pick small && first pick not small) / P(first pick not small)
>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>> = small_weight / (total_weight - small_weight)
>>>
>>> That is, on the second round, we should adjust the weights by the above so
>>> that we get the right distribution of second choices. It turns out it
>>> works to adjust *all* weights like this to get the conditional probability
>>> that they weren't already chosen.
>>>
>>> I have a branch that hacks this into straw2 and it appears to work
>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>> current code, you get
>>>
>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>> device 0: 19765965 [9899364,9866601]
>>> device 1: 19768033 [9899444,9868589]
>>> device 2: 19769938 [9901770,9868168]
>>> device 3: 19766918 [9898851,9868067]
>>> device 6: 929148 [400572,528576]
>>>
>>> which is very close for the first replica (primary), but way off for the
>>> second. With my hacky change,
>>>
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>> device 0: 19797315 [9899364,9897951]
>>> device 1: 19799199 [9899444,9899755]
>>> device 2: 19801016 [9901770,9899246]
>>> device 3: 19797906 [9898851,9899055]
>>> device 6: 804566 [400572,403994]
>>>
>>> which is quite close, but still skewing slightly high (by a bit less than
>>> 1%).
>>>
>>> Next steps:
>>>
>>> 1- generalize this for >2 replicas
>>> 2- figure out why it skews high
>>> 3- make this work for multi-level hierarchical descent
>>>
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-20 8:47 ` Loic Dachary
@ 2017-02-20 17:32 ` Gregory Farnum
2017-02-20 19:31 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Gregory Farnum @ 2017-02-20 17:32 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
On Mon, Feb 20, 2017 at 12:47 AM, Loic Dachary <loic@dachary.org> wrote:
>
>
> On 02/13/2017 03:53 PM, Gregory Farnum wrote:
>> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi,
>>>
>>> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>
>>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take in account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?
>>
>> Keep in mind that in the math presented, "all disks" for our purposes
>> really means "all items within a CRUSH bucket" (at least, best I can
>> tell). So if you reweight a disk, you have to recalculate weights
>> within its bucket and within each parent bucket, but each bucket has a
>> bounded size N so the calculation should remain feasible. I didn't
>> step through the more complicated math at the end but it made
>> intuitive sense as far as I went.
>
> When crush chooses the second replica it ensures it does not land on the same host, rack etc. depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf versions) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host was selected, the probabilities of each disk should not need to be modified because there will never be a rejection at that level (i.e. no conditional probability).
Well, you'd have changed the number of disks, so you'd need to
recalculate within the host that got a new disk added. And then you'd
need to recalculate the host and its peer buckets, and if it was in a
rack then the rack and its peer buckets, and on up the chain.
>
> If the failure domain is the host I think the crush map should be something like:
>
> root:
> host1:
> disk1
> disk2
> host2:
> disk3
> disk4
> host3:
> disk5
> disk6
>
> Introducing racks such as in:
>
> root:
> rack0:
> host1:
> disk1
> disk2
> host2:
> disk3
> disk4
> rack1:
> host3:
> disk5
> disk6
>
> Is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder.
Well, there's not much point if you're replicating across hosts, since
the rack layer is very unbalanced here. But that's essentially a
misconfiguration which is going to cause problems with any CRUSH-like
system.
> Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to.
I think maybe you're saying what I did before? "All disks" for our
purposes really means "all items within a CRUSH bucket". The racks are
CRUSH items within the root bucket.
-Greg
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-20 17:32 ` Gregory Farnum
@ 2017-02-20 19:31 ` Loic Dachary
0 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-20 19:31 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel
On 02/20/2017 06:32 PM, Gregory Farnum wrote:
> On Mon, Feb 20, 2017 at 12:47 AM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>> On 02/13/2017 03:53 PM, Gregory Farnum wrote:
>>> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take in account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?
>>>
>>> Keep in mind that in the math presented, "all disks" for our purposes
>>> really means "all items within a CRUSH bucket" (at least, best I can
>>> tell). So if you reweight a disk, you have to recalculate weights
>>> within its bucket and within each parent bucket, but each bucket has a
>>> bounded size N so the calculation should remain feasible. I didn't
>>> step through the more complicated math at the end but it made
>>> intuitive sense as far as I went.
>>
>> When crush chooses the second replica it ensures it does not land on the same host, rack etc. depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf versions) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host was selected, the probabilities of each disk should not need to be modified because there will never be a rejection at that level (i.e. no conditional probability).
>
> Well, you'd have changed the number of disks, so you'd need to
> recalculate within the host that got a new disk added. And then you'd
> need to recalculate the host and its peer buckets, and if it was in a
> rack then the rack and its peer buckets, and on up the chain.
I meant to say that you do not need to change the weight of the disks within other hosts. But you need to change the weight of all other hosts, not just the host in which a new disk was inserted/removed.
>
>>
>> If the failure domain is the host I think the crush map should be something like:
>>
>> root:
>> host1:
>> disk1
>> disk2
>> host2:
>> disk3
>> disk4
>> host3:
>> disk5
>> disk6
>>
>> Introducing racks such as in:
>>
>> root:
>> rack0:
>> host1:
>> disk1
>> disk2
>> host2:
>> disk3
>> disk4
>> rack1:
>> host3:
>> disk5
>> disk6
>>
>> Is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder.
>
> Well, there's not much point if you're replicating across hosts, since
> the rack layer is very unbalanced here. But that's essentially a
> misconfiguration which is going to cause problems with any CRUSH-like
> system.
>
>
>> Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to.
>
> I think maybe you're saying what I did before? "All disks" for our
> purposes really means "all items within a CRUSH bucket". The racks are
> CRUSH items within the root bucket.
> -Greg
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-16 22:04 ` Pedro López-Adeva
@ 2017-02-22 7:52 ` Loic Dachary
2017-02-22 11:26 ` Pedro López-Adeva
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-22 7:52 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
> I have updated the algorithm to handle an arbitrary number of replicas
> and arbitrary constraints.
>
> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far they are from the best probabilities. That's the loss function[1].
You implemented an abstract Python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the Jacobian[3] helps).
This is part one of your document; in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
[1] https://en.wikipedia.org/wiki/Loss_function
[2] https://en.wikipedia.org/wiki/Gradient
[3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
From the above you can hopefully see how far off my understanding is. And I have one question below.
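In code, the optimization loop described above might look roughly like this (a toy sketch only: the softmax parametrization and cross-entropy loss are stand-ins I picked for the example, not the notebook's actual loss or its Jacobian-based gradient):

```python
import math

def softmax(z):
    """Map unconstrained parameters z to a probability distribution."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [x / s for x in e]

# desired outcome: probabilities proportional to the disk weights
weights = [99, 99, 99, 99, 4]
target = [w / sum(weights) for w in weights]

# gradient descent: for L(z) = -sum_i t_i * log p_i(z) with p = softmax(z),
# the gradient is simply p - t, so each step moves z against that direction
z = [0.0] * len(weights)                 # start from the uniform policy
for _ in range(3000):
    p = softmax(z)
    z = [zi - 2.0 * (pi - ti) for zi, pi, ti in zip(z, p, target)]

p = softmax(z)                           # now very close to target
```

The loss measures how far the current probabilities are from the desired ones, and following the negative gradient drives it down step by step, exactly the mechanism of the notebook, just with a much simpler loss.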
> (Note: GitHub's rendering of the notebook and the PDF is quite
> poor; I recommend downloading/cloning them.)
>
>
> In the following by policy I mean the concrete set of probabilities of
> selecting the first replica, the second replica, etc...
> In practical terms there are several problems:
>
> - It's not practical for a high number of disks or replicas.
>
> Possible solution: approximate summation over all possible disk
> selections with a Monte Carlo method.
> the algorithm would be: we start with a candidate solution, we run a
> simulation and based on the results
> we update the probabilities. Repeat until we are happy with the result.
>
> Other solution: cluster similar disks together.
>
> - Since it's a non-linear optimization problem I'm not sure right now
> about its convergence properties.
> Does it converge to a global optimum? How fast does it converge?
>
> Possible solution: the algorithm always converges, but it can converge
> to a locally optimum policy. I see
> no escape except by carefully designing the policy. All solutions to
> the problem are going to be non linear
> since we must condition current probabilities on previous disk selections.
>
> - Although it can handle arbitrary constraints it does so by rejecting
> disk selections that violate at least one constraint.
> This means that for bad policies it can spend all the time rejecting
> invalid disk selection candidates.
>
> Possible solution: the policy cannot be designed independently of the
> constraints. I don't know what constraints
> are typical use cases but having a look should be the first step. The
> constraints must be an input to the policy.
>
>
> I hope it's of some use. Quite frankly I'm not a ceph user, I just
> found the problem an interesting puzzle.
> Anyway I will try to have a look at the CRUSH paper this weekend.
In Sage's paper[1] as well as in the Ceph implementation[2], minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks, and higher probabilities for bigger disks is used.
[1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
[2] https://github.com/ceph/ceph/tree/master/src/crush
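For reference, the straw2 selection rule can be sketched in a few lines: each disk draws an independent straw, which is why all data movement is to or from the reweighted disk (a sketch of the idea, not Ceph's C implementation; disk names and counts are illustrative):

```python
import hashlib
import math

def straw2_choose(obj, disks):
    """disks: name -> weight. Each disk draws a pseudo-random straw
    ln(u) / w from a hash of (object, disk); the largest straw wins.
    A disk's straw ignores all other disks, so changing one weight
    only moves objects to or from that disk."""
    best, best_straw = None, -math.inf
    for name, weight in disks.items():
        h = hashlib.sha256(f"{obj}:{name}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / 2.0**64  # uniform in (0, 1]
        straw = math.log(u) / weight    # <= 0; bigger weight pulls it toward 0
        if straw > best_straw:
            best, best_straw = name, straw
    return best

disks = {f"disk{i:02d}": 1.0 for i in range(10)}
before = {o: straw2_choose(o, disks) for o in range(10_000)}
disks["disk10"] = 1.0                   # add an 11th, equally weighted disk
after = {o: straw2_choose(o, disks) for o in range(10_000)}
moved = [o for o in range(10_000) if before[o] != after[o]]
# every moved object lands on the new disk; about 1/11 of objects move
```

Because the straws of the ten existing disks are unchanged, an object only moves when the new disk's straw beats the previous winner, so it always moves *to* the new disk.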
Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
00 01 02 03 04 05 06 07 08 09 10
00: 0 14 17 14 19 23 13 22 21 20 1800
01: 12 0 11 13 19 19 15 10 16 17 1841
02: 17 27 0 17 15 15 13 19 18 11 1813
03: 14 17 15 0 23 11 20 15 23 17 1792
04: 14 18 16 25 0 27 13 8 15 16 1771
05: 19 16 22 25 13 0 9 19 21 21 1813
06: 18 15 21 17 10 18 0 10 18 11 1873
07: 13 17 22 13 16 17 14 0 25 12 1719
08: 23 20 16 17 19 18 11 12 0 18 1830
09: 14 20 15 17 12 16 17 11 13 0 1828
10: 0 0 0 0 0 0 0 0 0 0 0
before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
[1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which CRUSH does)?
Cheers
>
>
> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>> Hi,
>>>
>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>> problem and a tentative solution which he described at :
>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>
>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>> means that the probability of each disk needs to take in account the
>>> weight of all disks. Which means that whenever a disk is added / removed
>>> or its weight is changed, this has an impact on the probability of all
>>> disks in the cluster and objects are likely to move everywhere. Am I
>>> mistaken ?
>>
>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>> the normal straw2 case, adding or removing a disk also changes the
>> probabilities for everything else (e.g., removing one out of 10 identical
>> disks changes the probability from 1/10 to 1/9). The key property that
>> straw2 *is* able to handle is that as long as the relative probabilities
>> between two unmodified disks does not change, then straw2 will avoid
>> moving any objects between them (i.e., all data movement is to or from
>> the disk that is reweighted).
>>
>> sage
>>
>>
>>>
>>> Cheers
>>>
>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>> This is a longstanding bug,
>>>>
>>>> http://tracker.ceph.com/issues/15653
>>>>
>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>> recent activity resurrected discussion on the original PR
>>>>
>>>> https://github.com/ceph/ceph/pull/10218
>>>>
>>>> but since it's closed and almost nobody will see it I'm moving the
>>>> discussion here.
>>>>
>>>> The main news is that I have a simple adjustment for the weights that
>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>> pretty simple, although as with most probabilities it tends to make my
>>>> brain hurt.
>>>>
>>>> The idea is that, on the second round, the original weight for the small
>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>> given b) is P(a && b) / P(b),
>>>>
>>>> P(pick small | first pick not small)
>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>
>>>> The last term is easy to calculate,
>>>>
>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>
>>>> and the && term is the distribution we're trying to produce. For example,
>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>> their second replica be the small OSD. So
>>>>
>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>
>>>> Putting those together,
>>>>
>>>> P(pick small | first pick not small)
>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>> = small_weight / (total_weight - small_weight)
>>>>
>>>> That is, on the second round, we should adjust the weights by the above so
>>>> that we get the right distribution of second choices. It turns out it
>>>> works to adjust *all* weights like this to get the conditional probability
>>>> that they weren't already chosen.
>>>>
>>>> I have a branch that hacks this into straw2 and it appears to work
>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>> current code, you get
>>>>
>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>> device 0: 19765965 [9899364,9866601]
>>>> device 1: 19768033 [9899444,9868589]
>>>> device 2: 19769938 [9901770,9868168]
>>>> device 3: 19766918 [9898851,9868067]
>>>> device 6: 929148 [400572,528576]
>>>>
>>>> which is very close for the first replica (primary), but way off for the
>>>> second. With my hacky change,
>>>>
>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>> device 0: 19797315 [9899364,9897951]
>>>> device 1: 19799199 [9899444,9899755]
>>>> device 2: 19801016 [9901770,9899246]
>>>> device 3: 19797906 [9898851,9899055]
>>>> device 6: 804566 [400572,403994]
>>>>
>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>> 1%).
>>>>
>>>> Next steps:
>>>>
>>>> 1- generalize this for >2 replicas
>>>> 2- figure out why it skews high
>>>> 3- make this work for multi-level hierarchical descent
>>>>
>>>> sage
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-22 7:52 ` Loic Dachary
@ 2017-02-22 11:26 ` Pedro López-Adeva
2017-02-22 11:38 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-22 11:26 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
Hi,
I think your description of my proposed solution is quite good.
I had a first look at Sage's paper but not Ceph's implementation. My
plan is to finish the paper and write a Python implementation that
mimics Ceph's algorithm more closely.
Regarding your question about data movement:
If I understood the paper correctly, what happens right now is that
when device weights change, some devices become overloaded and the
current algorithm tries to correct for that. This behaviour, I think,
is independent of how we compute the weights for each device. My
point is that the current data movement pattern will not be modified.
Could the data movement algorithm be improved? Maybe. I don't know.
Maybe by making the probabilities non-stationary, with the new disk
getting a very high probability at first and decreasing it after each
replica placement until it stabilizes to its final value. But I'm just
guessing, and I really don't know if this can be made to work in a
distributed manner as is currently the case, or how this would fit in
the current architecture. In any case it would be a problem at least
as hard as the current reweighting problem.
So, to summarize, my current plans:
- Have another look at the paper
- Make an implementation in python that imitates the current
algorithm more closely
- Make sure the new reweighting algorithm is fast and gives the desired results
I will give updates here when there are significant changes so
everyone can have a look and suggest improvements.
Cheers,
Pedro.
2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>> I have updated the algorithm to handle an arbitrary number of replicas
>> and arbitrary constraints.
>>
>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>
> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>
> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>
> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>
> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>
> [1] https://en.wikipedia.org/wiki/Loss_function
> [2] https://en.wikipedia.org/wiki/Gradient
> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
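Loic's summary of the loss-plus-gradient approach can be illustrated with a small sketch. This is illustrative only; the test weights, the softmax parameterization and the numerical central-difference gradient are my assumptions, not code from Pedro's notebook. The idea: pick first-round probabilities so that each disk's expected share over two distinct picks matches its weight-proportional target, by descending the gradient of a squared-error loss.

```python
import math

# Sketch of the loss + gradient idea (NOT Pedro's notebook code): find
# first-pick probabilities p so that, when 2 distinct disks are drawn
# without replacement, each disk's expected share matches its weight.

weights = [99.0, 99.0, 99.0, 99.0, 4.0]
total = sum(weights)
target = [w / total for w in weights]      # desired share per disk

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]              # keeps p a valid distribution

def expected_share(p):
    # P(disk i used) = P(first = i) + sum_{j != i} P(first = j) * p_i / (1 - p_j)
    n = len(p)
    share = list(p)
    for j in range(n):
        for i in range(n):
            if i != j:
                share[i] += p[j] * p[i] / (1.0 - p[j])
    return [s / 2.0 for s in share]        # two picks per object

def loss(theta):
    p = softmax(theta)
    return sum((s - t) ** 2 for s, t in zip(expected_share(p), target))

theta = [math.log(t) for t in target]      # start from the naive weights
for _ in range(5000):
    grad = []
    for i in range(len(theta)):            # central-difference gradient
        up = list(theta); up[i] += 1e-6
        dn = list(theta); dn[i] -= 1e-6
        grad.append((loss(up) - loss(dn)) / 2e-6)
    theta = [t - 5.0 * g for t, g in zip(theta, grad)]

p = softmax(theta)
# the weight-4 disk's first-pick probability ends up below its naive 1%
# so that its combined (first + second pick) share lands on the fair 1%
```

With the [99 99 99 99 4] bucket the optimizer has to under-weight the small disk on the first pick, which is exactly the anomaly this thread is about seen from the other direction.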
>
> From the above you can hopefully see how far off my understanding is. And I have one question below.
>
>> (Note: GitHub's rendering of the notebook and the PDF is quite
>> deficient, I recommend downloading/cloning)
>>
>>
>> In the following by policy I mean the concrete set of probabilities of
>> selecting the first replica, the second replica, etc...
>> In practical terms there are several problems:
>>
>> - It's not practical for a high number of disks or replicas.
>>
>> Possible solution: approximate the summation over all possible disk
>> selections with a Monte Carlo method. The algorithm would be: we start
>> with a candidate solution, we run a simulation and, based on the
>> results, we update the probabilities. Repeat until we are happy with
>> the result.
>>
>> Other solution: cluster similar disks together.
>>
>> - Since it's a non-linear optimization problem I'm not sure right now
>> about its convergence properties.
>> Does it converge to a global optimum? How fast does it converge?
>>
>> Possible solution: the algorithm always converges, but it can converge
>> to a locally optimum policy. I see
>> no escape except by carefully designing the policy. All solutions to
>> the problem are going to be non-linear
>> since we must condition current probabilities on previous disk selections.
>>
>> - Although it can handle arbitrary constraints it does so by rejecting
>> disk selections that violate at least one constraint.
>> This means that for bad policies it can spend all the time rejecting
>> invalid disk selection candidates.
>>
>> Possible solution: the policy cannot be designed independently of the
>> constraints. I don't know what constraints
>> are typical use cases but having a look should be the first step. The
>> constraints must be an input to the policy.
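Both the Monte Carlo estimation idea and the rejection cost Pedro describes can be seen in a toy simulation. This is a sketch under my own assumptions: the weights are Sage's [99 99 99 99 4] test bucket from earlier in the thread, and the (deliberately naive) policy just reuses the raw weights on every round, resampling on a constraint violation.

```python
import random

def simulate(weights, n_rep=2, n_objects=100_000, seed=42):
    """Place n_objects with n_rep replicas each, resampling (rejecting)
    whenever a pick lands on an already-chosen disk."""
    rng = random.Random(seed)
    disks = list(range(len(weights)))
    usage = [0] * len(weights)
    rejections = 0
    for _ in range(n_objects):
        chosen = []
        while len(chosen) < n_rep:
            d = rng.choices(disks, weights=weights)[0]
            if d in chosen:
                rejections += 1      # constraint violated: resample
            else:
                chosen.append(d)
        for d in chosen:
            usage[d] += 1
    return usage, rejections

usage, rejections = simulate([99, 99, 99, 99, 4])
small_share = usage[-1] / sum(usage)
# small_share comes out around 1.16% rather than the fair 1%: the
# multipick anomaly; rejections shows the cost of resampling
```

With a worse policy (many heavy constraints) the inner `while` loop is where all the time would go, which is Pedro's point about the policy needing the constraints as an input.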
>>
>>
>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>> found the problem an interesting puzzle.
>> Anyway I will try to have a look at the CRUSH paper this weekend.
>
> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks is used.
>
> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
> [2] https://github.com/ceph/ceph/tree/master/src/crush
>
> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>
> 00 01 02 03 04 05 06 07 08 09 10
> 00: 0 14 17 14 19 23 13 22 21 20 1800
> 01: 12 0 11 13 19 19 15 10 16 17 1841
> 02: 17 27 0 17 15 15 13 19 18 11 1813
> 03: 14 17 15 0 23 11 20 15 23 17 1792
> 04: 14 18 16 25 0 27 13 8 15 16 1771
> 05: 19 16 22 25 13 0 9 19 21 21 1813
> 06: 18 15 21 17 10 18 0 10 18 11 1873
> 07: 13 17 22 13 16 17 14 0 25 12 1719
> 08: 23 20 16 17 19 18 11 12 0 18 1830
> 09: 14 20 15 17 12 16 17 11 13 0 1828
> 10: 0 0 0 0 0 0 0 0 0 0 0
> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>
> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>
> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>
> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>
> Cheers
>
>>
>>
>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>> problem and a tentative solution which he described at :
>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>> means that the probability of each disk needs to take into account the
>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>> or its weight is changed, this has an impact on the probability of all
>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>> mistaken ?
>>>
>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>> the normal straw2 case, adding or removing a disk also changes the
>>> probabilities for everything else (e.g., removing one out of 10 identical
>>> disks changes the probability from 1/10 to 1/9). The key property that
>>> straw2 *is* able to handle is that as long as the relative probabilities
>>> between two unmodified disks do not change, then straw2 will avoid
>>> moving any objects between them (i.e., all data movement is to or from
>>> the disk that is reweighted).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Cheers
>>>>
>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>> This is a longstanding bug,
>>>>>
>>>>> http://tracker.ceph.com/issues/15653
>>>>>
>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>> recent activity resurrected discussion on the original PR
>>>>>
>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>
>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>> discussion here.
>>>>>
>>>>> The main news is that I have a simple adjustment for the weights that
>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>> brain hurt.
>>>>>
>>>>> The idea is that, on the second round, the original weight for the small
>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>> P(pick small | first pick not small)
>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>
>>>>> The last term is easy to calculate,
>>>>>
>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>
>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>> their second replica be the small OSD. So
>>>>>
>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>
>>>>> Putting those together,
>>>>>
>>>>> P(pick small | first pick not small)
>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>> = small_weight / (total_weight - small_weight)
>>>>>
>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>> that we get the right distribution of second choices. It turns out it
>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>> that they weren't already chosen.
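Plugging numbers into the derivation (a sketch, not the straw2 patch itself): with the [99 99 99 99 4] test bucket, the exact marginal probability that the second pick is the small device under the w_i / (total - w_i) adjustment can be computed directly, and it already shows the slight residual high skew Sage observes.

```python
# Exact second-pick marginal under the proposed adjustment: round one
# uses the raw weights; round two uses w_i / (total - w_i) for the
# remaining devices, renormalized.
weights = [99, 99, 99, 99, 4]
total = sum(weights)
adjusted = [w / (total - w) for w in weights]

def p_second(k):
    """P(second pick == k), summing over all possible first picks j != k."""
    p = 0.0
    for j, wj in enumerate(weights):
        if j == k:
            continue
        remaining = sum(a for i, a in enumerate(adjusted) if i != j)
        p += (wj / total) * (adjusted[k] / remaining)
    return p

small = len(weights) - 1       # the weight-4 device
print(p_second(small))         # ~0.010032: close to the ideal
                               # 4/400 = 0.01, but slightly high
```

So even in exact arithmetic the adjusted second round lands just above the target share, consistent with the "skews slightly high" result in the crushtool output below.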
>>>>>
>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>> current code, you get
>>>>>
>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>> device 0: 19765965 [9899364,9866601]
>>>>> device 1: 19768033 [9899444,9868589]
>>>>> device 2: 19769938 [9901770,9868168]
>>>>> device 3: 19766918 [9898851,9868067]
>>>>> device 6: 929148 [400572,528576]
>>>>>
>>>>> which is very close for the first replica (primary), but way off for the
>>>>> second. With my hacky change,
>>>>>
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>> device 0: 19797315 [9899364,9897951]
>>>>> device 1: 19799199 [9899444,9899755]
>>>>> device 2: 19801016 [9901770,9899246]
>>>>> device 3: 19797906 [9898851,9899055]
>>>>> device 6: 804566 [400572,403994]
>>>>>
>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>> 1%).
>>>>>
>>>>> Next steps:
>>>>>
>>>>> 1- generalize this for >2 replicas
>>>>> 2- figure out why it skews high
>>>>> 3- make this work for multi-level hierarchical descent
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-22 11:26 ` Pedro López-Adeva
@ 2017-02-22 11:38 ` Loic Dachary
2017-02-22 11:46 ` Pedro López-Adeva
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-22 11:38 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
> Hi,
>
> I think your description of my proposed solution is quite good.
>
> I had a first look at Sage's paper but not at ceph's implementation. My
> plan is to finish the paper and make an implementation in python that
> mimics ceph's algorithm more closely.
>
> Regarding your question about data movement:
>
> If I understood the paper correctly, what is happening right now is
> that when weights change on the devices, some of them will become
> overloaded and the current algorithm will try to correct for that. But
> this approach, I think, is independent of how we compute the weights
> for each device. My point is that the current data movement pattern
> will not be modified.
>
> Could the data movement algorithm be improved? Maybe. I don't know.
> Maybe by making the probabilities non-stationary, with the new disk
> getting a very high probability at first and decreasing it after each
> replica placement until it stabilizes to its final value. But I'm just
> guessing, and I really don't know if this can be made to work in a
> distributed manner as is currently the case, or how this would fit in
> the current architecture. In any case it would be a problem at least
> as hard as the current reweighting problem.
>
> So, to summarize, my current plans:
>
> - Have another look at the paper
> - Make an implementation in python that imitates the current
> algorithm more closely
What if I provided you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
> - Make sure the new reweighting algorithm is fast and gives the desired results
>
> I will give updates here when there are significant changes so
> everyone can have a look and suggest improvements.
>
> Cheers,
> Pedro.
>
> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>> I have updated the algorithm to handle an arbitrary number of replicas
>>> and arbitrary constraints.
>>>
>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>
>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>>
>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>
>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>
>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>>
>> [1] https://en.wikipedia.org/wiki/Loss_function
>> [2] https://en.wikipedia.org/wiki/Gradient
>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>
>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>
>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>> deficient, I recommend downloading/cloning)
>>>
>>>
>>> In the following by policy I mean the concrete set of probabilities of
>>> selecting the first replica, the second replica, etc...
>>> In practical terms there are several problems:
>>>
>>> - It's not practical for a high number of disks or replicas.
>>>
>>> Possible solution: approximate the summation over all possible disk
>>> selections with a Monte Carlo method. The algorithm would be: we start
>>> with a candidate solution, we run a simulation and, based on the
>>> results, we update the probabilities. Repeat until we are happy with
>>> the result.
>>>
>>> Other solution: cluster similar disks together.
>>>
>>> - Since it's a non-linear optimization problem I'm not sure right now
>>> about its convergence properties.
>>> Does it converge to a global optimum? How fast does it converge?
>>>
>>> Possible solution: the algorithm always converges, but it can converge
>>> to a locally optimum policy. I see
>>> no escape except by carefully designing the policy. All solutions to
>>> the problem are going to be non-linear
>>> since we must condition current probabilities on previous disk selections.
>>>
>>> - Although it can handle arbitrary constraints it does so by rejecting
>>> disk selections that violate at least one constraint.
>>> This means that for bad policies it can spend all the time rejecting
>>> invalid disk selection candidates.
>>>
>>> Possible solution: the policy cannot be designed independently of the
>>> constraints. I don't know what constraints
>>> are typical use cases but having a look should be the first step. The
>>> constraints must be an input to the policy.
>>>
>>>
>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>> found the problem an interesting puzzle.
>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>
>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks is used.
>>
>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>
>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>
>> 00 01 02 03 04 05 06 07 08 09 10
>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>> 10: 0 0 0 0 0 0 0 0 0 0 0
>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>
>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>
>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>
>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>
>> Cheers
>>
>>>
>>>
>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>> Hi,
>>>>>
>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>> problem and a tentative solution which he described at :
>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>
>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>> means that the probability of each disk needs to take into account the
>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>> or its weight is changed, this has an impact on the probability of all
>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>> mistaken ?
>>>>
>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>> the normal straw2 case, adding or removing a disk also changes the
>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>> between two unmodified disks do not change, then straw2 will avoid
>>>> moving any objects between them (i.e., all data movement is to or from
>>>> the disk that is reweighted).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD. So
>>>>>>
>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>> P(pick small | first pick not small)
>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>> current code, you get
>>>>>>
>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>> device 6: 929148 [400572,528576]
>>>>>>
>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>> second. With my hacky change,
>>>>>>
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>> device 6: 804566 [400572,403994]
>>>>>>
>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>> 1%).
>>>>>>
>>>>>> Next steps:
>>>>>>
>>>>>> 1- generalize this for >2 replicas
>>>>>> 2- figure out why it skews high
>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-22 11:38 ` Loic Dachary
@ 2017-02-22 11:46 ` Pedro López-Adeva
2017-02-25 0:38 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-22 11:46 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
That, for validation, would be great. Until weekend I don't think I'm
going to have time to work on this anyway.
2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>
>
> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>> Hi,
>>
>> I think your description of my proposed solution is quite good.
>>
>> I had a first look at Sage's paper but not at ceph's implementation. My
>> plan is to finish the paper and make an implementation in python that
>> mimics ceph's algorithm more closely.
>>
>> Regarding your question about data movement:
>>
>> If I understood the paper correctly, what is happening right now is
>> that when weights change on the devices, some of them will become
>> overloaded and the current algorithm will try to correct for that. But
>> this approach, I think, is independent of how we compute the weights
>> for each device. My point is that the current data movement pattern
>> will not be modified.
>>
>> Could the data movement algorithm be improved? Maybe. I don't know.
>> Maybe by making the probabilities non-stationary, with the new disk
>> getting a very high probability at first and decreasing it after each
>> replica placement until it stabilizes to its final value. But I'm just
>> guessing, and I really don't know if this can be made to work in a
>> distributed manner as is currently the case, or how this would fit in
>> the current architecture. In any case it would be a problem at least
>> as hard as the current reweighting problem.
>>
>> So, to summarize, my current plans:
>>
>> - Have another look at the paper
>> - Make an implementation in python that imitates the current
>> algorithm more closely
>
> What if I provided you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>
>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>
>> I will give updates here when there are significant changes so
>> everyone can have a look and suggest improvements.
>>
>> Cheers,
>> Pedro.
>>
>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> Hi Pedro,
>>>
>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>> and arbitrary constraints.
>>>>
>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>
>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>
>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>
>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>
>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>>>
>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>> [2] https://en.wikipedia.org/wiki/Gradient
>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>>
>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>
>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>> deficient, I recommend downloading/cloning)
>>>>
>>>>
>>>> In the following by policy I mean the concrete set of probabilities of
>>>> selecting the first replica, the second replica, etc...
>>>> In practical terms there are several problems:
>>>>
>>>> - It's not practical for a high number of disks or replicas.
>>>>
>>>> Possible solution: approximate the summation over all possible disk
>>>> selections with a Monte Carlo method. The algorithm would be: we start
>>>> with a candidate solution, we run a simulation and, based on the
>>>> results, we update the probabilities. Repeat until we are happy with
>>>> the result.
>>>>
>>>> Other solution: cluster similar disks together.
>>>>
>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>> about its convergence properties.
>>>> Does it converge to a global optimum? How fast does it converge?
>>>>
>>>> Possible solution: the algorithm always converges, but it can converge
>>>> to a locally optimum policy. I see
>>>> no escape except by carefully designing the policy. All solutions to
>>>> the problem are going to be non linear
>>>> since we must condition current probabilities on previous disk selections.
>>>>
>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>> disk selections that violate at least one constraint.
>>>> This means that for bad policies it can spend all the time rejecting
>>>> invalid disk selection candidates.
>>>>
>>>> Possible solution: the policy cannot be designed independently of the
>>>> constraints. I don't know what constraints
>>>> are typical use cases but having a look should be the first step. The
>>>> constraints must be an input to the policy.
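To make the rejection issue above concrete, here is a small sketch of mine (not Pedro's code) of Monte Carlo estimation under the "no two replicas on the same disk" constraint. It also shows why a naive policy needs correcting: rejection alone pushes the small disk above its target share.

```python
import random

def observed_shares(probs, trials=50_000, replicas=2):
    # Sample replica sets from the policy, reject any set that puts two
    # replicas on the same disk, and tally the share each disk receives.
    counts = [0] * len(probs)
    placed = 0
    for _ in range(trials):
        picks = random.choices(range(len(probs)), weights=probs, k=replicas)
        if len(set(picks)) < replicas:
            continue  # constraint violated: reject and resample
        for d in picks:
            counts[d] += 1
        placed += replicas
    return [c / placed for c in counts]

random.seed(1)
target = [0.225, 0.225, 0.225, 0.225, 0.10]  # four big disks, one small
shares = observed_shares(target)  # naive policy: pick straight from target
# rejection inflates the small disk's share above its 10% target
print(shares[-1])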
>>>>
>>>>
>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>> found the problem an interesting puzzle.
>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>
>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>
>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>
>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>
>>> 00 01 02 03 04 05 06 07 08 09 10
>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>
>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>
>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
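A quick back-of-the-envelope check of these numbers (my own arithmetic, assuming perfectly uniform placement):

```python
objects, replicas, old_disks = 100_000, 2, 10
total_replicas = objects * replicas

per_disk_before = total_replicas / old_disks        # 20_000, matches "before:"
per_disk_after = total_replicas / (old_disks + 1)   # ~18_182, matches "after:"
# the minimum possible movement is exactly the new disk's fair share
minimal_move = per_disk_after
print(per_disk_before, round(per_disk_after))
```

Disk 10 ends up with 18080 objects in the table above, within about 0.6% of the ideal 18182, and the roughly 2000 objects moved between existing disks are the ~1% overhead.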
>>>
>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>
>>> Cheers
>>>
>>>>
>>>>
>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>> López-Adeva Fernández-Layos came up with a well-written analysis of the
>>>>>> problem and a tentative solution which he described at:
>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>
>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>> means that the probability of each disk needs to take into account the
>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>> mistaken ?
>>>>>
>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>> between two unmodified disks does not change, then straw2 will avoid
>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>> the disk that is reweighted).
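This property can be demonstrated with a small weighted-rendezvous sketch in the spirit of straw2 (an approximation I wrote for illustration, not the actual straw2 code, which works with scaled logarithms of hashed values): each (object, disk) pair gets an independent pseudo-random draw biased by the disk's weight, and the highest draw wins. Reweighting one disk only changes that disk's draws, so objects only move to or from it.

```python
import hashlib

def uniform(x, disk):
    # deterministic pseudo-random uniform in (0, 1) per (object, disk) pair
    h = hashlib.sha256(f"{x}:{disk}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2 ** 64 + 2)

def pick(x, weights):
    # weighted rendezvous: draw u^(1/w) per disk, highest draw wins;
    # this selects disk i with probability w_i / sum(w)
    return max(weights, key=lambda d: uniform(x, d) ** (1.0 / weights[d]))

before = {x: pick(x, {0: 1.0, 1: 1.0, 2: 1.0}) for x in range(10_000)}
after = {x: pick(x, {0: 1.0, 1: 1.0, 2: 2.0}) for x in range(10_000)}

moved = [x for x in range(10_000) if before[x] != after[x]]
# every object that moved went TO the disk whose weight increased
assert all(after[x] == 2 for x in moved)
```

Doubling disk 2's weight raises only its own draws, so winners can change only in its favor; the roughly 1/2 - 1/3 = 1/6 of objects that must move all land on disk 2, and nothing shuffles between disks 0 and 1.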
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>> This is a longstanding bug,
>>>>>>>
>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>
>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>
>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>
>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>> discussion here.
>>>>>>>
>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>> brain hurt.
>>>>>>>
>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> P(pick small | first pick not small)
>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>
>>>>>>> The last term is easy to calculate,
>>>>>>>
>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>
>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>> their second replica be the small OSD. So
>>>>>>>
>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>
>>>>>>> Putting those together,
>>>>>>>
>>>>>>> P(pick small | first pick not small)
>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>
>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>> that they weren't already chosen.
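Sage's adjustment is easy to check with a quick simulation (a sketch of mine, not the straw2 branch itself): draw the first replica with the raw weights, then draw the second among the remaining OSDs, either with the raw weights or with the adjusted weights w_i / (total - w_i).

```python
import random

def pick_two(weights, adjust):
    total = sum(weights)
    # first replica: plain weighted choice
    first = random.choices(range(len(weights)), weights=weights)[0]
    # second replica: choose among the remaining OSDs, optionally using
    # the conditional adjustment w_i / (total - w_i)
    rest = [i for i in range(len(weights)) if i != first]
    w2 = [weights[i] / (total - weights[i]) if adjust else weights[i]
          for i in rest]
    return first, random.choices(rest, weights=w2)[0]

weights = [99, 99, 99, 99, 4]  # four big OSDs, one small; target share 1%
random.seed(0)
N = 200_000
for adjust in (False, True):
    small = sum(1 for _ in range(N) if pick_two(weights, adjust)[1] == 4)
    print(adjust, small / N)
```

With the raw weights the small OSD's second-replica share lands near 1.3% instead of the 1% target; with the adjustment it lands close to 1%, with the same slight residual skew Sage observes below.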
>>>>>>>
>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>> current code, you get
>>>>>>>
>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>
>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>> second. With my hacky change,
>>>>>>>
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>
>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>> 1%).
>>>>>>>
>>>>>>> Next steps:
>>>>>>>
>>>>>>> 1- generalize this for >2 replicas
>>>>>>> 2- figure out why it skews high
>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-22 11:46 ` Pedro López-Adeva
@ 2017-02-25 0:38 ` Loic Dachary
2017-02-25 8:41 ` Pedro López-Adeva
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-25 0:38 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
> That, for validation, would be great. Until weekend I don't think I'm
> going to have time to work on this anyway.
An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
Cheers
> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>
>>
>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>> Hi,
>>>
>>> I think your description of my proposed solution is quite good.
>>>
>>> I had a first look at Sage's paper but not ceph's implementation. My
>>> plan is to finish the paper and make an implementation in python that
>>> mimics more closely ceph's algorithm.
>>>
>>> Regarding your question about data movement:
>>>
>>> If I understood the paper correctly what is happening right now is
>>> that when weights change on the devices some of them will become
>>> overloaded and the current algorithm will try to correct for that but
>>> this approach, I think, is independent of how we compute the weights
>>> for each device. My point is that the current data movement pattern
>>> will not be modified.
>>>
>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>> Maybe by making the probabilities non-stationary with the new disk
>>> getting at first very high probability and after each replica
>>> placement decrease it until it stabilizes to its final value. But I'm
>>> just guessing and I really don't know if this can be made to work in a
>>> distributed manner as is currently the case and how would this fit in
>>> the current architecture. In any case it would be a problem at least as
>>> hard as the current reweighting problem.
>>>
>>> So, to summarize, my current plans:
>>>
>>> - Have another look at the paper
>>> - Make an implementation in python that imitates more closely the
>>> current algorithm
>>
>> What if I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>
>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>
>>> I will give updates here when there are significant changes so
>>> everyone can have a look and suggest improvements.
>>>
>>> Cheers,
>>> Pedro.
>>>
>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>> and arbitrary constraints.
>>>>>
>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>
>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>
>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>
>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>
>>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>>>>
>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>>>
>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>
>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>> deficient, I recommend downloading/cloning)
>>>>>
>>>>>
>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>> selecting the first replica, the second replica, etc...
>>>>> In practical terms there are several problems:
>>>>>
>>>>> - It's not practical for a high number of disks or replicas.
>>>>>
>>>>> Possible solution: approximate the summation over all possible disk
>>>>> selections with a Monte Carlo method.
>>>>> The algorithm would be: we start with a candidate solution, we run a
>>>>> simulation, and based on the results
>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>
>>>>> Other solution: cluster similar disks together.
>>>>>
>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>> about its convergence properties.
>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>
>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>> to a locally optimum policy. I see
>>>>> no escape except by carefully designing the policy. All solutions to
>>>>> the problem are going to be non linear
>>>>> since we must condition current probabilities on previous disk selections.
>>>>>
>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>> disk selections that violate at least one constraint.
>>>>> This means that for bad policies it can spend all the time rejecting
>>>>> invalid disk selection candidates.
>>>>>
>>>>> Possible solution: the policy cannot be designed independently of the
>>>>> constraints. I don't know what constraints
>>>>> are typical use cases but having a look should be the first step. The
>>>>> constraints must be an input to the policy.
>>>>>
>>>>>
>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>> found the problem an interesting puzzle.
>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>
>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>>
>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>
>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>
>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>
>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>>
>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>
>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>
>>>> Cheers
>>>>
>>>>>
>>>>>
>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>> López-Adeva Fernández-Layos came up with a well-written analysis of the
>>>>>>> problem and a tentative solution which he described at:
>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>
>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>> mistaken ?
>>>>>>
>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>> between two unmodified disks does not change, then straw2 will avoid
>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>> the disk that is reweighted).
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD. So
>>>>>>>>
>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>> P(pick small | first pick not small)
>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>> current code, you get
>>>>>>>>
>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>
>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>> second. With my hacky change,
>>>>>>>>
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>
>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>> 1%).
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>
>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>> 2- figure out why it skews high
>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-02-25 0:38 ` Loic Dachary
@ 2017-02-25 8:41 ` Pedro López-Adeva
2017-02-25 9:02 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-25 8:41 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
Great! Installed without problem and ran the example OK. I will
convert what I already have to use the library and continue from
there.
2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>> That, for validation, would be great. Until weekend I don't think I'm
>> going to have time to work on this anyway.
>
> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>
> Cheers
>
>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>
>>>
>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>> Hi,
>>>>
>>>> I think your description of my proposed solution is quite good.
>>>>
>>>> I had a first look at Sage's paper but not ceph's implementation. My
>>>> plan is to finish the paper and make an implementation in python that
>>>> mimics more closely ceph's algorithm.
>>>>
>>>> Regarding your question about data movement:
>>>>
>>>> If I understood the paper correctly what is happening right now is
>>>> that when weights change on the devices some of them will become
>>>> overloaded and the current algorithm will try to correct for that but
>>>> this approach, I think, is independent of how we compute the weights
>>>> for each device. My point is that the current data movement pattern
>>>> will not be modified.
>>>>
>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>> Maybe by making the probabilities non-stationary with the new disk
>>>> getting at first very high probability and after each replica
>>>> placement decrease it until it stabilizes to its final value. But I'm
>>>> just guessing and I really don't know if this can be made to work in a
>>>> distributed manner as is currently the case and how would this fit in
>>>> the current architecture. In any case it would be a problem at least as
>>>> hard as the current reweighting problem.
>>>>
>>>> So, to summarize, my current plans:
>>>>
>>>> - Have another look at the paper
>>>> - Make an implementation in python that imitates more closely the
>>>> current algorithm
>>>
>>> What if I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>>
>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>
>>>> I will give updates here when there are significant changes so
>>>> everyone can have a look and suggest improvements.
>>>>
>>>> Cheers,
>>>> Pedro.
>>>>
>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>> and arbitrary constraints.
>>>>>>
>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>
>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>
>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>
>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>
>>>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>>>>>
>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>>>>
>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>
>>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>>> deficient, I recommend downloading/cloning)
>>>>>>
>>>>>>
>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>> selecting the first replica, the second replica, etc...
>>>>>> In practical terms there are several problems:
>>>>>>
>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>
>>>>>> Possible solution: approximate the summation over all possible disk
>>>>>> selections with a Monte Carlo method.
>>>>>> The algorithm would be: we start with a candidate solution, we run a
>>>>>> simulation, and based on the results
>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>
>>>>>> Other solution: cluster similar disks together.
>>>>>>
>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>> about its convergence properties.
>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>
>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>> to a locally optimum policy. I see
>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>> the problem are going to be non linear
>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>
>>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>>> disk selections that violate at least one constraint.
>>>>>> This means that for bad policies it can spend all the time rejecting
>>>>>> invalid disk selection candidates.
>>>>>>
>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>> constraints. I don't know what constraints
>>>>>> are typical use cases but having a look should be the first step. The
>>>>>> constraints must be an input to the policy.
>>>>>>
>>>>>>
>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>> found the problem an interesting puzzle.
>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>>
>>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>>>
>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>
>>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>
>>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>>
>>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>>>
>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>>
>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>
>>>>> Cheers
>>>>>
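The movement pattern in the table above can be reproduced in miniature with a straw2-style draw. This is a hedged editorial sketch of the idea, not the C implementation: the hash and scaling are simplified, and it places a single copy rather than replica 2 (so it does not model the ~1% inter-disk movement caused by constraint handling). Each disk gets an independent deterministic draw per object, scaled by its weight, and the longest straw wins; adding a disk leaves every other disk's draw untouched, so objects only ever move to the newcomer.

```python
import hashlib
import math

def straw2_draw(obj, disk, weight):
    """Deterministic pseudo-random straw length for (object, disk),
    scaled by weight: ln(u)/w with u uniform in (0, 1].  SHA-256 stands
    in for CRUSH's mixing function."""
    h = hashlib.sha256(f"{obj}-{disk}".encode()).digest()
    u = (int.from_bytes(h[:8], "big") + 1) / float(1 << 64)
    return math.log(u) / weight    # higher weight => longer straw on average

def place(obj, weights):
    """Place obj on the disk with the maximum straw."""
    return max(weights, key=lambda d: straw2_draw(obj, d, weights[d]))

before = {d: 1.0 for d in range(10)}   # 10 equal disks
after = {d: 1.0 for d in range(11)}    # add an 11th

moved_between_old = 0
moved_to_new = 0
for obj in range(10000):
    src, dst = place(obj, before), place(obj, after)
    if src != dst:
        if dst == 10:
            moved_to_new += 1
        else:
            moved_between_old += 1
```

Because the draws for disks 0-9 are byte-for-byte identical in both maps, `moved_between_old` comes out exactly 0 and roughly 1/11th of the objects move to disk 10, which is the straw2 property Sage describes further down the thread.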
>>>>>>
>>>>>>
>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>
>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>> mistaken ?
>>>>>>>
>>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>> the disk that is reweighted).
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>> This is a longstanding bug,
>>>>>>>>>
>>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>>
>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>
>>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>>
>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>> discussion here.
>>>>>>>>>
>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>> brain hurt.
>>>>>>>>>
>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>
>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>
>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>
>>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>
>>>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>> their second replica be the small OSD. So
>>>>>>>>>
>>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>
>>>>>>>>> Putting those together,
>>>>>>>>>
>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>>
>>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>>> that they weren't already chosen.
>>>>>>>>>
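The derivation above is easy to spot-check numerically. The following is an editorial sketch (plain python sampling, not Sage's straw2 branch): it draws two distinct disks from the [99 99 99 99 4] bucket, weighting the second pick either by the raw weights or by the adjusted weight / (total - weight), and measures how often the small disk lands in the second slot. The target is 4/400 = 1%.

```python
import random

WEIGHTS = [99, 99, 99, 99, 4]        # the test bucket discussed here
TOTAL = sum(WEIGHTS)
N = 100_000

def small_second_share(adjust, seed=1):
    """Fraction of trials whose *second* pick is the small disk (index 4)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(N):
        first = rng.choices(range(5), weights=WEIGHTS)[0]
        if adjust:
            w2 = [w / (TOTAL - w) for w in WEIGHTS]   # conditional weights
        else:
            w2 = [float(w) for w in WEIGHTS]          # current behaviour
        w2[first] = 0.0                               # no duplicate picks
        if rng.choices(range(5), weights=w2)[0] == 4:
            hits += 1
    return hits / N

naive = small_second_share(adjust=False)   # ~1.3%: small disk overfilled
fixed = small_second_share(adjust=True)    # ~1.0%: close to the target
```

The naive run reproduces the roughly one-third relative excess visible in the crushtool output below, and the adjusted run lands within a fraction of a percent of the target, still very slightly high, just as the corrected numbers do.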
>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>>> current code, you get
>>>>>>>>>
>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>>
>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>> second. With my hacky change,
>>>>>>>>>
>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>>
>>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>>> 1%).
>>>>>>>>>
>>>>>>>>> Next steps:
>>>>>>>>>
>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>> 2- figure out why it skews high
>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-25 8:41 ` Pedro López-Adeva
@ 2017-02-25 9:02 ` Loic Dachary
2017-03-02 9:43 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-25 9:02 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
> Great! Installed without problem and ran the example OK. I will
> convert what I already have to use the library and continue from
> there.
Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
Cheers
>
> 2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>>> That, for validation, would be great. Until weekend I don't think I'm
>>> going to have time to work on this anyway.
>>
>> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>>
>> Cheers
>>
>>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>
>>>>
>>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>>> Hi,
>>>>>
>>>>> I think your description of my proposed solution is quite good.
>>>>>
>>>>> I had a first look at Sage's paper but not Ceph's implementation. My
>>>>> plan is to finish the paper and make an implementation in python that
>>>>> mimics Ceph's algorithm more closely.
>>>>>
>>>>> Regarding your question about data movement:
>>>>>
>>>>> If I understood the paper correctly what is happening right now is
>>>>> that when weights change on the devices some of them will become
>>>>> overloaded and the current algorithm will try to correct for that but
>>>>> this approach, I think, is independent of how we compute the weights
>>>>> for each device. My point is that the current data movement pattern
>>>>> will not be modified.
>>>>>
>>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>>> Maybe by making the probabilities non-stationary with the new disk
>>>>> getting at first very high probability and after each replica
>>>>> placement decrease it until it stabilizes to its final value. But I'm
>>>>> just guessing and I really don't know if this can be made to work in a
>>>>> distributed manner as is currently the case and how would this fit in
>>>>> the current architecture. In any case it would be a problem at least as
>>>>> hard as the current reweighting problem.
>>>>>
>>>>> So, to summarize, my current plans:
>>>>>
>>>>> - Have another look at the paper
>>>>> - Make an implementation in python that imitates more closely the
>>>>> current algorithm
>>>>
>>>> What if I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>>>
>>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>>
>>>>> I will give updates here when there are significant changes so
>>>>> everyone can have a look and suggest improvements.
>>>>>
>>>>> Cheers,
>>>>> Pedro.
>>>>>
>>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>>> and arbitrary constraints.
>>>>>>>
>>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>>
>>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>>
>>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>>
>>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>>
>>>>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less than 0.001 away from it.
>>>>>>
>>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>>>>>
>>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>>
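Loic's summary above can be made concrete with a toy version of the approach. This is an editorial sketch, far simpler than the notebook: a squared-error loss, finite-difference gradients instead of an analytic jacobian, and a softmax to keep the probabilities valid, all names invented for illustration. It searches for first-pick probabilities such that two picks without replacement fill the [99 99 99 99 4] disks in proportion to their weights.

```python
import math

weights = [99, 99, 99, 99, 4]
target = [w / sum(weights) for w in weights]   # desired share per disk

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def expected_fill(p):
    """Expected share of replicas per disk for 2 picks without replacement:
    first pick ~ p, second pick ~ p renormalized over the remaining disks."""
    n = len(p)
    fill = list(p)                              # first-pick contribution
    for j in range(n):
        for i in range(n):
            if i != j:
                fill[i] += p[j] * p[i] / (1 - p[j])
    return [f / 2 for f in fill]

def loss(theta):
    return sum((f - t) ** 2
               for f, t in zip(expected_fill(softmax(theta)), target))

theta = [0.0] * 5                               # start from uniform picks
eps, lr = 1e-6, 5.0
for _ in range(10000):                          # plain gradient descent
    base = loss(theta)
    grad = [(loss(theta[:k] + [theta[k] + eps] + theta[k+1:]) - base) / eps
            for k in range(5)]
    theta = [t - lr * g for t, g in zip(theta, grad)]

p = softmax(theta)
```

The recovered first-pick probability for the small disk settles a bit below its 1% weight share (around 0.86% under these assumptions): the small disk must be under-picked on round one to compensate for its excess chance of landing in round two, which is the same effect Sage's conditional-weight adjustment targets from the other direction.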
>>>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>>>> deficient, I recommend downloading/cloning)
>>>>>>>
>>>>>>>
>>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>>> selecting the first replica, the second replica, etc...
>>>>>>> In practical terms there are several problems:
>>>>>>>
>>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>>
>>>>>>> Possible solution: approximate summation over all possible disk
>>>>>>> selections with a Monte Carlo method.
>>>>>>> the algorithm would be: we start with a candidate solution, we run a
>>>>>>> simulation and based on the results
>>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>>
>>>>>>> Other solution: cluster similar disks together.
>>>>>>>
>>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>>> about its convergence properties.
>>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>>
>>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>>> to a locally optimum policy. I see
>>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>>> the problem are going to be non linear
>>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>>
>>>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>>>> disk selections that violate at least one constraint.
>>>>>>> This means that for bad policies it can spend all the time rejecting
>>>>>>> invalid disk selection candidates.
>>>>>>>
>>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>>> constraints. I don't know what the typical constraints are in practice,
>>>>>>> but having a look should be the first step. The constraints must be an
>>>>>>> input to the policy.
>>>>>>>
>>>>>>>
>>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>>> found the problem an interesting puzzle.
>>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
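Pedro's Monte Carlo suggestion above ("start with a candidate solution, run a simulation, update the probabilities, repeat") can be sketched as a simple multiplicative fitting loop. This is an editorial illustration with invented names, not the notebook's method: it simulates replica-2 placement without replacement, compares the observed fill to the target share, and nudges each weight by the ratio.

```python
import random

random.seed(0)
raw = [99, 99, 99, 99, 4]
target = [w / sum(raw) for w in raw]     # desired share of replicas per disk

def simulate(w, n=10000):
    """Observed per-disk share of replicas for 2 picks without replacement."""
    counts = [0] * len(w)
    for _ in range(n):
        a = random.choices(range(len(w)), weights=w)[0]
        w2 = list(w)
        w2[a] = 0.0                       # no duplicate picks
        b = random.choices(range(len(w)), weights=w2)[0]
        counts[a] += 1
        counts[b] += 1
    return [c / (2 * n) for c in counts]

w = [1.0] * 5                             # candidate solution: uniform
for _ in range(20):
    observed = simulate(w)
    # multiplicative update: over-filled disks lose weight, under-filled gain
    w = [wi * t / max(o, 1e-9) for wi, o, t in zip(w, observed, target)]
    s = sum(w)
    w = [wi / s for wi in w]

final = simulate(w, n=40000)
```

After a handful of iterations the fitted weights put the small disk very close to its 1% share, at the cost of running a fresh simulation per update, which is exactly the practicality trade-off Pedro flags for large clusters.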
>>>>>>
>>>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks is used.
>>>>>>
>>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>>
>>>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>>
>>>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>>>
>>>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>>>>
>>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>>>
>>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>
>>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>>> mistaken ?
>>>>>>>>
>>>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>>> the disk that is reweighted).
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>
>>>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>>>
>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>
>>>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>
>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>> discussion here.
>>>>>>>>>>
>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>> brain hurt.
>>>>>>>>>>
>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>
>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>
>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>
>>>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>
>>>>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>> their second replica be the small OSD. So
>>>>>>>>>>
>>>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>
>>>>>>>>>> Putting those together,
>>>>>>>>>>
>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>>>
>>>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>>>> that they weren't already chosen.
>>>>>>>>>>
>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>> current code, you get
>>>>>>>>>>
>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>>>
>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>> second. With my hacky change,
>>>>>>>>>>
>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>>>
>>>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>>>> 1%).
>>>>>>>>>>
>>>>>>>>>> Next steps:
>>>>>>>>>>
>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-02-25 9:02 ` Loic Dachary
@ 2017-03-02 9:43 ` Loic Dachary
2017-03-02 9:58 ` Pedro López-Adeva
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-02 9:43 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
There is a new version of python-crush at https://pypi.python.org/pypi/crush which changes the layout of the crushmap and the documentation was updated accordingly at http://crush.readthedocs.io/. Sorry for the inconvenience.
Cheers
On 02/25/2017 10:02 AM, Loic Dachary wrote:
>
>
> On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
>> Great! Installed without problem and ran the example OK. I will
>> convert what I already have to use the library and continue from
>> there.
>
> Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
>
> Cheers
>
>>
>> 2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> Hi Pedro,
>>>
>>> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>>>> That, for validation, would be great. Until weekend I don't think I'm
>>>> going to have time to work on this anyway.
>>>
>>> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>>>
>>> Cheers
>>>
>>>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>
>>>>>
>>>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I think your description of my proposed solution is quite good.
>>>>>>
>>>>>> I had a first look at Sage's paper but not Ceph's implementation. My
>>>>>> plan is to finish the paper and make an implementation in python that
>>>>>> mimics Ceph's algorithm more closely.
>>>>>>
>>>>>> Regarding your question about data movement:
>>>>>>
>>>>>> If I understood the paper correctly what is happening right now is
>>>>>> that when weights change on the devices some of them will become
>>>>>> overloaded and the current algorithm will try to correct for that but
>>>>>> this approach, I think, is independent of how we compute the weights
>>>>>> for each device. My point is that the current data movement pattern
>>>>>> will not be modified.
>>>>>>
>>>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>>>> Maybe by making the probabilities non-stationary with the new disk
>>>>>> getting at first very high probability and after each replica
>>>>>> placement decrease it until it stabilizes to its final value. But I'm
>>>>>> just guessing and I really don't know if this can be made to work in a
>>>>>> distributed manner as is currently the case and how would this fit in
>>>>>> the current architecture. In any case it would be a problem at least as
>>>>>> hard as the current reweighting problem.
>>>>>>
>>>>>> So, to summarize, my current plans:
>>>>>>
>>>>>> - Have another look at the paper
>>>>>> - Make an implementation in python that imitates more closely the
>>>>>> current algorithm
>>>>>
>>>>> What if I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>>>>
>>>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>>>
>>>>>> I will give updates here when there are significant changes so
>>>>>> everyone can have a look and suggest improvements.
>>>>>>
>>>>>> Cheers,
>>>>>> Pedro.
>>>>>>
>>>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>>>> and arbitrary constraints.
>>>>>>>>
>>>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>>>
>>>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>>>
>>>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>>>
>>>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>>>
>>>>>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less than 0.001 away from it.
>>>>>>>
>>>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>>>>>>
>>>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>>>
>>>>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>>>>> deficient, I recommend downloading/cloning)
>>>>>>>>
>>>>>>>>
>>>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>>>> selecting the first replica, the second replica, etc...
>>>>>>>> In practical terms there are several problems:
>>>>>>>>
>>>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>>>
>>>>>>>> Possible solution: approximate summation over all possible disk
>>>>>>>> selections with a Monte Carlo method.
>>>>>>>> the algorithm would be: we start with a candidate solution, we run a
>>>>>>>> simulation and based on the results
>>>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>>>
>>>>>>>> Other solution: cluster similar disks together.
>>>>>>>>
>>>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>>>> about its convergence properties.
>>>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>>>
>>>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>>>> to a locally optimum policy. I see
>>>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>>>> the problem are going to be non linear
>>>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>>>
>>>>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>>>>> disk selections that violate at least one constraint.
>>>>>>>> This means that for bad policies it can spend all the time rejecting
>>>>>>>> invalid disk selection candidates.
>>>>>>>>
>>>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>>>> constraints. I don't know what the typical constraints are in practice,
>>>>>>>> but having a look should be the first step. The constraints must be an
>>>>>>>> input to the policy.
>>>>>>>>
>>>>>>>>
>>>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>>>> found the problem an interesting puzzle.
>>>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>>>>
>>>>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks is used.
>>>>>>>
>>>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>>>
>>>>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>>>
>>>>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>>>>
>>>>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>>>>>
>>>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
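For intuition, a small stand-alone simulation of the same experiment (a hash-based sketch with equal weights, not compare.c itself) shows the ideal behavior the ~1% overhead is measured against: with a stable hash-driven choice, the only objects that move when a disk is added are those the new disk wins.

```python
import hashlib

def place(obj, n_disks):
    # Each (object, disk) pair draws a deterministic pseudo-random
    # "straw" from a hash; the disk with the largest straw wins.
    def straw(d):
        h = hashlib.sha256(f"{obj}:{d}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(range(n_disks), key=straw)

objects = range(10_000)
before = [place(o, 10) for o in objects]   # 10 disks
after = [place(o, 11) for o in objects]    # one disk added
moved = sum(b != a for b, a in zip(before, after))
# Existing straws are unchanged, so an object moves only if the new
# disk (index 10) wins it: about 1/11 of the objects in expectation,
# and none of the movement is disk-to-disk among the old disks.
assert all(b == a or a == 10 for b, a in zip(before, after))
```

In this idealized sketch exactly zero objects move between pre-existing disks; the disk-to-disk movement in the table above is what the real implementation adds on top of that ideal.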
>>>>>>>
>>>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>
>>>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>>>> mistaken ?
>>>>>>>>>
>>>>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>>>> the disk that is reweighted).
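That property can be illustrated with a toy straw2-style selection (a sketch of the idea, not the libcrush code): each device draws an independent hash-based uniform value u, its straw is ln(u)/weight, and the largest straw wins. Reweighting one device changes only that device's straws, so all movement involves that device.

```python
import hashlib
import math

def straw2_pick(obj, weights):
    """Pick the device with the largest straw, ln(u) / weight, where u
    is a per-(object, device) uniform hash in (0, 1]."""
    best, best_straw = None, -math.inf
    for dev, w in weights.items():
        h = hashlib.sha256(f"{obj}:{dev}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / 2.0**64  # uniform in (0, 1]
        straw = math.log(u) / w  # negative; closer to 0 for larger weight
        if straw > best_straw:
            best, best_straw = dev, straw
    return best

weights = {"a": 1.0, "b": 1.0, "c": 1.0}
before = [straw2_pick(o, weights) for o in range(10_000)]
weights["c"] = 0.5  # halve one device's weight
after = [straw2_pick(o, weights) for o in range(10_000)]
# Straws for "a" and "b" are untouched, so objects only ever move
# off the reweighted device; nothing moves between "a" and "b".
assert all(b == a or b == "c" for b, a in zip(before, after))
```

The ln(u)/w form is what makes the shares proportional to the weights (the draw is a race between exponentials of rate w), which is why halving "c" takes it from about a third of the objects to about a fifth while leaving the a/b split alone.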
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>>
>>>>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>>>>
>>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>>
>>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>>> discussion here.
>>>>>>>>>>>
>>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>>> brain hurt.
>>>>>>>>>>>
>>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>>
>>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>
>>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>>
>>>>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>>
>>>>>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>>> their second replica be the small OSD. So
>>>>>>>>>>>
>>>>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>>
>>>>>>>>>>> Putting those together,
>>>>>>>>>>>
>>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>>>>
>>>>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>>>>> that they weren't already chosen.
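A quick Monte Carlo check of this adjustment (an illustrative sketch, not the straw2 branch): draw the first replica with the raw weights, then redraw the second with every remaining weight replaced by w_i / (total - w_i). Using a bucket of weights [99, 99, 99, 99, 4]:

```python
import random

def pick_two(weights):
    total = sum(weights.values())
    items = list(weights)
    # First replica: plain weighted choice.
    first = random.choices(items, [weights[i] for i in items])[0]
    # Second replica: replace each remaining weight with the
    # conditional weight w_i / (total - w_i) before drawing.
    rest = [i for i in items if i != first]
    adjusted = [weights[i] / (total - weights[i]) for i in rest]
    second = random.choices(rest, adjusted)[0]
    return first, second

weights = {0: 99, 1: 99, 2: 99, 3: 99, 6: 4}
trials = 200_000
count = {i: 0 for i in weights}
for _ in range(trials):
    for dev in pick_two(weights):
        count[dev] += 1
# Device 6 has 1% of the total weight, so it should end up holding
# about 1% of all replicas (first and second picks combined).
share = count[6] / (2 * trials)
```

In this simulation the measured share lands near the 1% target for the small device but skews slightly high, matching the behavior described in this thread.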
>>>>>>>>>>>
>>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>>> current code, you get
>>>>>>>>>>>
>>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>>>>
>>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>>> second. With my hacky change,
>>>>>>>>>>>
>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>>>>
>>>>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>>>>> 1%).
>>>>>>>>>>>
>>>>>>>>>>> Next steps:
>>>>>>>>>>>
>>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-03-02 9:43 ` Loic Dachary
@ 2017-03-02 9:58 ` Pedro López-Adeva
2017-03-02 10:31 ` Loic Dachary
2017-03-07 23:06 ` Sage Weil
0 siblings, 2 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-02 9:58 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
Hi,
I will have a look. BTW, I have not progressed that much but I have
been thinking about it. In order to adapt the previous algorithm in
the python notebook I need to replace the iteration over all
possible device permutations with iteration over all the possible
selections that crush would make. That is the main thing I need to
work on.
The other thing is of course that weights change for each replica.
That is, they cannot really be fixed in the crush map. So the
algorithm inside libcrush, not only the weights in the map, needs to
be changed. The weights in the crush map should then, maybe, reflect
the desired usage frequencies. Or maybe each replica should have its
own crush map, but then the information about the previous selection
should be passed to the next replica placement run so it avoids
selecting the same one again.
I have a question also. Is there any significant difference between
the device selection algorithm description in the paper and its final
implementation?
Cheers,
Pedro.
2017-03-02 10:43 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> There is a new version of python-crush at https://pypi.python.org/pypi/crush which changes the layout of the crushmap and the documentation was updated accordingly at http://crush.readthedocs.io/. Sorry for the inconvenience.
>
> Cheers
>
> On 02/25/2017 10:02 AM, Loic Dachary wrote:
>>
>>
>> On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
>>> Great! Installed without problem and ran the example OK. I will
>>> convert what I already have to use the library and continue from
>>> there.
>>
>> Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
>>
>> Cheers
>>
>>>
>>> 2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>>>>>> That, for validation, would be great. Until the weekend I don't think I'm
>>>>> going to have time to work on this anyway.
>>>>
>>>> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>>>>
>>>> Cheers
>>>>
>>>>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>
>>>>>>
>>>>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think your description of my proposed solution is quite good.
>>>>>>>
>>>>>>> I had a first look at Sage's paper but not ceph's implementation. My
>>>>>>> plan is to finish the paper and make an implementation in python that
>>>>>>> mimics ceph's algorithm more closely.
>>>>>>>
>>>>>>> Regarding your question about data movement:
>>>>>>>
>>>>>>> If I understood the paper correctly what is happening right now is
>>>>>>> that when weights change on the devices some of them will become
>>>>>>> overloaded and the current algorithm will try to correct for that but
>>>>>>> this approach, I think, is independent of how we compute the weights
>>>>>>> for each device. My point is that the current data movement pattern
>>>>>>> will not be modified.
>>>>>>>
>>>>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>>>>> Maybe by making the probabilities non-stationary with the new disk
>>>>>>> getting at first very high probability and after each replica
>>>>>>> placement decrease it until it stabilizes to its final value. But I'm
>>>>>>> just guessing and I really don't know if this can be made to work in a
>>>>>>> distributed manner as is currently the case, or how this would fit in
>>>>>>> the current architecture. In any case it would be a problem at least
>>>>>>> as hard as the current reweighting problem.
>>>>>>>
>>>>>>> So, to summarize, my current plans:
>>>>>>>
>>>>>>> - Have another look at the paper
>>>>>>> - Make an implementation in python that imitates more closely the
>>>>>>> current algorithm
>>>>>>
>>>>>> What about I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to ? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>>>>>
>>>>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>>>>
>>>>>>> I will give updates here when there are significant changes so
>>>>>>> everyone can have a look and suggest improvements.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Pedro.
>>>>>>>
>>>>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>>>>> and arbitrary constraints.
>>>>>>>>>
>>>>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>>>>
>>>>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>>>>
>>>>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk, there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>>>>
>>>>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>>>>
>>>>>>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>>>>>>>>
>>>>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>>>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
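The mechanics of that approach can be sketched with a toy version (all names, the closed form, and the numerical gradient below are illustrative simplifications; the notebook uses an analytic Jacobian). For two replicas drawn sequentially without replacement, the per-disk usage has a closed form, and we descend on the squared distance to the target shares:

```python
import numpy as np

def usage(p):
    """Per-disk share of replicas when drawing 2 of n disks without
    replacement from first-pick probabilities p (closed form)."""
    u = p.copy()
    for i in range(len(p)):
        for j in range(len(p)):
            if i != j:
                u[i] += p[j] * p[i] / (1 - p[j])  # second pick after j
    return u / 2  # two picks per object

def fit(target, steps=8000, lr=0.05, eps=1e-6):
    """Gradient descent on the squared loss, using a central-difference
    numerical gradient as a stand-in for the analytic Jacobian."""
    p = np.array(target, dtype=float)
    loss = lambda q: ((usage(q / q.sum()) - target) ** 2).sum()
    for _ in range(steps):
        g = np.zeros_like(p)
        for k in range(len(p)):
            d = np.zeros_like(p)
            d[k] = eps
            g[k] = (loss(p + d) - loss(p - d)) / (2 * eps)
        p = np.clip(p - lr * g, 1e-9, None)
    return p / p.sum()

target = np.array([0.3, 0.3, 0.3, 0.1])
p = fit(target)  # the small disk's probability ends up below 0.1,
                 # so its realized usage matches the 0.1 target
```

The small disk's first-pick probability must be pushed below its target share, for the same reason discussed earlier in this thread: its unconditioned weight over-selects it on the second pick.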
>>>>>>>>
>>>>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>>>>
>>>>>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>>>>>> deficient, I recommend downloading/cloning)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>>>>> selecting the first replica, the second replica, etc...
>>>>>>>>> In practical terms there are several problems:
>>>>>>>>>
>>>>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>>>>
>>>>>>>>> Possible solution: approximate summation over all possible disk
>>>>>>>>> selections with a Monte Carlo method.
>>>>>>>>> the algorithm would be: we start with a candidate solution, we run a
>>>>>>>>> simulation and based on the results
>>>>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>>>>
>>>>>>>>> Other solution: cluster similar disks together.
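That Monte Carlo idea can be sketched as follows (a toy stand-in with made-up names, using a simple multiplicative update rather than the notebook's gradient step): estimate per-disk usage by simulating selections instead of summing over all of them, nudge the probabilities toward the target, and repeat.

```python
import random

def mc_usage(p, n_rep=2, samples=20_000):
    """Monte Carlo estimate of per-disk usage: simulate weighted
    selections (rejecting duplicate disks) instead of exact summation."""
    devs = list(range(len(p)))
    counts = [0] * len(p)
    for _ in range(samples):
        picked = set()
        while len(picked) < n_rep:
            picked.add(random.choices(devs, p)[0])
        for d in picked:
            counts[d] += 1
    return [c / (samples * n_rep) for c in counts]

target = [0.3, 0.3, 0.3, 0.1]
p = list(target)
for _ in range(15):
    est = mc_usage(p)
    # Nudge each probability toward its target share and renormalize.
    p = [pi * t / e for pi, t, e in zip(p, target, est)]
    s = sum(p)
    p = [pi / s for pi in p]
```

The estimates are noisy, so the loop converges only to within the Monte Carlo error, but it never has to enumerate the selection space, which is the point of the approximation.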
>>>>>>>>>
>>>>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>>>>> about its convergence properties.
>>>>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>>>>
>>>>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>>>>> to a locally optimal policy. I see
>>>>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>>>>> the problem are going to be non-linear
>>>>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>>>>
>>>>>>>>> - Although it can handle arbitrary constraints, it does so by rejecting
>>>>>>>>> disk selections that violate at least one constraint.
>>>>>>>>> This means that for bad policies it can spend all its time rejecting
>>>>>>>>> invalid disk selection candidates.
>>>>>>>>>
>>>>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>>>>> constraints. I don't know what constraints
>>>>>>>>> are typical use cases but having a look should be the first step. The
>>>>>>>>> constraints must be an input to the policy.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>>>>> found the problem an interesting puzzle.
>>>>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>>>>>
>>>>>>>> In Sage's paper[1] as well as in the Ceph implementation[2], minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks, and higher probabilities for bigger disks is used.
>>>>>>>>
>>>>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>>>>
>>>>>>>> Here is an example[1] showing how data moves around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and two replicas. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>>>>
>>>>>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>>>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>>>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>>>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>>>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>>>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>>>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>>>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>>>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>>>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>>>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>>>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>>>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>>>>>
>>>>>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>>>>>>
>>>>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>>>>>
>>>>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>
>>>>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>>>>> mistaken ?
>>>>>>>>>>
>>>>>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>>>>> the disk that is reweighted).
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>>>
>>>>>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>>>>>
>>>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>>>
>>>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>>>> discussion here.
>>>>>>>>>>>>
>>>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>>>> brain hurt.
>>>>>>>>>>>>
>>>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>>>
>>>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>
>>>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>>>
>>>>>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>>>
>>>>>>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>>>> their second replica be the small OSD. So
>>>>>>>>>>>>
>>>>>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>>>
>>>>>>>>>>>> Putting those together,
>>>>>>>>>>>>
>>>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>>>>>
>>>>>>>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>>>>>> that they weren't already chosen.
>>>>>>>>>>>>
>>>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>>>> current code, you get
>>>>>>>>>>>>
>>>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>>>>>
>>>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>>>> second. With my hacky change,
>>>>>>>>>>>>
>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>>>>>
>>>>>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>>>>>> 1%).
>>>>>>>>>>>>
>>>>>>>>>>>> Next steps:
>>>>>>>>>>>>
>>>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-03-02 9:58 ` Pedro López-Adeva
@ 2017-03-02 10:31 ` Loic Dachary
2017-03-07 23:06 ` Sage Weil
1 sibling, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-02 10:31 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
On 03/02/2017 10:58 AM, Pedro López-Adeva wrote:
> Hi,
>
> I will have a look. BTW, I have not progressed that much but I have
> been thinking about it. In order to adapt the previous algorithm in
> the python notebook I need to substitute the iteration over all
> possible devices permutations to iteration over all the possible
> selections that crush would make. That is the main thing I need to
> work on.
That should be easy.
> The other thing is of course that weights change for each replica.
> That is, they cannot be really fixed in the crush map.
Do you mean that the weights for the replicas cannot be pre-calculated and stored in the crushmap before it is used for actual object mapping?
> So the
> algorithm inside libcrush, not only the weights in the map, need to be
> changed. The weights in the crush map should reflect then, maybe, the
> desired usage frequencies. Or maybe each replica should have their own
> crush map, but then the information about the previous selection
> should be passed to the next replica placement run so it avoids
> selecting the same one again.
>
> I have a question also. Is there any significant difference between
> the device selection algorithm description in the paper and its final
> implementation?
The implementation "Algorithm 1" is crush_do_rule and is different although it looks the same[1]. Since the devil is in the details, I would refer to the current code. It is a little difficult to read because it contains parts only required for backward compatibility. You can assume vary_r == 1, stable == 1, chooseleaf_descend_once == 0, local_fallback_retries == 0 and ignore all the parts that are not in that code path.
[1] crush_do_rule http://libcrush.org/main/libcrush/blob/master/crush/mapper.c#L852
> Cheers,
> Pedro.
>
> 2017-03-02 10:43 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> There is a new version of python-crush at https://pypi.python.org/pypi/crush which changes the layout of the crushmap and the documentation was updated accordingly at http://crush.readthedocs.io/. Sorry for the inconvenience.
>>
>> Cheers
>>
>> On 02/25/2017 10:02 AM, Loic Dachary wrote:
>>>
>>>
>>> On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
>>>> Great! Installed without problem and ran the example OK. I will
>>>> convert what I already have to use the library and continue from
>>>> there.
>>>
>>> Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
>>>
>>> Cheers
>>>
>>>>
>>>> 2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>>>>>> That, for validation, would be great. Until the weekend I don't think I'm
>>>>>> going to have time to work on this anyway.
>>>>>
>>>>> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>>>>>
>>>>> Cheers
>>>>>
>>>>>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>
>>>>>>>
>>>>>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think your description of my proposed solution is quite good.
>>>>>>>>
>>>>>>>> I had a first look at Sage's paper but not ceph's implementation. My
>>>>>>>> plan is to finish the paper and make an implementation in python that
>>>>>>>> mimics ceph's algorithm more closely.
>>>>>>>>
>>>>>>>> Regarding your question about data movement:
>>>>>>>>
>>>>>>>> If I understood the paper correctly what is happening right now is
>>>>>>>> that when weights change on the devices some of them will become
>>>>>>>> overloaded and the current algorithm will try to correct for that but
>>>>>>>> this approach, I think, is independent of how we compute the weights
>>>>>>>> for each device. My point is that the current data movement pattern
>>>>>>>> will not be modified.
>>>>>>>>
>>>>>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>>>>>> Maybe by making the probabilities non-stationary with the new disk
>>>>>>>> getting at first very high probability and after each replica
>>>>>>>> placement decrease it until it stabilizes to its final value. But I'm
>>>>>>>> just guessing and I really don't know if this can be made to work in a
>>>>>>>> distributed manner as is currently the case, or how this would fit in
>>>>>>>> the current architecture. In any case it would be a problem at least
>>>>>>>> as hard as the current reweighting problem.
>>>>>>>>
>>>>>>>> So, to summarize, my current plans:
>>>>>>>>
>>>>>>>> - Have another look at the paper
>>>>>>>> - Make an implementation in python that imitates more closely the
>>>>>>>> current algorithm
>>>>>>>
>>>>>>> What about I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to ? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>>>>>>
>>>>>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>>>>>
>>>>>>>> I will give updates here when there are significant changes so
>>>>>>>> everyone can have a look and suggest improvements.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Pedro.
>>>>>>>>
>>>>>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>>>>>> and arbitrary constraints.
>>>>>>>>>>
>>>>>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>>>>>
>>>>>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>>>>>
>>>>>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk: there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>>>>>
>>>>>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>>>>>
>>>>>>>>> This is part one of your document and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.
>>>>>>>>>
>>>>>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>>>>>> [3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
>>>>>>>>>
>>>>>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>>>>>
>>>>>>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>>>>>>> deficient, I recommend downloading/cloning)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>>>>>> selecting the first replica, the second replica, etc...
>>>>>>>>>> In practical terms there are several problems:
>>>>>>>>>>
>>>>>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>>>>>
>>>>>>>>>> Possible solution: approximate the summation over all possible disk
>>>>>>>>>> selections with a Monte Carlo method.
>>>>>>>>>> The algorithm would be: we start with a candidate solution, we run a
>>>>>>>>>> simulation and, based on the results,
>>>>>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>>>>>
>>>>>>>>>> Other solution: cluster similar disks together.
>>>>>>>>>>
>>>>>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>>>>>> about its convergence properties.
>>>>>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>>>>>
>>>>>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>>>>>> to a locally optimal policy. I see
>>>>>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>>>>>> the problem are going to be non-linear
>>>>>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>>>>>
>>>>>>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>>>>>>> disk selections that violate at least one constraint.
>>>>>>>>>> This means that for bad policies it can spend all the time rejecting
>>>>>>>>>> invalid disk selection candidates.
>>>>>>>>>>
>>>>>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>>>>>> constraints. I don't know what constraints
>>>>>>>>>> are typical use cases but having a look should be the first step. The
>>>>>>>>>> constraints must be an input to the policy.
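The rejection approach described above can be sketched as follows. This is a toy illustration, not code from the notebook: `pick` and the constraint callables are hypothetical stand-ins for the policy and the placement rules.

```python
def sample_with_constraints(pick, constraints, max_tries=100):
    """Draw candidate selections from `pick` (the policy) and reject
    any that violate a constraint, e.g. two replicas on one disk."""
    for _ in range(max_tries):
        selection = pick()
        if all(ok(selection) for ok in constraints):
            return selection
    raise RuntimeError("rejected too many candidates; the policy "
                       "matches the constraints poorly")
```

With a badly designed policy most candidates fail the `all(...)` check and the loop spends its budget rejecting, which is exactly the failure mode described above.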
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>>>>>> found the problem an interesting puzzle.
>>>>>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>>>>>>
>>>>>>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>>>>>>>
>>>>>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>>>>>
>>>>>>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>>>>>
>>>>>>>>> 00 01 02 03 04 05 06 07 08 09 10
>>>>>>>>> 00: 0 14 17 14 19 23 13 22 21 20 1800
>>>>>>>>> 01: 12 0 11 13 19 19 15 10 16 17 1841
>>>>>>>>> 02: 17 27 0 17 15 15 13 19 18 11 1813
>>>>>>>>> 03: 14 17 15 0 23 11 20 15 23 17 1792
>>>>>>>>> 04: 14 18 16 25 0 27 13 8 15 16 1771
>>>>>>>>> 05: 19 16 22 25 13 0 9 19 21 21 1813
>>>>>>>>> 06: 18 15 21 17 10 18 0 10 18 11 1873
>>>>>>>>> 07: 13 17 22 13 16 17 14 0 25 12 1719
>>>>>>>>> 08: 23 20 16 17 19 18 11 12 0 18 1830
>>>>>>>>> 09: 14 20 15 17 12 16 17 11 13 0 1828
>>>>>>>>> 10: 0 0 0 0 0 0 0 0 0 0 0
>>>>>>>>> before: 20164 19990 19863 19959 19977 20004 19926 20133 20041 19943 0
>>>>>>>>> after: 18345 18181 18053 18170 18200 18190 18040 18391 18227 18123 18080
>>>>>>>>>
>>>>>>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from the existing disks to the new one, which is what we need.
>>>>>>>>>
>>>>>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
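For reference, the ideal amount of movement in the example above can be computed directly: the new disk should end up holding its share of all replicas, and nothing else should move. This is a back-of-the-envelope check, not part of compare.c:

```python
def ideal_movement(objects, replicas, old_disks, new_disks):
    """Replicas that must move when `new_disks` equally weighted disks
    join `old_disks` existing ones: exactly the new disks' share of
    the total replica count."""
    total_disks = old_disks + new_disks
    return objects * replicas * new_disks / total_disks

# 100,000 objects, replica 2, one disk added to ten:
# 100000 * 2 * 1 / 11 ~= 18182 replicas, matching the ~18,000
# objects that land on disk 10 in the table above.
```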
>>>>>>>>>
>>>>>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>>
>>>>>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>>>>>> mistaken ?
>>>>>>>>>>>
>>>>>>>>>>> Maybe (I haven't looked closely at the above yet). But for comparison, in
>>>>>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>>>>>> disks changes the probability from 1/10 to 1/9). The key property that
>>>>>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>>>>>> the disk that is reweighted).
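The straw2 property described above follows from each item's draw depending only on its own weight. A toy version, using a seeded RNG where the real implementation uses a deterministic hash of the input and item id:

```python
import math
import random

def straw2_pick(weights, rng):
    """Each item draws ln(u) / w and the largest (least negative)
    draw wins.  Because an item's draw depends only on its own
    weight, reweighting one disk can only move data to or from that
    disk, never between the others."""
    best, best_draw = None, -math.inf
    for i, w in enumerate(weights):
        if w <= 0:
            continue
        draw = math.log(rng.random()) / w
        if draw > best_draw:
            best, best_draw = i, draw
    return best
```

Over many draws an item wins in proportion to its weight: with weights [1, 9] the second item is picked roughly nine times out of ten.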
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://tracker.ceph.com/issues/15653
>>>>>>>>>>>>>
>>>>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>>>>
>>>>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>>>>> discussion here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>>>>> works (almost perfectly) for the 2nd round of placements. The solution is
>>>>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>>>>> brain hurt.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use. Instead, we want
>>>>>>>>>>>>> P(pick small | first pick not small). Since P(a|b) (the probability of a
>>>>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>>>>
>>>>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>>>>
>>>>>>>>>>>>> P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>>>>
>>>>>>>>>>>>> and the && term is the distribution we're trying to produce. For example,
>>>>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>>>>> their second replica be the small OSD. So
>>>>>>>>>>>>>
>>>>>>>>>>>>> P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>>>>
>>>>>>>>>>>>> Putting those together,
>>>>>>>>>>>>>
>>>>>>>>>>>>> P(pick small | first pick not small)
>>>>>>>>>>>>> = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>> = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>>>>> = small_weight / (total_weight - small_weight)
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is, on the second round, we should adjust the weights as above so
>>>>>>>>>>>>> that we get the right distribution of second choices. It turns out it
>>>>>>>>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>>>>>>>>> that they weren't already chosen.
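The adjustment derived above is easy to sketch. This is a toy two-round simulation with a plain weighted draw standing in for straw2, not the actual patch:

```python
import random

def adjusted_weights(weights):
    """Second-round weights: w_i / (total - w_i), i.e. proportional to
    the conditional probability that i is picked given that the first
    pick was something else."""
    total = sum(weights)
    return [w / (total - w) for w in weights]

def weighted_pick(weights, rng):
    """Plain weighted draw (stand-in for a straw2 bucket)."""
    r = rng.uniform(0, sum(weights))
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

def place_two(weights, rng):
    """First replica from the raw weights, second from the adjusted
    weights, retrying on collision."""
    first = weighted_pick(weights, rng)
    adj = adjusted_weights(weights)
    while True:
        second = weighted_pick(adj, rng)
        if second != first:
            return first, second
```

For the [99, 99, 99, 99, 4] bucket the small device's adjusted weight is 4 / (400 - 4), and in a simulation its share of second replicas comes out near the target 1%, instead of the inflated share the unadjusted weights produce.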
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>>>>> properly for num_rep = 2. With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>>>>> current code, you get
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>>>>> device 0: 19765965 [9899364,9866601]
>>>>>>>>>>>>> device 1: 19768033 [9899444,9868589]
>>>>>>>>>>>>> device 2: 19769938 [9901770,9868168]
>>>>>>>>>>>>> device 3: 19766918 [9898851,9868067]
>>>>>>>>>>>>> device 6: 929148 [400572,528576]
>>>>>>>>>>>>>
>>>>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>>>>> second. With my hacky change,
>>>>>>>>>>>>>
>>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2: 40000001/40000001
>>>>>>>>>>>>> device 0: 19797315 [9899364,9897951]
>>>>>>>>>>>>> device 1: 19799199 [9899444,9899755]
>>>>>>>>>>>>> device 2: 19801016 [9901770,9899246]
>>>>>>>>>>>>> device 3: 19797906 [9898851,9899055]
>>>>>>>>>>>>> device 6: 804566 [400572,403994]
>>>>>>>>>>>>>
>>>>>>>>>>>>> which is quite close, but still skewing slightly high (by a bit less than
>>>>>>>>>>>>> 1%).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Next steps:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>>>>
>>>>>>>>>>>>> sage
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-02 9:58 ` Pedro López-Adeva
2017-03-02 10:31 ` Loic Dachary
@ 2017-03-07 23:06 ` Sage Weil
2017-03-09 8:47 ` Pedro López-Adeva
1 sibling, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-07 23:06 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Loic Dachary, ceph-devel
Hi Pedro,
Thanks for taking a look at this! It's a frustrating problem and we
haven't made much headway.
On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> Hi,
>
> I will have a look. BTW, I have not progressed that much but I have
> been thinking about it. In order to adapt the previous algorithm in
> the python notebook I need to replace the iteration over all
> possible device permutations with iteration over all the possible
> selections that crush would make. That is the main thing I need to
> work on.
>
> The other thing is of course that the weights change for each replica.
> That is, they cannot really be fixed in the crush map. So the
> algorithm inside libcrush, not only the weights in the map, needs to
> be changed. The weights in the crush map should then, maybe, reflect
> the desired usage frequencies. Or maybe each replica should have its
> own crush map, but then the information about the previous selection
> should be passed to the next replica placement run so it avoids
> selecting the same one again.
My suspicion is that the best solution here (whatever that means!)
leaves the CRUSH weights intact with the desired distribution, and
then generates a set of derivative weights--probably one set for each
round/replica/rank.
One nice property of this is that once the support is added to encode
multiple sets of weights, the algorithm used to generate them is free to
change and evolve independently. (In most cases any change in
CRUSH's mapping behavior is difficult to roll out because all
parties participating in the cluster have to support any new behavior
before it is enabled or used.)
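One hypothetical shape such an encoding could take (illustrative only; this is not the actual crush map format, and the derived numbers below just apply the w / (total - w) adjustment discussed earlier in the thread):

```python
# One weight set per replica rank: rank 0 holds the raw target
# weights, later ranks hold derived (conditional) weights.
weight_sets = {
    0: [99, 99, 99, 99, 4],
    1: [99 / 301, 99 / 301, 99 / 301, 99 / 301, 4 / 396],
}

def weights_for_rank(weight_sets, rank):
    """Maps without a derived set for this rank fall back to the
    rank-0 weights, keeping the encoding backward compatible."""
    return weight_sets.get(rank, weight_sets[0])
```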
> I have a question also. Is there any significant difference between
> the device selection algorithm description in the paper and its final
> implementation?
The main difference is that the "retry_bucket" behavior was found to be a bad
idea; any collision or failed()/overload() case triggers the
retry_descent.
There are other changes, of course, but I don't think they'll impact any
solution we come up with here (or at least any solution can be suitably
adapted)!
sage
* Re: crush multipick anomaly
2017-03-07 23:06 ` Sage Weil
@ 2017-03-09 8:47 ` Pedro López-Adeva
2017-03-18 9:21 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-09 8:47 UTC (permalink / raw)
To: Sage Weil; +Cc: Loic Dachary, ceph-devel
Great, thanks for the clarifications.
I also think that the most natural way is to keep just a set of
weights in the CRUSH map and update them inside the algorithm.
I keep working on it.
2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
> Hi Pedro,
>
> Thanks for taking a look at this! It's a frustrating problem and we
> haven't made much headway.
>
> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> Hi,
>>
>> I will have a look. BTW, I have not progressed that much but I have
>> been thinking about it. In order to adapt the previous algorithm in
>> the python notebook I need to substitute the iteration over all
>> possible devices permutations to iteration over all the possible
>> selections that crush would make. That is the main thing I need to
>> work on.
>>
>> The other thing is of course that weights change for each replica.
>> That is, they cannot be really fixed in the crush map. So the
>> algorithm inside libcrush, not only the weights in the map, need to be
>> changed. The weights in the crush map should reflect then, maybe, the
>> desired usage frequencies. Or maybe each replica should have their own
>> crush map, but then the information about the previous selection
>> should be passed to the next replica placement run so it avoids
>> selecting the same one again.
>
> My suspicion is that the best solution here (whatever that means!)
> leaves the CRUSH weights intact with the desired distribution, and
> then generates a set of derivative weights--probably one set for each
> round/replica/rank.
>
> One nice property of this is that once the support is added to encode
> multiple sets of weights, the algorithm used to generate them is free to
> change and evolve independently. (In most cases any change is
> CRUSH's mapping behavior is difficult to roll out because all
> parties participating in the cluster have to support any new behavior
> before it is enabled or used.)
>
>> I have a question also. Is there any significant difference between
>> the device selection algorithm description in the paper and its final
>> implementation?
>
> The main difference is the "retry_bucket" behavior was found to be a bad
> idea; any collision or failed()/overload() case triggers the
> retry_descent.
>
> There are other changes, of course, but I don't think they'll impact any
> solution we come with here (or at least any solution can be suitably
> adapted)!
>
> sage
* Re: crush multipick anomaly
2017-03-09 8:47 ` Pedro López-Adeva
@ 2017-03-18 9:21 ` Loic Dachary
2017-03-19 22:31 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-18 9:21 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
I'm going to experiment with what you did at
https://github.com/plafl/notebooks/blob/master/replication.ipynb
and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
Cheers
On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> Great, thanks for the clarifications.
> I also think that the most natural way is to keep just a set of
> weights in the CRUSH map and update them inside the algorithm.
>
> I keep working on it.
>
>
> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>> Hi Pedro,
>>
>> Thanks for taking a look at this! It's a frustrating problem and we
>> haven't made much headway.
>>
>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> Hi,
>>>
>>> I will have a look. BTW, I have not progressed that much but I have
>>> been thinking about it. In order to adapt the previous algorithm in
>>> the python notebook I need to substitute the iteration over all
>>> possible devices permutations to iteration over all the possible
>>> selections that crush would make. That is the main thing I need to
>>> work on.
>>>
>>> The other thing is of course that weights change for each replica.
>>> That is, they cannot be really fixed in the crush map. So the
>>> algorithm inside libcrush, not only the weights in the map, need to be
>>> changed. The weights in the crush map should reflect then, maybe, the
>>> desired usage frequencies. Or maybe each replica should have their own
>>> crush map, but then the information about the previous selection
>>> should be passed to the next replica placement run so it avoids
>>> selecting the same one again.
>>
>> My suspicion is that the best solution here (whatever that means!)
>> leaves the CRUSH weights intact with the desired distribution, and
>> then generates a set of derivative weights--probably one set for each
>> round/replica/rank.
>>
>> One nice property of this is that once the support is added to encode
>> multiple sets of weights, the algorithm used to generate them is free to
>> change and evolve independently. (In most cases any change is
>> CRUSH's mapping behavior is difficult to roll out because all
>> parties participating in the cluster have to support any new behavior
>> before it is enabled or used.)
>>
>>> I have a question also. Is there any significant difference between
>>> the device selection algorithm description in the paper and its final
>>> implementation?
>>
>> The main difference is the "retry_bucket" behavior was found to be a bad
>> idea; any collision or failed()/overload() case triggers the
>> retry_descent.
>>
>> There are other changes, of course, but I don't think they'll impact any
>> solution we come with here (or at least any solution can be suitably
>> adapted)!
>>
>> sage
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-03-18 9:21 ` Loic Dachary
@ 2017-03-19 22:31 ` Loic Dachary
2017-03-20 10:49 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-19 22:31 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
Cheers
On 03/18/2017 10:21 AM, Loic Dachary wrote:
> Hi Pedro,
>
> I'm going to experiment with what you did at
>
> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>
> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>
> Cheers
>
> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> Great, thanks for the clarifications.
>> I also think that the most natural way is to keep just a set of
>> weights in the CRUSH map and update them inside the algorithm.
>>
>> I keep working on it.
>>
>>
>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>> Hi Pedro,
>>>
>>> Thanks for taking a look at this! It's a frustrating problem and we
>>> haven't made much headway.
>>>
>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>> Hi,
>>>>
>>>> I will have a look. BTW, I have not progressed that much but I have
>>>> been thinking about it. In order to adapt the previous algorithm in
>>>> the python notebook I need to substitute the iteration over all
>>>> possible devices permutations to iteration over all the possible
>>>> selections that crush would make. That is the main thing I need to
>>>> work on.
>>>>
>>>> The other thing is of course that weights change for each replica.
>>>> That is, they cannot be really fixed in the crush map. So the
>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>> desired usage frequencies. Or maybe each replica should have their own
>>>> crush map, but then the information about the previous selection
>>>> should be passed to the next replica placement run so it avoids
>>>> selecting the same one again.
>>>
>>> My suspicion is that the best solution here (whatever that means!)
>>> leaves the CRUSH weights intact with the desired distribution, and
>>> then generates a set of derivative weights--probably one set for each
>>> round/replica/rank.
>>>
>>> One nice property of this is that once the support is added to encode
>>> multiple sets of weights, the algorithm used to generate them is free to
>>> change and evolve independently. (In most cases any change is
>>> CRUSH's mapping behavior is difficult to roll out because all
>>> parties participating in the cluster have to support any new behavior
>>> before it is enabled or used.)
>>>
>>>> I have a question also. Is there any significant difference between
>>>> the device selection algorithm description in the paper and its final
>>>> implementation?
>>>
>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>> idea; any collision or failed()/overload() case triggers the
>>> retry_descent.
>>>
>>> There are other changes, of course, but I don't think they'll impact any
>>> solution we come with here (or at least any solution can be suitably
>>> adapted)!
>>>
>>> sage
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
* Re: crush multipick anomaly
2017-03-19 22:31 ` Loic Dachary
@ 2017-03-20 10:49 ` Loic Dachary
2017-03-23 11:49 ` Pedro López-Adeva
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-20 10:49 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi,
I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
Thanks !
Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
------------------------------------------------------------------------
Before: All replicas on each hard drive
Expected vs actual use (20000 samples)
disk 0: 1.39e-01 1.12e-01
disk 1: 1.11e-01 1.10e-01
disk 2: 8.33e-02 1.13e-01
disk 3: 1.39e-01 1.11e-01
disk 4: 1.11e-01 1.11e-01
disk 5: 8.33e-02 1.11e-01
disk 6: 1.39e-01 1.12e-01
disk 7: 1.11e-01 1.12e-01
disk 8: 8.33e-02 1.10e-01
it= 1 jac norm=1.59e-01 loss=5.27e-03
it= 2 jac norm=1.55e-01 loss=5.03e-03
...
it= 212 jac norm=1.02e-03 loss=2.41e-07
it= 213 jac norm=1.00e-03 loss=2.31e-07
Converged to desired accuracy :)
After: All replicas on each hard drive
Expected vs actual use (20000 samples)
disk 0: 1.39e-01 1.42e-01
disk 1: 1.11e-01 1.09e-01
disk 2: 8.33e-02 8.37e-02
disk 3: 1.39e-01 1.40e-01
disk 4: 1.11e-01 1.13e-01
disk 5: 8.33e-02 8.08e-02
disk 6: 1.39e-01 1.38e-01
disk 7: 1.11e-01 1.09e-01
disk 8: 8.33e-02 8.48e-02
Simulation: R=2 devices capacity [10 10 10 10 1]
------------------------------------------------------------------------
Before: All replicas on each hard drive
Expected vs actual use (20000 samples)
disk 0: 2.44e-01 2.36e-01
disk 1: 2.44e-01 2.38e-01
disk 2: 2.44e-01 2.34e-01
disk 3: 2.44e-01 2.38e-01
disk 4: 2.44e-02 5.37e-02
it= 1 jac norm=2.43e-01 loss=2.98e-03
it= 2 jac norm=2.28e-01 loss=2.47e-03
...
it= 37 jac norm=1.28e-03 loss=3.48e-08
it= 38 jac norm=1.07e-03 loss=2.42e-08
Converged to desired accuracy :)
After: All replicas on each hard drive
Expected vs actual use (20000 samples)
disk 0: 2.44e-01 2.46e-01
disk 1: 2.44e-01 2.44e-01
disk 2: 2.44e-01 2.41e-01
disk 3: 2.44e-01 2.45e-01
disk 4: 2.44e-02 2.33e-02
[1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
[2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
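For context, the outer loop driving "it= / jac norm= / loss=" output like the above looks roughly like this. It is a simplified sketch, with a squared-error loss on the first-pick frequencies only; the notebook's actual loss also accounts for the later replica rounds:

```python
import numpy as np

def loss_and_jac(w, target):
    """Squared error between the usage implied by weights `w` and the
    desired `target` distribution, with its exact gradient."""
    s = w.sum()
    diff = w / s - target
    loss = float(diff @ diff)
    # d(w_i/s)/d(w_j) = (delta_ij * s - w_i) / s**2, chained through the loss.
    jac = 2.0 * (diff / s - (diff @ w) / s**2)
    return loss, jac

def optimize(w, target, lr=2.0, tol=1e-3, max_it=5000):
    """Plain gradient descent until the jacobian norm drops below tol."""
    w = np.asarray(w, dtype=float).copy()
    for _ in range(max_it):
        loss, jac = loss_and_jac(w, target)
        if np.linalg.norm(jac) < tol:
            break
        w -= lr * jac
    return w
```

Starting from uniform weights and a target proportional to [10, 10, 10, 10, 1], the loop converges to weights whose normalized probabilities match the target, mirroring the "Converged to desired accuracy" runs above.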
On 03/19/2017 11:31 PM, Loic Dachary wrote:
> Hi Pedro,
>
> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>
> Cheers
>
> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> Hi Pedro,
>>
>> I'm going to experiment with what you did at
>>
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>
>> Cheers
>>
>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> Great, thanks for the clarifications.
>>> I also think that the most natural way is to keep just a set of
>>> weights in the CRUSH map and update them inside the algorithm.
>>>
>>> I keep working on it.
>>>
>>>
>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>> Hi Pedro,
>>>>
>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>> haven't made much headway.
>>>>
>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>> Hi,
>>>>>
>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>> the python notebook I need to substitute the iteration over all
>>>>> possible devices permutations to iteration over all the possible
>>>>> selections that crush would make. That is the main thing I need to
>>>>> work on.
>>>>>
>>>>> The other thing is of course that weights change for each replica.
>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>> crush map, but then the information about the previous selection
>>>>> should be passed to the next replica placement run so it avoids
>>>>> selecting the same one again.
>>>>
>>>> My suspicion is that the best solution here (whatever that means!)
>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>> then generates a set of derivative weights--probably one set for each
>>>> round/replica/rank.
>>>>
>>>> One nice property of this is that once the support is added to encode
>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>> change and evolve independently. (In most cases any change is
>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>> parties participating in the cluster have to support any new behavior
>>>> before it is enabled or used.)
>>>>
>>>>> I have a question also. Is there any significant difference between
>>>>> the device selection algorithm description in the paper and its final
>>>>> implementation?
>>>>
>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>> idea; any collision or failed()/overload() case triggers the
>>>> retry_descent.
>>>>
>>>> There are other changes, of course, but I don't think they'll impact any
>>>> solution we come up with here (or at least any solution can be suitably
>>>> adapted)!
>>>>
>>>> sage
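The per-round "derivative weights" Sage describes can be sketched numerically. The toy Python below (illustrative only, not libcrush code; the helper name is made up) rescales every weight for the second round by the conditional probability that the device was not taken in the first round, i.e. w_i / (total - w_i), as derived earlier in this thread:

```python
def second_round_weights(weights):
    # Toy sketch: adjusted weights for the second replica so that
    # P(pick i | first pick != i) matches the target distribution.
    # Each weight becomes w_i / (total - w_i).
    total = sum(weights)
    return [w / (total - w) for w in weights]

# Example: four large devices and one small one.
weights = [10, 10, 10, 10, 1]
round2 = second_round_weights(weights)
# The small device gets 1/40 against 10/31 for the others, i.e. its
# relative share drops below its nominal 1/10 ratio, compensating for
# the fact that it is rarely consumed by the first pick.
```

A full solution would generate one such derived set per round/replica/rank while leaving the weights stored in the CRUSH map untouched.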
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-20 10:49 ` Loic Dachary
@ 2017-03-23 11:49 ` Pedro López-Adeva
2017-03-23 14:13 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-23 11:49 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
Hi Loic,
From what I see everything seems OK. The interesting thing would be to
test on some complex mapping. The reason is that "CrushPolicyFamily"
is right now modeling just a single straw bucket, not the full CRUSH
algorithm. That's the work that remains to be done. The only way to
avoid reimplementing the CRUSH algorithm and computing the gradient
would be to treat CRUSH as a black box, removing the need for the
gradient either by using a gradient-free optimization method or by
estimating it.
2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi,
>
> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>
> Thanks !
>
> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
> ------------------------------------------------------------------------
> Before: All replicas on each hard drive
> Expected vs actual use (20000 samples)
> disk 0: 1.39e-01 1.12e-01
> disk 1: 1.11e-01 1.10e-01
> disk 2: 8.33e-02 1.13e-01
> disk 3: 1.39e-01 1.11e-01
> disk 4: 1.11e-01 1.11e-01
> disk 5: 8.33e-02 1.11e-01
> disk 6: 1.39e-01 1.12e-01
> disk 7: 1.11e-01 1.12e-01
> disk 8: 8.33e-02 1.10e-01
> it= 1 jac norm=1.59e-01 loss=5.27e-03
> it= 2 jac norm=1.55e-01 loss=5.03e-03
> ...
> it= 212 jac norm=1.02e-03 loss=2.41e-07
> it= 213 jac norm=1.00e-03 loss=2.31e-07
> Converged to desired accuracy :)
> After: All replicas on each hard drive
> Expected vs actual use (20000 samples)
> disk 0: 1.39e-01 1.42e-01
> disk 1: 1.11e-01 1.09e-01
> disk 2: 8.33e-02 8.37e-02
> disk 3: 1.39e-01 1.40e-01
> disk 4: 1.11e-01 1.13e-01
> disk 5: 8.33e-02 8.08e-02
> disk 6: 1.39e-01 1.38e-01
> disk 7: 1.11e-01 1.09e-01
> disk 8: 8.33e-02 8.48e-02
>
>
> Simulation: R=2 devices capacity [10 10 10 10 1]
> ------------------------------------------------------------------------
> Before: All replicas on each hard drive
> Expected vs actual use (20000 samples)
> disk 0: 2.44e-01 2.36e-01
> disk 1: 2.44e-01 2.38e-01
> disk 2: 2.44e-01 2.34e-01
> disk 3: 2.44e-01 2.38e-01
> disk 4: 2.44e-02 5.37e-02
> it= 1 jac norm=2.43e-01 loss=2.98e-03
> it= 2 jac norm=2.28e-01 loss=2.47e-03
> ...
> it= 37 jac norm=1.28e-03 loss=3.48e-08
> it= 38 jac norm=1.07e-03 loss=2.42e-08
> Converged to desired accuracy :)
> After: All replicas on each hard drive
> Expected vs actual use (20000 samples)
> disk 0: 2.44e-01 2.46e-01
> disk 1: 2.44e-01 2.44e-01
> disk 2: 2.44e-01 2.41e-01
> disk 3: 2.44e-01 2.45e-01
> disk 4: 2.44e-02 2.33e-02
>
>
> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>
> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> Hi Pedro,
>>
>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>
>> Cheers
>>
>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> Hi Pedro,
>>>
>>> I'm going to experiment with what you did at
>>>
>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>
>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>
>>> Cheers
>>>
>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>> Great, thanks for the clarifications.
>>>> I also think that the most natural way is to keep just a set of
>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>
>>>> I keep working on it.
>>>>
>>>>
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-23 11:49 ` Pedro López-Adeva
@ 2017-03-23 14:13 ` Loic Dachary
2017-03-23 15:32 ` Pedro López-Adeva
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-23 14:13 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> Hi Loic,
>
> From what I see everything seems OK.
Cool. I'll keep going in this direction then !
> The interesting thing would be to
> test on some complex mapping. The reason is that "CrushPolicyFamily"
> is right now modeling just a single straw bucket not the full CRUSH
> algorithm.
A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
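For context, a straw bucket draw can be sketched in a few lines. This is a toy straw2-style selection (illustrative only; the real libcrush implementation hashes the item and input rather than using a RNG): every item draws ln(u)/w and the largest value wins, which makes each item's win probability proportional to its weight.

```python
import math
import random

def straw_select(weights, rng):
    # Toy straw2-style draw: item i scores ln(u)/w_i with u uniform in (0, 1];
    # the highest score wins. Equivalently -ln(u)/w_i is Exp(w_i) distributed,
    # so item i wins with probability w_i / sum(weights).
    best, best_score = None, -math.inf
    for i, w in enumerate(weights):
        u = rng.random() or 1e-12  # guard against log(0)
        score = math.log(u) / w
        if score > best_score:
            best, best_score = i, score
    return best

rng = random.Random(42)
weights = [10, 8, 6]
counts = [0] * len(weights)
for _ in range(30000):
    counts[straw_select(weights, rng)] += 1
freqs = [c / 30000 for c in counts]  # approximately proportional to 10:8:6
```

Fixing the first-pick anomaly then amounts to feeding this draw different weight vectors on different rounds.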
> That's the work that remains to be done. The only way that
> would avoid reimplementing the CRUSH algorithm and computing the
> gradient would be treating CRUSH as a black box and eliminating the
> necessity of computing the gradient either by using a gradient-free
> optimization method or making an estimation of the gradient.
By gradient-free optimization you mean simulated annealing or Monte Carlo ?
Cheers
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-23 14:13 ` Loic Dachary
@ 2017-03-23 15:32 ` Pedro López-Adeva
2017-03-23 16:18 ` Loic Dachary
` (3 more replies)
0 siblings, 4 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-23 15:32 UTC (permalink / raw)
To: Loic Dachary; +Cc: ceph-devel
There are a lot of gradient-free methods. I will first try to run the
ones available using just scipy
(https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
Some of them don't require the gradient and some of them can estimate
it. The reason to go without the gradient is to run the CRUSH
algorithm as a black box. In that case this would be the pseudo-code:
- BEGIN CODE -
import scipy.optimize

def build_target(desired_freqs):
    def target(weights):
        # run a simulation of CRUSH for a number of objects
        sim_freqs = run_crush(weights)
        # Kullback-Leibler divergence between desired frequencies
        # and current ones
        return loss(sim_freqs, desired_freqs)
    return target

# minimize() needs an initial guess; start from the raw CRUSH weights
res = scipy.optimize.minimize(build_target(desired_freqs), x0=initial_weights)
weights = res.x
- END CODE -
The tricky thing here is that this procedure can be slow if the
simulation (run_crush) needs to place a lot of objects to get accurate
simulated frequencies. This is especially true if the minimize method
attempts to approximate the gradient using finite differences, since it
will evaluate the target function a number of times proportional to
the number of weights. Apart from the ones in scipy I would also try
optimization methods that try to perform as few evaluations as
possible like for example HyperOpt
(http://hyperopt.github.io/hyperopt/), which by the way takes into
account that the target function can be noisy.
This black box approximation is simple to implement and makes the
computer do all the work instead of us.
I think that this black box approximation is worth trying even if
it's not the final one because if this approximation works then we
know that a more elaborate one that computes the gradient of the CRUSH
algorithm will work for sure.
I can try this black box approximation this weekend not on the real
CRUSH algorithm but with the simple implementation I did in python. If
it works it's just a matter of substituting one simulation for
another and seeing what happens.
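The black box loop can be made concrete with a self-contained toy: here run_crush is replaced by an exact two-pick-without-replacement computation, the loss is the KL divergence, and the gradient is estimated by finite differences. All names are illustrative; this is a sketch under simplified assumptions, not the real CRUSH algorithm.

```python
import math

def pick2_freqs(weights):
    # Exact per-device usage for R=2 picks without replacement:
    # average of P(first pick = i) and P(second pick = i).
    T = float(sum(weights))
    n = len(weights)
    out = []
    for i in range(n):
        p1 = weights[i] / T
        p2 = sum(weights[j] / T * weights[i] / (T - weights[j])
                 for j in range(n) if j != i)
        out.append((p1 + p2) / 2.0)
    return out

def kl(p, q):
    # Kullback-Leibler divergence between desired and simulated frequencies.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def optimize(desired_freqs, weights, steps=500, eps=1e-4, lr=1.0):
    # Black-box loop: estimate the gradient of the loss by finite
    # differences and take small gradient-descent steps.
    w = list(weights)
    for _ in range(steps):
        base = kl(desired_freqs, pick2_freqs(w))
        grad = []
        for i in range(len(w)):
            w[i] += eps
            grad.append((kl(desired_freqs, pick2_freqs(w)) - base) / eps)
            w[i] -= eps
        w = [max(1e-6, wi - lr * g) for wi, g in zip(w, grad)]
    return w

raw = [10, 10, 10, 10, 1]
desired = [x / sum(raw) for x in raw]
tuned = optimize(desired, raw)
# The KL loss of the tuned weights is lower than with the raw weights.
```

Even this naive hand-rolled descent reduces the loss on the [10 10 10 10 1] example; in practice scipy's gradient-free methods (Nelder-Mead, Powell) or HyperOpt would replace the loop, with the real CRUSH simulation standing in for pick2_freqs.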
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-23 15:32 ` Pedro López-Adeva
@ 2017-03-23 16:18 ` Loic Dachary
2017-03-25 18:42 ` Sage Weil
` (2 subsequent siblings)
3 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-23 16:18 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
>
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies
>         # and current ones
>         return loss(sim_freqs, desired_freqs)
>     return target
>
> weights = scipy.optimize.minimize(build_target(desired_freqs))
> - END CODE -
>
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is true specially if the minimize method
> attempts to approximate the gradient using finite differences since it
> will evaluate the target function a number of times proportional to
> the number of weights). Apart from the ones in scipy I would try also
> optimization methods that try to perform as few evaluations as
> possible like for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
>
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worthy to try even if
> it's not the final one because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
>
> I can try this black box approximation this weekend not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation with
> another and see what happens.
Great! And I'll do whatever is needed to adapt what you did to use crush.
Cheers
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>> Great, thanks for the clarifications.
>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>
>>>>>>> I keep working on it.
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>> haven't made much headway.
>>>>>>>>
>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>> work on.
>>>>>>>>>
>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>> selecting the same one again.
>>>>>>>>
>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>> round/replica/rank.
>>>>>>>>
>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>> before it is enabled or used.)
>>>>>>>>
>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>> implementation?
>>>>>>>>
>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>> retry_descent.
>>>>>>>>
>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>> solution we come with here (or at least any solution can be suitably
>>>>>>>> adapted)!
>>>>>>>>
>>>>>>>> sage
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-23 15:32 ` Pedro López-Adeva
2017-03-23 16:18 ` Loic Dachary
@ 2017-03-25 18:42 ` Sage Weil
[not found] ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
2017-04-11 15:22 ` Loic Dachary
2017-04-22 16:51 ` Loic Dachary
3 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-25 18:42 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Loic Dachary, ceph-devel
[-- Attachment #1: Type: TEXT/PLAIN, Size: 13475 bytes --]
Hi Pedro, Loic,
For what it's worth, my intuition here (which has had a mixed record as
far as CRUSH goes) is that this is the most promising path forward.
Thinking ahead a few steps, and confirming that I'm following the
discussion so far, if you're able to get black (or white) box gradient
descent to work, then this will give us a set of weights for each item in
the tree for each selection round, derived from the tree structure and
original (target) weights. That would basically give us a map of item id
(bucket id or leaf item id) to weight for each round, i.e.,
map<int, map<int, float>> weight_by_position; // position -> item -> weight
where the 0 round would (I think?) match the target weights, and each
round after that would skew low-weighted items lower to some degree.
Right?
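For concreteness, a minimal sketch (names are illustrative, not the real implementation) of how such a per-round table could be derived in the flat single-bucket case, using the w/(T-w) conditional adjustment worked out earlier in this thread:

```python
# Hypothetical sketch of the map<int, map<int, float>> above for the flat
# case: round 0 keeps the target weights; on later rounds each item i is
# reweighted to w_i / (total - w_i), i.e. P(pick i | i not already picked).

def weights_by_position(target, rounds=2):
    total = sum(target.values())
    table = {0: dict(target)}  # round 0: the target weights
    for r in range(1, rounds):
        table[r] = {item: w / (total - w) for item, w in target.items()}
    return table

table = weights_by_position({0: 1.0, 1: 1.0, 2: 1.0, 3: 0.1})
# Relative to the full-weight items, the low-weighted item 3 is skewed
# lower on round 1: (0.1/3.0) / (1.0/2.1) is about 0.07, versus 0.1.
```

This only covers the flat case; whether the same adjustment composes correctly through a hierarchy is exactly the open question.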
The next question I have is: does this generalize from the single-bucket
case to the hierarchy? I.e., if I have a "tree" (single bucket) like
3.1
|_____________
| \ \ \
1.0 1.0 1.0 .1
it clearly works, but when we have a multi-level tree like
8.4
|____________________________________
| \ \
3.1 3.1 2.2
|_____________ |_____________ |_____________
| \ \ \ | \ \ \ | \ \ \
1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
and the second round weights skew the small .1 leaves lower, can we
continue to build the summed-weight hierarchy, such that the adjusted
weights at the higher level are appropriately adjusted to give us the
right probabilities of descending into those trees? I'm not sure if that
logically follows from the above or if my intuition is oversimplifying
things.
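One way to make the question concrete: after adjusting the leaves for a given round, recompute each bucket's weight as the sum of its adjusted children, so the probability of descending into a subtree tracks the adjusted mass below it. A toy sketch, purely to illustrate the question (not a claim that this composition is correct):

```python
# Toy sketch for the two-level tree above: adjust leaves for the second
# round with w / (total - w), then re-sum each host bucket so descent
# probabilities follow the adjusted leaf mass. Whether this composes
# correctly is the open question in this thread.
hosts = [[1.0, 1.0, 1.0, 0.1], [1.0, 1.0, 1.0, 0.1], [1.0, 1.0, 0.1, 0.1]]
total = sum(sum(h) for h in hosts)  # 8.4

adjusted = [[w / (total - w) for w in h] for h in hosts]
bucket_weights = [sum(h) for h in adjusted]
# The host with two .1 leaves now carries slightly less weight, relative
# to the other hosts, than its raw sum (2.2 vs 3.1) would suggest.
```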
If this *is* how we think this will shake out, then I'm wondering if we
should go ahead and build this weight matrix into CRUSH sooner rather
than later (i.e., for luminous). As with the explicit remappings, the
hard part is all done offline, and the adjustments to the CRUSH mapping
calculation itself (storing and making use of the adjusted weights for
each round of placement) are relatively straightforward. And the sooner
this is incorporated into a release the sooner real users will be able to
roll out code to all clients and start making use of it.
Thanks again for looking at this problem! I'm excited that we may be
closing in on a real solution!
sage
On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
>
> - BEGIN CODE -
> def build_target(desired_freqs):
> def target(weights):
> # run a simulation of CRUSH for a number of objects
> sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies and current ones
> return loss(sim_freqs, desired_freqs)
> return target
>
> weights = scipy.optimize.minimize(build_target(desired_freqs))
> - END CODE -
>
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is true especially if the minimize method
> attempts to approximate the gradient using finite differences since it
> will evaluate the target function a number of times proportional to
> the number of weights. Apart from the ones in scipy I would try also
> optimization methods that try to perform as few evaluations as
> possible like for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
>
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worth trying even if
> it's not the final one because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
>
> I can try this black box approximation this weekend not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation with
> another and see what happens.
>
> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> > Hi Pedro,
> >
> > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> Hi Loic,
> >>
>> From what I see everything seems OK.
> >
> > Cool. I'll keep going in this direction then !
> >
> >> The interesting thing would be to
> >> test on some complex mapping. The reason is that "CrushPolicyFamily"
> >> is right now modeling just a single straw bucket not the full CRUSH
> >> algorithm.
> >
> > A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
> >
> >> That's the work that remains to be done. The only way that
> >> would avoid reimplementing the CRUSH algorithm and computing the
> >> gradient would be treating CRUSH as a black box and eliminating the
> >> necessity of computing the gradient either by using a gradient-free
> >> optimization method or making an estimation of the gradient.
> >
> > By gradient-free optimization you mean simulated annealing or Monte Carlo ?
> >
> > Cheers
> >
> >>
> >>
> >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >>> Hi,
> >>>
> >>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
> >>>
> >>> Thanks !
> >>>
> >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
> >>> ------------------------------------------------------------------------
> >>> Before: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>> disk 0: 1.39e-01 1.12e-01
> >>> disk 1: 1.11e-01 1.10e-01
> >>> disk 2: 8.33e-02 1.13e-01
> >>> disk 3: 1.39e-01 1.11e-01
> >>> disk 4: 1.11e-01 1.11e-01
> >>> disk 5: 8.33e-02 1.11e-01
> >>> disk 6: 1.39e-01 1.12e-01
> >>> disk 7: 1.11e-01 1.12e-01
> >>> disk 8: 8.33e-02 1.10e-01
> >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
> >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
> >>> ...
> >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
> >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
> >>> Converged to desired accuracy :)
> >>> After: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>> disk 0: 1.39e-01 1.42e-01
> >>> disk 1: 1.11e-01 1.09e-01
> >>> disk 2: 8.33e-02 8.37e-02
> >>> disk 3: 1.39e-01 1.40e-01
> >>> disk 4: 1.11e-01 1.13e-01
> >>> disk 5: 8.33e-02 8.08e-02
> >>> disk 6: 1.39e-01 1.38e-01
> >>> disk 7: 1.11e-01 1.09e-01
> >>> disk 8: 8.33e-02 8.48e-02
> >>>
> >>>
> >>> Simulation: R=2 devices capacity [10 10 10 10 1]
> >>> ------------------------------------------------------------------------
> >>> Before: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>> disk 0: 2.44e-01 2.36e-01
> >>> disk 1: 2.44e-01 2.38e-01
> >>> disk 2: 2.44e-01 2.34e-01
> >>> disk 3: 2.44e-01 2.38e-01
> >>> disk 4: 2.44e-02 5.37e-02
> >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
> >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
> >>> ...
> >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
> >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
> >>> Converged to desired accuracy :)
> >>> After: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>> disk 0: 2.44e-01 2.46e-01
> >>> disk 1: 2.44e-01 2.44e-01
> >>> disk 2: 2.44e-01 2.41e-01
> >>> disk 3: 2.44e-01 2.45e-01
> >>> disk 4: 2.44e-02 2.33e-02
> >>>
> >>>
> >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >>>
> >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >>>> Hi Pedro,
> >>>>
> >>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
> >>>>
> >>>> Cheers
> >>>>
> >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >>>>> Hi Pedro,
> >>>>>
> >>>>> I'm going to experiment with what you did at
> >>>>>
> >>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>>>>
> >>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >>>>>> Great, thanks for the clarifications.
> >>>>>> I also think that the most natural way is to keep just a set of
> >>>>>> weights in the CRUSH map and update them inside the algorithm.
> >>>>>>
> >>>>>> I keep working on it.
> >>>>>>
> >>>>>>
> >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
> >>>>>>> Hi Pedro,
> >>>>>>>
> >>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
> >>>>>>> haven't made much headway.
> >>>>>>>
> >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I will have a look. BTW, I have not progressed that much but I have
> >>>>>>>> been thinking about it. In order to adapt the previous algorithm in
> >>>>>>>> the python notebook I need to substitute the iteration over all
> >>>>>>>> possible device permutations with iteration over all the possible
> >>>>>>>> selections that crush would make. That is the main thing I need to
> >>>>>>>> work on.
> >>>>>>>>
> >>>>>>>> The other thing is of course that weights change for each replica.
> >>>>>>>> That is, they cannot be really fixed in the crush map. So the
> >>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
> >>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
> >>>>>>>> desired usage frequencies. Or maybe each replica should have their own
> >>>>>>>> crush map, but then the information about the previous selection
> >>>>>>>> should be passed to the next replica placement run so it avoids
> >>>>>>>> selecting the same one again.
> >>>>>>>
> >>>>>>> My suspicion is that the best solution here (whatever that means!)
> >>>>>>> leaves the CRUSH weights intact with the desired distribution, and
> >>>>>>> then generates a set of derivative weights--probably one set for each
> >>>>>>> round/replica/rank.
> >>>>>>>
> >>>>>>> One nice property of this is that once the support is added to encode
> >>>>>>> multiple sets of weights, the algorithm used to generate them is free to
> >>>>>>> change and evolve independently. (In most cases any change in
> >>>>>>> CRUSH's mapping behavior is difficult to roll out because all
> >>>>>>> parties participating in the cluster have to support any new behavior
> >>>>>>> before it is enabled or used.)
> >>>>>>>
> >>>>>>>> I have a question also. Is there any significant difference between
> >>>>>>>> the device selection algorithm description in the paper and its final
> >>>>>>>> implementation?
> >>>>>>>
> >>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
> >>>>>>> idea; any collision or failed()/overload() case triggers the
> >>>>>>> retry_descent.
> >>>>>>>
> >>>>>>> There are other changes, of course, but I don't think they'll impact any
> >>>>>>> solution we come with here (or at least any solution can be suitably
> >>>>>>> adapted)!
> >>>>>>>
> >>>>>>> sage
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>>> the body of a message to majordomo@vger.kernel.org
> >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>> --
> >>> Loïc Dachary, Artisan Logiciel Libre
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
[not found] ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
@ 2017-03-27 2:33 ` Sage Weil
2017-03-27 6:45 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-27 2:33 UTC (permalink / raw)
To: Adam Kupczyk; +Cc: Pedro López-Adeva, Loic Dachary, Ceph Development
[-- Attachment #1: Type: TEXT/PLAIN, Size: 20718 bytes --]
On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> Hello Sage, Loic, Pedro,
>
>
> I am certain that almost perfect mapping can be achieved by
> substituting weights from crush map with slightly modified weights.
> By perfect mapping I mean we get on each OSD number of PGs exactly
> proportional to weights specified in crush map.
>
> 1. Example
> Lets think of PGs of single object pool.
> We have OSDs with following weights:
> [10, 10, 10, 5, 5]
>
> Ideally, we would like following distribution of 200PG x 3 copies = 600
> PGcopies :
> [150, 150, 150, 75, 75]
>
> However, because crush simulates random process we have:
> [143, 152, 158, 71, 76]
>
> We could have obtained perfect distribution had we used weights like this:
> [10.2, 9.9, 9.6, 5.2, 4.9]
>
>
> 2. Obtaining perfect mapping weights from OSD capacity weights
>
> When we apply crush for the first time, distribution of PGs comes as random.
> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>
> But CRUSH is not a random process at all; it behaves in a numerically stable way.
> Specifically, if we increase weight on one node, we will get more PGs on
> this node and less on every other node:
> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>
> Now, finding ideal weights can be done by any numerical minimization method,
> for example NLMS.
>
>
> 3. The proposal
> For each pool, from initial weights given in crush map perfect weights will
> be derived.
> These weights will be used to calculate the PG distribution. This of course will
> be close to perfect.
>
> 3a: Downside when OSD is out
> When an OSD is out, missing PG copies will be replicated elsewhere.
> Because now weights deviate from OSD capacity, some OSDs will statistically
> get more copies than they should.
> This unevenness in distribution is proportional to the scale of deviation of
> the calculated weights from the capacity weights.
>
> 3b: Upside
> This all can be achieved without changes to crush.
Yes!
And no. You're totally right--we should use an offline optimization to
tweak the crush input weights to get a better balance. It won't be robust
to changes to the cluster, but we can incrementally optimize after that
happens to converge on something better.
The problem with doing this with current versions of Ceph is that we lose
the original "input" or "target" weights (i.e., the actual size of
the OSD) that we want to converge on. This is one reason why we haven't
done something like this before.
In luminous we *could* work around this by storing those canonical
weights outside of crush using something (probably?) ugly and
maintain backward compatibility with older clients using existing
CRUSH behavior.
OR, (and this is my preferred route), if the multi-pick anomaly approach
that Pedro is working on works out, we'll want to extend the CRUSH map to
include a set of derivative weights used for actual placement calculations
instead of the canonical target weights, and we can do what you're
proposing *and* solve the multipick problem with one change in the crush
map and algorithm. (Actually choosing those derivative weights will
be an offline process that can both improve the balance for the inputs we
care about *and* adjust them based on the position to fix the skew issue
for replicas.) This doesn't help pre-luminous clients, but I think the
end solution will be simpler and more elegant...
What do you think?
sage
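As a rough illustration of the offline loop Adam describes in (2): treat the placement engine as a black box and nudge the input weights until the observed counts match the targets. This is a sketch against a stand-in random simulator, not real CRUSH; all names are hypothetical.

```python
# Sketch only: "simulate" stands in for an actual CRUSH placement run.
# The damped multiplicative update gives underfilled devices more weight
# on each iteration.
import random

def simulate(weights, pgs=600, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(pgs):
        counts[rng.choices(range(len(weights)), weights=weights)[0]] += 1
    return counts

target = [10.0, 10.0, 10.0, 5.0, 5.0]
want = [w / sum(target) for w in target]   # desired fractions
weights = list(target)                     # start from capacity weights
for it in range(50):
    counts = simulate(weights, seed=it)
    observed = [c / sum(counts) for c in counts]
    # nudge weights toward the target fractions (damped with sqrt)
    weights = [w * (t / max(o, 1e-9)) ** 0.5
               for w, t, o in zip(weights, want, observed)]
```

With real CRUSH the loop is the same but each evaluation is expensive, which is why Pedro suggests optimizers that keep the number of function evaluations small.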
> 4. Extra
> Some time ago I made such change to perfectly balance Thomson-Reuters
> cluster.
> It succeeded.
> A solution was not accepted, because modifications of OSD weights were higher
> than 50%, which was caused by the fact that different placement rules operated
> on different sets of OSDs, and those sets were not disjoint.
>
> Best regards,
> Adam
>
>
> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> Hi Pedro, Loic,
>
> For what it's worth, my intuition here (which has had a mixed
> record as
> far as CRUSH goes) is that this is the most promising path
> forward.
>
> Thinking ahead a few steps, and confirming that I'm following
> the
> discussion so far, if you're able to get black (or white) box
> gradient
> descent to work, then this will give us a set of weights for
> each item in
> the tree for each selection round, derived from the tree
> structure and
> original (target) weights. That would basically give us a map
> of item id
> (bucket id or leaf item id) to weight for each round, i.e.,
>
> map<int, map<int, float>> weight_by_position; // position ->
> item -> weight
>
> where the 0 round would (I think?) match the target weights, and
> each
> round after that would skew low-weighted items lower to some
> degree.
> Right?
>
> The next question I have is: does this generalize from the
> single-bucket
> case to the hierarchy? I.e., if I have a "tree" (single bucket)
> like
>
> 3.1
> |_____________
> | \ \ \
> 1.0 1.0 1.0 .1
>
> it clearly works, but when we have a multi-level tree like
>
>
> 8.4
> |____________________________________
> | \ \
> 3.1 3.1 2.2
> |_____________ |_____________ |_____________
> | \ \ \ | \ \ \ | \ \ \
> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>
> and the second round weights skew the small .1 leaves lower, can
> we
> continue to build the summed-weight hierarchy, such that the
> adjusted
> weights at the higher level are appropriately adjusted to give
> us the
> right probabilities of descending into those trees? I'm not
> sure if that
> logically follows from the above or if my intuition is
> oversimplifying
> things.
>
> If this *is* how we think this will shake out, then I'm
> wondering if we
> should go ahead and build this weight matrix into CRUSH sooner
> rather
> than later (i.e., for luminous). As with the explicit
> remappings, the
> hard part is all done offline, and the adjustments to the CRUSH
> mapping
> calculation itself (storing and making use of the adjusted
> weights for
> each round of placement) are relatively straightforward. And
> the sooner
> this is incorporated into a release the sooner real users will
> be able to
> roll out code to all clients and start making use of it.
>
> Thanks again for looking at this problem! I'm excited that we
> may be
> closing in on a real solution!
>
> sage
>
>
>
>
>
> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>
> > There are lot of gradient-free methods. I will try first to
> run the
> > ones available using just scipy
> >
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> > Some of them don't require the gradient and some of them can
> estimate
> > it. The reason to go without the gradient is to run the CRUSH
> > algorithm as a black box. In that case this would be the
> pseudo-code:
> >
> > - BEGIN CODE -
> > def build_target(desired_freqs):
> > def target(weights):
> > # run a simulation of CRUSH for a number of objects
> > sim_freqs = run_crush(weights)
> >         # Kullback-Leibler divergence between desired frequencies and current ones
> > return loss(sim_freqs, desired_freqs)
> > return target
> >
> > weights = scipy.optimize.minimize(build_target(desired_freqs))
> > - END CODE -
> >
> > The tricky thing here is that this procedure can be slow if
> the
> > simulation (run_crush) needs to place a lot of objects to get
> accurate
> > simulated frequencies. This is true especially if the minimize
> method
> > attempts to approximate the gradient using finite differences
> since it
> > will evaluate the target function a number of times
> proportional to
> > the number of weights. Apart from the ones in scipy I would
> try also
> > optimization methods that try to perform as few evaluations as
> > possible like for example HyperOpt
> > (http://hyperopt.github.io/hyperopt/), which by the way takes
> into
> > account that the target function can be noisy.
> >
> > This black box approximation is simple to implement and makes
> the
> > computer do all the work instead of us.
> > I think that this black box approximation is worth trying
> even if
> > it's not the final one because if this approximation works
> then we
> > know that a more elaborate one that computes the gradient of
> the CRUSH
> > algorithm will work for sure.
> >
> > I can try this black box approximation this weekend not on the
> real
> > CRUSH algorithm but with the simple implementation I did in
> python. If
> > it works it's just a matter of substituting one simulation
> with
> > another and see what happens.
> >
> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> > > Hi Pedro,
> > >
> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> > >> Hi Loic,
> > >>
> > >> From what I see everything seems OK.
> > >
> > > Cool. I'll keep going in this direction then !
> > >
> > >> The interesting thing would be to
> > >> test on some complex mapping. The reason is that
> "CrushPolicyFamily"
> > >> is right now modeling just a single straw bucket not the
> full CRUSH
> > >> algorithm.
> > >
> > > A number of use cases use a single straw bucket, maybe the
> majority of them. Even though it does not reflect the full range
> of what crush can offer, it could be useful. To be more
> specific, a crush map that states "place objects so that there
> is at most one replica per host" or "one replica per rack" is
> common. Such a crushmap can be reduced to a single straw bucket
> that contains all the hosts and by using the CrushPolicyFamily,
> we can change the weights of each host to fix the probabilities.
> The hosts themselves contain disks with varying weights but I
> think we can ignore that because crush will only recurse to
> place one object within a given host.
> > >
> > >> That's the work that remains to be done. The only way that
> > >> would avoid reimplementing the CRUSH algorithm and
> computing the
> > >> gradient would be treating CRUSH as a black box and
> eliminating the
> > >> necessity of computing the gradient either by using a
> gradient-free
> > >> optimization method or making an estimation of the
> gradient.
> > >
> > > By gradient-free optimization you mean simulated annealing
> or Monte Carlo ?
> > >
> > > Cheers
> > >
> > >>
> > >>
> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> > >>> Hi,
> > >>>
> > >>> I modified the crush library to accept two weights (one
> for the first disk, the other for the remaining disks)[1]. This
> really is a hack for experimentation purposes only ;-) I was
> able to run a variation of your code[2] and got the following
> results which are encouraging. Do you think what I did is
> sensible ? Or is there a problem I don't see ?
> > >>>
> > >>> Thanks !
> > >>>
> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
> 6]
> > >>>
> ------------------------------------------------------------------------
> > >>> Before: All replicas on each hard drive
> > >>> Expected vs actual use (20000 samples)
> > >>> disk 0: 1.39e-01 1.12e-01
> > >>> disk 1: 1.11e-01 1.10e-01
> > >>> disk 2: 8.33e-02 1.13e-01
> > >>> disk 3: 1.39e-01 1.11e-01
> > >>> disk 4: 1.11e-01 1.11e-01
> > >>> disk 5: 8.33e-02 1.11e-01
> > >>> disk 6: 1.39e-01 1.12e-01
> > >>> disk 7: 1.11e-01 1.12e-01
> > >>> disk 8: 8.33e-02 1.10e-01
> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
> > >>> ...
> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
> > >>> Converged to desired accuracy :)
> > >>> After: All replicas on each hard drive
> > >>> Expected vs actual use (20000 samples)
> > >>> disk 0: 1.39e-01 1.42e-01
> > >>> disk 1: 1.11e-01 1.09e-01
> > >>> disk 2: 8.33e-02 8.37e-02
> > >>> disk 3: 1.39e-01 1.40e-01
> > >>> disk 4: 1.11e-01 1.13e-01
> > >>> disk 5: 8.33e-02 8.08e-02
> > >>> disk 6: 1.39e-01 1.38e-01
> > >>> disk 7: 1.11e-01 1.09e-01
> > >>> disk 8: 8.33e-02 8.48e-02
> > >>>
> > >>>
> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
> > >>>
> ------------------------------------------------------------------------
> > >>> Before: All replicas on each hard drive
> > >>> Expected vs actual use (20000 samples)
> > >>> disk 0: 2.44e-01 2.36e-01
> > >>> disk 1: 2.44e-01 2.38e-01
> > >>> disk 2: 2.44e-01 2.34e-01
> > >>> disk 3: 2.44e-01 2.38e-01
> > >>> disk 4: 2.44e-02 5.37e-02
> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
> > >>> ...
> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
> > >>> Converged to desired accuracy :)
> > >>> After: All replicas on each hard drive
> > >>> Expected vs actual use (20000 samples)
> > >>> disk 0: 2.44e-01 2.46e-01
> > >>> disk 1: 2.44e-01 2.44e-01
> > >>> disk 2: 2.44e-01 2.41e-01
> > >>> disk 3: 2.44e-01 2.45e-01
> > >>> disk 4: 2.44e-02 2.33e-02
> > >>>
> > >>>
> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> > >>>
> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> > >>>> Hi Pedro,
> > >>>>
> > >>>> It looks like trying to experiment with crush won't work
> as expected because crush does not distinguish the probability
> of selecting the first device from the probability of selecting
> the second or third device. Am I mistaken ?
> > >>>>
> > >>>> Cheers
> > >>>>
> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> > >>>>> Hi Pedro,
> > >>>>>
> > >>>>> I'm going to experiment with what you did at
> > >>>>>
> > >>>>>
> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> > >>>>>
> > >>>>> and the latest python-crush published today. A
> comparison function was added that will help measure the data
> movement. I'm hoping we can release an offline tool based on
> your solution. Please let me know if I should wait before diving
> into this, in case you have unpublished drafts or new ideas.
> > >>>>>
> > >>>>> Cheers
> > >>>>>
> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> > >>>>>> Great, thanks for the clarifications.
> > >>>>>> I also think that the most natural way is to keep just
> a set of
> > >>>>>> weights in the CRUSH map and update them inside the
> algorithm.
> > >>>>>>
> > >>>>>> I keep working on it.
> > >>>>>>
> > >>>>>>
> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> <sage@newdream.net>:
> > >>>>>>> Hi Pedro,
> > >>>>>>>
> > >>>>>>> Thanks for taking a look at this! It's a frustrating
> problem and we
> > >>>>>>> haven't made much headway.
> > >>>>>>>
> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> I will have a look. BTW, I have not progressed that
> much but I have
> > >>>>>>>> been thinking about it. In order to adapt the
> previous algorithm in
> > >>>>>>>> the python notebook I need to replace the
> iteration over all
> > >>>>>>>> possible device permutations with iteration over all
> the possible
> > >>>>>>>> selections that crush would make. That is the main
> thing I need to
> > >>>>>>>> work on.
> > >>>>>>>>
> > >>>>>>>> The other thing is of course that weights change for
> each replica.
> > >>>>>>>> That is, they cannot be really fixed in the crush
> map. So the
> > >>>>>>>> algorithm inside libcrush, not only the weights in
> the map, needs to be
> > >>>>>>>> changed. The weights in the crush map should reflect
> then, maybe, the
> > >>>>>>>> desired usage frequencies. Or maybe each replica
> > >>>>>>>> should have its own
> > >>>>>>>> crush map, but then the information about the
> previous selection
> > >>>>>>>> should be passed to the next replica placement run so
> it avoids
> > >>>>>>>> selecting the same one again.
> > >>>>>>>
> > >>>>>>> My suspicion is that the best solution here (whatever
> that means!)
> > >>>>>>> leaves the CRUSH weights intact with the desired
> distribution, and
> > >>>>>>> then generates a set of derivative weights--probably
> one set for each
> > >>>>>>> round/replica/rank.
> > >>>>>>>
> > >>>>>>> One nice property of this is that once the support is
> added to encode
> > >>>>>>> multiple sets of weights, the algorithm used to
> generate them is free to
> > >>>>>>> change and evolve independently. (In most cases any
> change in
> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> because all
> > >>>>>>> parties participating in the cluster have to support
> any new behavior
> > >>>>>>> before it is enabled or used.)
> > >>>>>>>
> > >>>>>>>> I have a question also. Is there any significant
> difference between
> > >>>>>>>> the device selection algorithm description in the
> paper and its final
> > >>>>>>>> implementation?
> > >>>>>>>
> > >>>>>>> The main difference is the "retry_bucket" behavior was
> found to be a bad
> > >>>>>>> idea; any collision or failed()/overload() case
> triggers the
> > >>>>>>> retry_descent.
> > >>>>>>>
> > >>>>>>> There are other changes, of course, but I don't think
> they'll impact any
> > >>>>>>> solution we come with here (or at least any solution
> can be suitably
> > >>>>>>> adapted)!
> > >>>>>>>
> > >>>>>>> sage
> > >>>>>> --
> > >>>>>> To unsubscribe from this list: send the line
> "unsubscribe ceph-devel" in
> > >>>>>> the body of a message to majordomo@vger.kernel.org
> > >>>>>> More majordomo info at
> http://vger.kernel.org/majordomo-info.html
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Loïc Dachary, Artisan Logiciel Libre
> > >
> > > --
> > > Loïc Dachary, Artisan Logiciel Libre
> >
>
>
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-27 2:33 ` Sage Weil
@ 2017-03-27 6:45 ` Loic Dachary
[not found] ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
2017-03-27 13:24 ` Sage Weil
0 siblings, 2 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-27 6:45 UTC (permalink / raw)
To: Sage Weil, Adam Kupczyk; +Cc: Ceph Development
On 03/27/2017 04:33 AM, Sage Weil wrote:
> On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>> Hello Sage, Loic, Pedro,
>>
>>
>> I am certain that almost perfect mapping can be achieved by
>> substituting weights from crush map with slightly modified weights.
>> By perfect mapping I mean we get on each OSD number of PGs exactly
>> proportional to weights specified in crush map.
>>
>> 1. Example
>> Lets think of PGs of single object pool.
>> We have OSDs with following weights:
>> [10, 10, 10, 5, 5]
>>
>> Ideally, we would like following distribution of 200PG x 3 copies = 600
>> PGcopies :
>> [150, 150, 150, 75, 75]
>>
>> However, because crush simulates a random process we have:
>> [143, 152, 158, 71, 76]
>>
>> We could have obtained perfect distribution had we used weights like this:
>> [10.2, 9.9, 9.6, 5.2, 4.9]
>>
>>
>> 2. Obtaining perfect mapping weights from OSD capacity weights
>>
>> When we apply crush for the first time, distribution of PGs comes as random.
>> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>
>> But CRUSH is not a random process at all; it behaves in a numerically stable way.
>> Specifically, if we increase weight on one node, we will get more PGs on
>> this node and less on every other node:
>> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>
>> Now, finding ideal weights can be done by any numerical minimization method,
>> for example NLMS.
>>
>>
>> 3. The proposal
>> For each pool, perfect weights will be derived from the initial weights
>> given in the crush map.
>> These weights will be used to calculate the PG distribution. This of course
>> will be close to perfect.
>>
>> 3a: Downside when OSD is out
>> When an OSD is out, missing PG copies will be replicated elsewhere.
>> Because the weights now deviate from OSD capacity, some OSDs will
>> statistically get more copies than they should.
>> This unevenness in distribution is proportional to how far the calculated
>> weights deviate from the capacity weights.
>>
>> 3b: Upside
>> This all can be achieved without changes to crush.
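The loop Adam describes (simulate, compare to the ideal counts, nudge the weights, repeat) can be illustrated with a toy stand-in for CRUSH. Everything below is a sketch for intuition only: `pick_one` is a plain weighted sampler rather than libcrush, and the 0.5 damping exponent is an arbitrary choice, not NLMS.

```python
import random

def pick_one(rng, weights, excluded):
    # Weighted draw over the devices not yet chosen for this PG.
    candidates = [i for i in range(len(weights)) if i not in excluded]
    r = rng.uniform(0.0, sum(weights[i] for i in candidates))
    for i in candidates:
        if r < weights[i]:
            return i
        r -= weights[i]
    return candidates[-1]  # guard against float rounding

def simulate_pg_copies(weights, num_pgs=200, replicas=3, seed=1):
    # Deterministic pseudo-random placement: `replicas` distinct devices per PG.
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(num_pgs):
        chosen = []
        for _ in range(replicas):
            chosen.append(pick_one(rng, weights, chosen))
        for i in chosen:
            counts[i] += 1
    return counts

def tweak_weights(capacity_weights, iters=40):
    # Damped multiplicative correction toward the ideal PG-copy counts.
    total = sum(capacity_weights)
    ideal = [w / total * 200 * 3 for w in capacity_weights]
    tweaked = [float(w) for w in capacity_weights]
    for _ in range(iters):
        counts = simulate_pg_copies(tweaked)
        tweaked = [t * (i / max(c, 1)) ** 0.5
                   for t, i, c in zip(tweaked, ideal, counts)]
    return tweaked
```

On [10, 10, 10, 5, 5] this should pull the simulated counts closer to the ideal [150, 150, 150, 75, 75] than the raw capacity weights get; a real tool would of course drive libcrush itself (e.g. via python-crush) instead of this sampler.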
>
> Yes!
>
> And no. You're totally right--we should use an offline optimization to
> tweak the crush input weights to get a better balance. It won't be robust
> to changes to the cluster, but we can incrementally optimize after that
> happens to converge on something better.
>
> The problem with doing this with current versions of Ceph is that we lose
> the original "input" or "target" weights (i.e., the actual size of
> the OSD) that we want to converge on. This is one reason why we haven't
> done something like this before.
>
> In luminous we *could* work around this by storing those canonical
> weights outside of crush using something (probably?) ugly and
> maintain backward compatibility with older clients using existing
> CRUSH behavior.
These canonical weights could be stored in crush by creating dedicated buckets. For instance a root-canonical bucket could be created to store the canonical weights of the root bucket. The sysadmin needs to be aware of the difference and know to add a new device to the host01-canonical bucket instead of the host01 bucket, and to run an offline tool that keeps the two buckets in sync and computes the weights to use for placement from the weights representing the device capacity.
It is a little bit ugly ;-)
> OR, (and this is my preferred route), if the multi-pick anomaly approach
> that Pedro is working on works out, we'll want to extend the CRUSH map to
> include a set of derivative weights used for actual placement calculations
> instead of the canonical target weights, and we can do what you're
> proposing *and* solve the multipick problem with one change in the crush
> map and algorithm. (Actually choosing those derivative weights will
> be an offline process that can both improve the balance for the inputs we
> care about *and* adjust them based on the position to fix the skew issue
> for replicas.) This doesn't help pre-luminous clients, but I think the
> end solution will be simpler and more elegant...
>
> What do you think?
>
> sage
>
>
>> 4. Extra
>> Some time ago I made such a change to perfectly balance a Thomson-Reuters
>> cluster.
>> It succeeded.
>> The solution was not accepted, because modifications of OSD weights were
>> higher than 50%, caused by the fact that different placement rules operated
>> on different sets of OSDs, and those sets were not disjoint.
>
>
>>
>> Best regards,
>> Adam
>>
>>
>> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>> Hi Pedro, Loic,
>>
>> For what it's worth, my intuition here (which has had a mixed
>> record as
>> far as CRUSH goes) is that this is the most promising path
>> forward.
>>
>> Thinking ahead a few steps, and confirming that I'm following
>> the
>> discussion so far, if you're able to get black (or white) box
>> gradient
>> descent to work, then this will give us a set of weights for
>> each item in
>> the tree for each selection round, derived from the tree
>> structure and
>> original (target) weights. That would basically give us a map
>> of item id
>> (bucket id or leaf item id) to weight for each round. i.e.,
>>
>> map<int, map<int, float>> weight_by_position; // position ->
>> item -> weight
>>
>> where the 0 round would (I think?) match the target weights, and
>> each
>> round after that would skew low-weighted items lower to some
>> degree.
>> Right?
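For the single-bucket case, the round-1 (second pick) adjustment was already derived earlier in this thread in closed form: replace each w_i by w_i / (total - w_i). A small sketch can check how close that gets by exact enumeration over first picks; this is a toy model of "re-draw from adjusted weights, excluding the first pick", not libcrush:

```python
def adjusted_for_second_round(weights):
    # Per the conditional-probability derivation earlier in the thread:
    # P(pick i | i not already picked) is proportional to w_i / (total - w_i).
    total = sum(weights)
    return [w / (total - w) for w in weights]

def second_pick_distribution(weights):
    # Exact P(second distinct pick = j) when the second round draws from
    # the adjusted weights, renormalized to exclude the first pick.
    total = sum(weights)
    adj = adjusted_for_second_round(weights)
    probs = [0.0] * len(weights)
    for i, wi in enumerate(weights):          # enumerate first picks
        p_first = wi / total
        rest = sum(a for j, a in enumerate(adj) if j != i)
        for j, aj in enumerate(adj):
            if j != i:
                probs[j] += p_first * aj / rest
    return probs
```

For weights [10, 10, 10, 5, 5] the small devices' second-pick share comes out around 12.8% against the 12.5% target, versus roughly 14.3% with unadjusted weights, which matches the "almost perfectly" caveat from the original derivation.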
>>
>> The next question I have is: does this generalize from the
>> single-bucket
>> case to the hierarchy? I.e., if I have a "tree" (single bucket)
>> like
>>
>> 3.1
>> |_____________
>> | \ \ \
>> 1.0 1.0 1.0 .1
>>
>> it clearly works, but when we have a multi-level tree like
>>
>>
>> 8.4
>> |____________________________________
>> | \ \
>> 3.1 3.1 2.2
>> |_____________ |_____________ |_____________
>> | \ \ \ | \ \ \ | \ \ \
>> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>>
>> and the second round weights skew the small .1 leaves lower, can
>> we
>> continue to build the summed-weight hierarchy, such that the
>> adjusted
>> weights at the higher level are appropriately adjusted to give
>> us the
>> right probabilities of descending into those trees? I'm not
>> sure if that
>> logically follows from the above or if my intuition is
>> oversimplifying
>> things.
>>
>> If this *is* how we think this will shake out, then I'm
>> wondering if we
>> should go ahead and build this weight matrix into CRUSH sooner
>> rather
>> than later (i.e., for luminous). As with the explicit
>> remappings, the
>> hard part is all done offline, and the adjustments to the CRUSH
>> mapping
>> calculation itself (storing and making use of the adjusted
>> weights for
>> each round of placement) are relatively straightforward. And
>> the sooner
>> this is incorporated into a release the sooner real users will
>> be able to
>> roll out code to all clients and start making use of it.
>>
>> Thanks again for looking at this problem! I'm excited that we
>> may be
>> closing in on a real solution!
>>
>> sage
>>
>>
>>
>>
>>
>> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>
>> > There are a lot of gradient-free methods. I will first try to
>> run the
>> > ones available using just scipy
>> >
>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>> > Some of them don't require the gradient and some of them can
>> estimate
>> > it. The reason to go without the gradient is to run the CRUSH
>> > algorithm as a black box. In that case this would be the
>> pseudo-code:
>> >
>> > - BEGIN CODE -
>> > def build_target(desired_freqs):
>> >     def target(weights):
>> >         # run a simulation of CRUSH for a number of objects
>> >         sim_freqs = run_crush(weights)
>> >         # Kullback-Leibler divergence between desired
>> >         # frequencies and current ones
>> >         return loss(sim_freqs, desired_freqs)
>> >     return target
>> >
>> > weights = scipy.optimize.minimize(build_target(desired_freqs))
>> > - END CODE -
>> >
>> > The tricky thing here is that this procedure can be slow if
>> the
>> > simulation (run_crush) needs to place a lot of objects to get
>> accurate
>> > simulated frequencies. This is especially true if the minimize
>> method
>> > attempts to approximate the gradient using finite differences
>> since it
>> > will evaluate the target function a number of times
>> proportional to
>> > the number of weights. Apart from the ones in scipy I would
>> try also
>> > optimization methods that try to perform as few evaluations as
>> > possible like for example HyperOpt
>> > (http://hyperopt.github.io/hyperopt/), which by the way takes
>> into
>> > account that the target function can be noisy.
>> >
>> > This black box approximation is simple to implement and makes
>> the
>> > computer do all the work instead of us.
>> > I think that this black box approximation is worth trying
>> even if
>> > it's not the final one because if this approximation works
>> then we
>> > know that a more elaborate one that computes the gradient of
>> the CRUSH
>> > algorithm will work for sure.
>> >
>> > I can try this black box approximation this weekend not on the
>> real
>> > CRUSH algorithm but with the simple implementation I did in
>> python. If
>> > it works it's just a matter of substituting one simulation
>> with
>> > another and see what happens.
>> >
>> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> > > Hi Pedro,
>> > >
>> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>> > >> Hi Loic,
>> > >>
>> > >> From what I see everything seems OK.
>> > >
>> > > Cool. I'll keep going in this direction then !
>> > >
>> > >> The interesting thing would be to
>> > >> test on some complex mapping. The reason is that
>> "CrushPolicyFamily"
>> > >> is right now modeling just a single straw bucket not the
>> full CRUSH
>> > >> algorithm.
>> > >
>> > > A number of use cases use a single straw bucket, maybe the
>> majority of them. Even though it does not reflect the full range
>> of what crush can offer, it could be useful. To be more
>> specific, a crush map that states "place objects so that there
>> is at most one replica per host" or "one replica per rack" is
>> common. Such a crushmap can be reduced to a single straw bucket
>> that contains all the hosts and by using the CrushPolicyFamily,
>> we can change the weights of each host to fix the probabilities.
>> The hosts themselves contain disks with varying weights but I
>> think we can ignore that because crush will only recurse to
>> place one object within a given host.
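For reference, the single straw bucket that this reduction relies on selects an item by drawing ln(u)/w per item and keeping the maximum, which makes each item win with probability proportional to its weight. Below is a floating-point sketch of that straw2-style rule; libcrush itself feeds a hash of the input and item into a fixed-point log table rather than using a PRNG:

```python
import math
import random

def straw2_select(weights, rng):
    # Draw ln(u)/w for each item and keep the maximum. Since -ln(u)/w is
    # Exponential(rate=w), item i wins with probability w_i / sum(weights).
    best, best_draw = -1, -math.inf
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        draw = math.log(u) / w
        if draw > best_draw:
            best, best_draw = i, draw
    return best

def frequencies(weights, trials=20000, seed=7):
    # Empirical selection frequencies over many independent draws.
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(trials):
        counts[straw2_select(weights, rng)] += 1
    return [c / trials for c in counts]
```

A property relevant to the weight-tweaking discussion is that changing one item's draw (its weight) only moves mappings to or from that item, never between two unchanged items.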
>> > >
>> > >> That's the work that remains to be done. The only way that
>> > >> would avoid reimplementing the CRUSH algorithm and
>> computing the
>> > >> gradient would be treating CRUSH as a black box and
>> eliminating the
>> > >> necessity of computing the gradient either by using a
>> gradient-free
>> > >> optimization method or making an estimation of the
>> gradient.
>> > >
>> > > By gradient-free optimization you mean simulated annealing
>> or Monte Carlo ?
>> > >
>> > > Cheers
>> > >
>> > >>
>> > >>
>> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> > >>> Hi,
>> > >>>
>> > >>> I modified the crush library to accept two weights (one
>> for the first disk, the other for the remaining disks)[1]. This
>> really is a hack for experimentation purposes only ;-) I was
>> able to run a variation of your code[2] and got the following
>> results which are encouraging. Do you think what I did is
>> sensible ? Or is there a problem I don't see ?
>> > >>>
>> > >>> Thanks !
>> > >>>
>> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
>> 6]
>> > >>>
>> ------------------------------------------------------------------------
>> > >>> Before: All replicas on each hard drive
>> > >>> Expected vs actual use (20000 samples)
>> > >>> disk 0: 1.39e-01 1.12e-01
>> > >>> disk 1: 1.11e-01 1.10e-01
>> > >>> disk 2: 8.33e-02 1.13e-01
>> > >>> disk 3: 1.39e-01 1.11e-01
>> > >>> disk 4: 1.11e-01 1.11e-01
>> > >>> disk 5: 8.33e-02 1.11e-01
>> > >>> disk 6: 1.39e-01 1.12e-01
>> > >>> disk 7: 1.11e-01 1.12e-01
>> > >>> disk 8: 8.33e-02 1.10e-01
>> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>> > >>> ...
>> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>> > >>> Converged to desired accuracy :)
>> > >>> After: All replicas on each hard drive
>> > >>> Expected vs actual use (20000 samples)
>> > >>> disk 0: 1.39e-01 1.42e-01
>> > >>> disk 1: 1.11e-01 1.09e-01
>> > >>> disk 2: 8.33e-02 8.37e-02
>> > >>> disk 3: 1.39e-01 1.40e-01
>> > >>> disk 4: 1.11e-01 1.13e-01
>> > >>> disk 5: 8.33e-02 8.08e-02
>> > >>> disk 6: 1.39e-01 1.38e-01
>> > >>> disk 7: 1.11e-01 1.09e-01
>> > >>> disk 8: 8.33e-02 8.48e-02
>> > >>>
>> > >>>
>> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>> > >>>
>> ------------------------------------------------------------------------
>> > >>> Before: All replicas on each hard drive
>> > >>> Expected vs actual use (20000 samples)
>> > >>> disk 0: 2.44e-01 2.36e-01
>> > >>> disk 1: 2.44e-01 2.38e-01
>> > >>> disk 2: 2.44e-01 2.34e-01
>> > >>> disk 3: 2.44e-01 2.38e-01
>> > >>> disk 4: 2.44e-02 5.37e-02
>> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>> > >>> ...
>> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>> > >>> Converged to desired accuracy :)
>> > >>> After: All replicas on each hard drive
>> > >>> Expected vs actual use (20000 samples)
>> > >>> disk 0: 2.44e-01 2.46e-01
>> > >>> disk 1: 2.44e-01 2.44e-01
>> > >>> disk 2: 2.44e-01 2.41e-01
>> > >>> disk 3: 2.44e-01 2.45e-01
>> > >>> disk 4: 2.44e-02 2.33e-02
>> > >>>
>> > >>>
>> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>> > >>>
>> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> > >>>> Hi Pedro,
>> > >>>>
>> > >>>> It looks like trying to experiment with crush won't work
>> as expected because crush does not distinguish the probability
>> of selecting the first device from the probability of selecting
>> the second or third device. Am I mistaken ?
>> > >>>>
>> > >>>> Cheers
>> > >>>>
>> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> > >>>>> Hi Pedro,
>> > >>>>>
>> > >>>>> I'm going to experiment with what you did at
>> > >>>>>
>> > >>>>>
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> > >>>>>
>> > >>>>> and the latest python-crush published today. A
>> comparison function was added that will help measure the data
>> movement. I'm hoping we can release an offline tool based on
>> your solution. Please let me know if I should wait before diving
>> into this, in case you have unpublished drafts or new ideas.
>> > >>>>>
>> > >>>>> Cheers
>> > >>>>>
>> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> > >>>>>> Great, thanks for the clarifications.
>> > >>>>>> I also think that the most natural way is to keep just
>> a set of
>> > >>>>>> weights in the CRUSH map and update them inside the
>> algorithm.
>> > >>>>>>
>> > >>>>>> I keep working on it.
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>> <sage@newdream.net>:
>> > >>>>>>> Hi Pedro,
>> > >>>>>>>
>> > >>>>>>> Thanks for taking a look at this! It's a frustrating
>> problem and we
>> > >>>>>>> haven't made much headway.
>> > >>>>>>>
>> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> > >>>>>>>> Hi,
>> > >>>>>>>>
>> > >>>>>>>> I will have a look. BTW, I have not progressed that
>> much but I have
>> > >>>>>>>> been thinking about it. In order to adapt the
>> previous algorithm in
>> > >>>>>>>> the python notebook I need to replace the
>> iteration over all
>> > >>>>>>>> possible device permutations with iteration over all
>> the possible
>> > >>>>>>>> selections that crush would make. That is the main
>> thing I need to
>> > >>>>>>>> work on.
>> > >>>>>>>>
>> > >>>>>>>> The other thing is of course that weights change for
>> each replica.
>> > >>>>>>>> That is, they cannot be really fixed in the crush
>> map. So the
>> > >>>>>>>> algorithm inside libcrush, not only the weights in
>> the map, needs to be
>> > >>>>>>>> changed. The weights in the crush map should reflect
>> then, maybe, the
>> > >>>>>>>> desired usage frequencies. Or maybe each replica
>> should have its own
>> > >>>>>>>> crush map, but then the information about the
>> previous selection
>> > >>>>>>>> should be passed to the next replica placement run so
>> it avoids
>> > >>>>>>>> selecting the same one again.
>> > >>>>>>>
>> > >>>>>>> My suspicion is that the best solution here (whatever
>> that means!)
>> > >>>>>>> leaves the CRUSH weights intact with the desired
>> distribution, and
>> > >>>>>>> then generates a set of derivative weights--probably
>> one set for each
>> > >>>>>>> round/replica/rank.
>> > >>>>>>>
>> > >>>>>>> One nice property of this is that once the support is
>> added to encode
>> > >>>>>>> multiple sets of weights, the algorithm used to
>> generate them is free to
>> > >>>>>>> change and evolve independently. (In most cases any
>> change in
>> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>> because all
>> > >>>>>>> parties participating in the cluster have to support
>> any new behavior
>> > >>>>>>> before it is enabled or used.)
>> > >>>>>>>
>> > >>>>>>>> I have a question also. Is there any significant
>> difference between
>> > >>>>>>>> the device selection algorithm description in the
>> paper and its final
>> > >>>>>>>> implementation?
>> > >>>>>>>
>> > >>>>>>> The main difference is the "retry_bucket" behavior was
>> found to be a bad
>> > >>>>>>> idea; any collision or failed()/overload() case
>> triggers the
>> > >>>>>>> retry_descent.
>> > >>>>>>>
>> > >>>>>>> There are other changes, of course, but I don't think
>> they'll impact any
>> > >>>>>>> solution we come with here (or at least any solution
>> can be suitably
>> > >>>>>>> adapted)!
>> > >>>>>>>
>> > >>>>>>> sage
>> > >>>>>
>> > >>>>
>> > >>>
>> > >>> --
>> > >>> Loïc Dachary, Artisan Logiciel Libre
>> > >
>> > > --
>> > > Loïc Dachary, Artisan Logiciel Libre
>> >
>>
>>
>>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
[not found] ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
@ 2017-03-27 9:27 ` Adam Kupczyk
2017-03-27 10:29 ` Loic Dachary
` (2 more replies)
0 siblings, 3 replies; 70+ messages in thread
From: Adam Kupczyk @ 2017-03-27 9:27 UTC (permalink / raw)
To: Loic Dachary; +Cc: Sage Weil, Ceph Development
Hi,
My understanding is that optimal tweaked weights will depend on:
1) pool_id, because of rjenkins(pool_id) in crush
2) number of placement groups and replication factor, as they determine
the number of samples
Therefore tweaked weights should rather be a property of the instantiated pool,
not of the crush placement definition.
If tweaked weights are to be part of the crush definition, then for each
created pool we need to have a separate list of weights.
Is it possible to provide clients with different weights depending on
which pool they want to operate?
Best regards,
Adam
On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as they determine
> the number of samples
>
> Therefore tweaked weights should rather be a property of the instantiated pool,
> not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate?
>
> Best regards,
> Adam
>
>
> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>>
>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>> >> Hello Sage, Loic, Pedro,
>> >>
>> >>
>> >> I am certain that almost perfect mapping can be achieved by
>> >> substituting weights from crush map with slightly modified weights.
>> >> By perfect mapping I mean we get on each OSD number of PGs exactly
>> >> proportional to weights specified in crush map.
>> >>
>> >> 1. Example
>> >> Lets think of PGs of single object pool.
>> >> We have OSDs with following weights:
>> >> [10, 10, 10, 5, 5]
>> >>
>> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
>> >> PGcopies :
>> >> [150, 150, 150, 75, 75]
>> >>
>> >> However, because crush simulates a random process we have:
>> >> [143, 152, 158, 71, 76]
>> >>
>> >> We could have obtained perfect distribution had we used weights like
>> >> this:
>> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>> >>
>> >>
>> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>> >>
>> >> When we apply crush for the first time, distribution of PGs comes as
>> >> random.
>> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>> >>
>> >> But CRUSH is not a random process at all; it behaves in a numerically
>> >> stable way.
>> >> Specifically, if we increase weight on one node, we will get more PGs
>> >> on
>> >> this node and less on every other node:
>> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>> >>
>> >> Now, finding ideal weights can be done by any numerical minimization
>> >> method,
>> >> for example NLMS.
>> >>
>> >>
>> >> 3. The proposal
>> >> For each pool, perfect weights will be derived from the initial weights
>> >> given in the crush map.
>> >> These weights will be used to calculate the PG distribution. This of
>> >> course will be close to perfect.
>> >>
>> >> 3a: Downside when OSD is out
>> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>> >> Because the weights now deviate from OSD capacity, some OSDs will
>> >> statistically get more copies than they should.
>> >> This unevenness in distribution is proportional to how far the
>> >> calculated weights deviate from the capacity weights.
>> >>
>> >> 3b: Upside
>> >> This all can be achieved without changes to crush.
>> >
>> > Yes!
>> >
>> > And no. You're totally right--we should use an offline optimization to
>> > tweak the crush input weights to get a better balance. It won't be
>> > robust
>> > to changes to the cluster, but we can incrementally optimize after that
>> > happens to converge on something better.
>> >
>> > The problem with doing this with current versions of Ceph is that we
>> > lose
>> > the original "input" or "target" weights (i.e., the actual size of
>> > the OSD) that we want to converge on. This is one reason why we haven't
>> > done something like this before.
>> >
>> > In luminous we *could* work around this by storing those canonical
>> > weights outside of crush using something (probably?) ugly and
>> > maintain backward compatibility with older clients using existing
>> > CRUSH behavior.
>>
>> These canonical weights could be stored in crush by creating dedicated
>> buckets. For instance a root-canonical bucket could be created to store
>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>> the difference and know to add a new device to the host01-canonical bucket
>> instead of the host01 bucket, and to run an offline tool that keeps the two
>> buckets in sync and computes the weights to use for placement from the
>> weights representing the device capacity.
>>
>> It is a little bit ugly ;-)
>>
>> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>> > that Pedro is working on works out, we'll want to extend the CRUSH map
>> > to
>> > include a set of derivative weights used for actual placement
>> > calculations
>> > instead of the canonical target weights, and we can do what you're
>> > proposing *and* solve the multipick problem with one change in the crush
>> > map and algorithm. (Actually choosing those derivative weights will
>> > be an offline process that can both improve the balance for the inputs
>> > we
>> > care about *and* adjust them based on the position to fix the skew issue
>> > for replicas.) This doesn't help pre-luminous clients, but I think the
>> > end solution will be simpler and more elegant...
>> >
>> > What do you think?
>> >
>> > sage
>> >
>> >
>> >> 4. Extra
>> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
>> >> cluster.
>> >> It succeeded.
>> >> The solution was not accepted, because modifications of OSD weights were
>> >> higher than 50%, caused by the fact that different placement rules
>> >> operated on different sets of OSDs, and those sets were not disjoint.
>> >
>> >
>> >>
>> >> Best regards,
>> >> Adam
>> >>
>> >>
>> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>> >> Hi Pedro, Loic,
>> >>
>> >> For what it's worth, my intuition here (which has had a mixed
>> >> record as
>> >> far as CRUSH goes) is that this is the most promising path
>> >> forward.
>> >>
>> >> Thinking ahead a few steps, and confirming that I'm following
>> >> the discussion so far: if you're able to get black- (or white-) box
>> >> gradient descent to work, then this will give us a set of weights
>> >> for each item in the tree for each selection round, derived from
>> >> the tree structure and original (target) weights. That would
>> >> basically give us a map of item id (bucket id or leaf item id) to
>> >> weight for each round. i.e.,
>> >>
>> >> map<int, map<int, float>> weight_by_position; // position ->
>> >> item -> weight
>> >>
>> >> where the 0 round would (I think?) match the target weights, and
>> >> each
>> >> round after that would skew low-weighted items lower to some
>> >> degree.
>> >> Right?
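[Editor's note: the `map<int, map<int, float>>` above can be exercised with a toy lookup table. This is only an illustrative sketch; the table values, the fallback rule, and all names here are assumptions for illustration, not libcrush code.]

```python
# Illustrative per-position weight table: position -> {item_id -> weight}.
# Position 0 carries the original target weights; later positions hold
# adjusted weights that skew low-weighted items lower, as discussed above.
weight_by_position = {
    0: {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.1},
    1: {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.02},  # hypothetical round-1 skew
}

def effective_weight(position, item):
    """Weight used when choosing the replica at `position`.

    Falls back to the round-0 (target) weights when no adjusted set
    exists for this position, and to 0.0 for unknown items.
    """
    table = weight_by_position.get(position, weight_by_position[0])
    return table.get(item, 0.0)

# first pick uses the target weight, second pick the adjusted one
print(effective_weight(0, 4), effective_weight(1, 4))
```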
>> >>
>> >> The next question I have is: does this generalize from the
>> >> single-bucket
>> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
>> >> like
>> >>
>> >> 3.1
>> >> |_____________
>> >> | \ \ \
>> >> 1.0 1.0 1.0 .1
>> >>
>> >> it clearly works, but when we have a multi-level tree like
>> >>
>> >>
>> >> 8.4
>> >> |____________________________________
>> >> | \ \
>> >> 3.1 3.1 2.2
>> >> |_____________ |_____________ |_____________
>> >> | \ \ \ | \ \ \ | \ \ \
>> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>> >>
>> >> and the second round weights skew the small .1 leaves lower, can
>> >> we
>> >> continue to build the summed-weight hierarchy, such that the
>> >> adjusted
>> >> weights at the higher level are appropriately adjusted to give
>> >> us the
>> >> right probabilities of descending into those trees? I'm not
>> >> sure if that
>> >> logically follows from the above or if my intuition is
>> >> oversimplifying
>> >> things.
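[Editor's note: the mechanical half of the question can at least be sketched. Once the leaves of a round have been adjusted, every bucket weight can be recomputed bottom-up as the sum of its children. A minimal sketch, assuming nested lists as the tree representation rather than the actual crush bucket structures:]

```python
def rollup(tree):
    """Return (bucket_weight, annotated_tree) for a nested-list tree.

    A leaf is a plain number; a bucket is a list of subtrees. Each
    bucket's weight becomes the sum of its (possibly adjusted) children,
    so the probability of descending into a subtree stays proportional
    to the adjusted mass underneath it.
    """
    if not isinstance(tree, list):
        return float(tree), float(tree)
    child_results = [rollup(child) for child in tree]
    total = sum(weight for weight, _ in child_results)
    return total, (total, [annotated for _, annotated in child_results])

# The multi-level example from the message, with unadjusted leaves:
tree = [[1.0, 1.0, 1.0, 0.1], [1.0, 1.0, 1.0, 0.1], [1.0, 1.0, 0.1, 0.1]]
total, annotated = rollup(tree)  # inner buckets sum to 3.1 / 3.1 / 2.2

# After a hypothetical second-round skew of the small .1 leaves, the
# bucket sums shift, changing the odds of descending into each subtree:
skewed = [[1.0, 1.0, 1.0, 0.02], [1.0, 1.0, 1.0, 0.02],
          [1.0, 1.0, 0.02, 0.02]]
skewed_total, _ = rollup(skewed)
```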
>> >>
>> >> If this *is* how we think this will shake out, then I'm wondering
>> >> if we should go ahead and build this weight matrix into CRUSH
>> >> sooner rather than later (i.e., for luminous). As with the
>> >> explicit remappings, the hard part is all done offline, and the
>> >> adjustments to the CRUSH mapping calculation itself (storing and
>> >> making use of the adjusted weights for each round of placement)
>> >> are relatively straightforward. And the sooner this is
>> >> incorporated into a release, the sooner real users will be able to
>> >> roll out code to all clients and start making use of it.
>> >>
>> >> Thanks again for looking at this problem! I'm excited that we
>> >> may be
>> >> closing in on a real solution!
>> >>
>> >> sage
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>> >>
>> >> > There are a lot of gradient-free methods. I will first try to run
>> >> > the ones available using just scipy
>> >> >
>> >>
>> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>> >> > Some of them don't require the gradient and some of them can
>> >> estimate
>> >> > it. The reason to go without the gradient is to run the CRUSH
>> >> > algorithm as a black box. In that case this would be the
>> >> pseudo-code:
>> >> >
>> >> > - BEGIN CODE -
>> >> > def build_target(desired_freqs):
>> >> >     def target(weights):
>> >> >         # run a simulation of CRUSH for a number of objects
>> >> >         sim_freqs = run_crush(weights)
>> >> >         # Kullback-Leibler divergence between desired frequencies
>> >> >         # and current ones
>> >> >         return loss(sim_freqs, desired_freqs)
>> >> >     return target
>> >> >
>> >> > res = scipy.optimize.minimize(build_target(desired_freqs), initial_weights)
>> >> > weights = res.x
>> >> > - END CODE -
>> >> >
>> >> > The tricky thing here is that this procedure can be slow if the
>> >> > simulation (run_crush) needs to place a lot of objects to get
>> >> > accurate simulated frequencies. This is especially true if the
>> >> > minimize method attempts to approximate the gradient using finite
>> >> > differences, since it will evaluate the target function a number
>> >> > of times proportional to the number of weights. Apart from the
>> >> > ones in scipy I would also try optimization methods that try to
>> >> > perform as few evaluations as possible, for example HyperOpt
>> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
>> >> > into account that the target function can be noisy.
>> >> >
>> >> > This black box approximation is simple to implement and makes the
>> >> > computer do all the work instead of us. I think that this black
>> >> > box approximation is worth trying even if it's not the final one,
>> >> > because if this approximation works then we know that a more
>> >> > elaborate one that computes the gradient of the CRUSH algorithm
>> >> > will work for sure.
>> >> >
>> >> > I can try this black box approximation this weekend not on the
>> >> real
>> >> > CRUSH algorithm but with the simple implementation I did in
>> >> python. If
>> >> > it works it's just a matter of substituting one simulation
>> >> with
>> >> > another and see what happens.
>> >> >
>> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >> > > Hi Pedro,
>> >> > >
>> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>> >> > >> Hi Loic,
>> >> > >>
>> >> > >> From what I see everything seems OK.
>> >> > >
>> >> > > Cool. I'll keep going in this direction then !
>> >> > >
>> >> > >> The interesting thing would be to
>> >> > >> test on some complex mapping. The reason is that
>> >> "CrushPolicyFamily"
>> >> > >> is right now modeling just a single straw bucket not the
>> >> full CRUSH
>> >> > >> algorithm.
>> >> > >
>> >> > > A number of use cases use a single straw bucket, maybe the
>> >> > > majority of them. Even though it does not reflect the full range
>> >> > > of what crush can offer, it could be useful. To be more specific,
>> >> > > a crush map that states "place objects so that there is at most
>> >> > > one replica per host" or "one replica per rack" is common. Such a
>> >> > > crushmap can be reduced to a single straw bucket that contains all
>> >> > > the hosts and, by using the CrushPolicyFamily, we can change the
>> >> > > weights of each host to fix the probabilities. The hosts
>> >> > > themselves contain disks with varying weights but I think we can
>> >> > > ignore that because crush will only recurse to place one object
>> >> > > within a given host.
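[Editor's note: for reference, the single straw bucket that the reduction above ends with fits in a few lines. The sketch below is in the spirit of straw2, with sha256 as a stand-in for the rjenkins hash, so it illustrates the shape of the algorithm rather than reproducing libcrush's draws.]

```python
import hashlib
import math

def _u01(item, x):
    """Deterministic pseudo-uniform value in (0, 1] per (item, input)."""
    h = hashlib.sha256(f"{item}:{x}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / 2.0 ** 64

def straw2_choose(weights, x):
    """Pick the item whose weighted 'straw' is longest for input x.

    draw = ln(u) / w is maximized; a heavier item pulls its (negative)
    draw toward zero, so item i wins with probability w_i / sum(w).
    """
    best_item, best_draw = None, -math.inf
    for item, w in enumerate(weights):
        if w <= 0:
            continue
        draw = math.log(_u01(item, x)) / w
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item

# First-pick frequencies track the weights in expectation:
weights = [10, 10, 10, 5, 5]
counts = [0] * len(weights)
for x in range(20000):
    counts[straw2_choose(weights, x)] += 1
```

A nice property of this form is that changing one item's weight only changes mappings that involve that item, which is exactly the stability argument for straw buckets.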
>> >> > >
>> >> > >> That's the work that remains to be done. The only way that
>> >> > >> would avoid reimplementing the CRUSH algorithm and
>> >> computing the
>> >> > >> gradient would be treating CRUSH as a black box and
>> >> eliminating the
>> >> > >> necessity of computing the gradient either by using a
>> >> gradient-free
>> >> > >> optimization method or making an estimation of the
>> >> gradient.
>> >> > >
>> >> > > By gradient-free optimization you mean simulated annealing
>> >> or Monte Carlo ?
>> >> > >
>> >> > > Cheers
>> >> > >
>> >> > >>
>> >> > >>
>> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >> > >>> Hi,
>> >> > >>>
>> >> > >>> I modified the crush library to accept two weights (one
>> >> for the first disk, the other for the remaining disks)[1]. This
>> >> really is a hack for experimentation purposes only ;-) I was
>> >> able to run a variation of your code[2] and got the following
>> >> results which are encouraging. Do you think what I did is
>> >> sensible ? Or is there a problem I don't see ?
>> >> > >>>
>> >> > >>> Thanks !
>> >> > >>>
>> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
>> >> 6]
>> >> > >>>
>> >>
>> >> ------------------------------------------------------------------------
>> >> > >>> Before: All replicas on each hard drive
>> >> > >>> Expected vs actual use (20000 samples)
>> >> > >>> disk 0: 1.39e-01 1.12e-01
>> >> > >>> disk 1: 1.11e-01 1.10e-01
>> >> > >>> disk 2: 8.33e-02 1.13e-01
>> >> > >>> disk 3: 1.39e-01 1.11e-01
>> >> > >>> disk 4: 1.11e-01 1.11e-01
>> >> > >>> disk 5: 8.33e-02 1.11e-01
>> >> > >>> disk 6: 1.39e-01 1.12e-01
>> >> > >>> disk 7: 1.11e-01 1.12e-01
>> >> > >>> disk 8: 8.33e-02 1.10e-01
>> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>> >> > >>> ...
>> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>> >> > >>> Converged to desired accuracy :)
>> >> > >>> After: All replicas on each hard drive
>> >> > >>> Expected vs actual use (20000 samples)
>> >> > >>> disk 0: 1.39e-01 1.42e-01
>> >> > >>> disk 1: 1.11e-01 1.09e-01
>> >> > >>> disk 2: 8.33e-02 8.37e-02
>> >> > >>> disk 3: 1.39e-01 1.40e-01
>> >> > >>> disk 4: 1.11e-01 1.13e-01
>> >> > >>> disk 5: 8.33e-02 8.08e-02
>> >> > >>> disk 6: 1.39e-01 1.38e-01
>> >> > >>> disk 7: 1.11e-01 1.09e-01
>> >> > >>> disk 8: 8.33e-02 8.48e-02
>> >> > >>>
>> >> > >>>
>> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>> >> > >>>
>> >>
>> >> ------------------------------------------------------------------------
>> >> > >>> Before: All replicas on each hard drive
>> >> > >>> Expected vs actual use (20000 samples)
>> >> > >>> disk 0: 2.44e-01 2.36e-01
>> >> > >>> disk 1: 2.44e-01 2.38e-01
>> >> > >>> disk 2: 2.44e-01 2.34e-01
>> >> > >>> disk 3: 2.44e-01 2.38e-01
>> >> > >>> disk 4: 2.44e-02 5.37e-02
>> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>> >> > >>> ...
>> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>> >> > >>> Converged to desired accuracy :)
>> >> > >>> After: All replicas on each hard drive
>> >> > >>> Expected vs actual use (20000 samples)
>> >> > >>> disk 0: 2.44e-01 2.46e-01
>> >> > >>> disk 1: 2.44e-01 2.44e-01
>> >> > >>> disk 2: 2.44e-01 2.41e-01
>> >> > >>> disk 3: 2.44e-01 2.45e-01
>> >> > >>> disk 4: 2.44e-02 2.33e-02
>> >> > >>>
>> >> > >>>
>> >> > >>> [1] crush hack
>> >> > >>> http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>> >> > >>> [2] python-crush hack
>> >> > >>> http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>> >> > >>>
>> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> >> > >>>> Hi Pedro,
>> >> > >>>>
>> >> > >>>> It looks like trying to experiment with crush won't work
>> >> as expected because crush does not distinguish the probability
>> >> of selecting the first device from the probability of selecting
>> >> the second or third device. Am I mistaken ?
>> >> > >>>>
>> >> > >>>> Cheers
>> >> > >>>>
>> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> >> > >>>>> Hi Pedro,
>> >> > >>>>>
>> >> > >>>>> I'm going to experiment with what you did at
>> >> > >>>>>
>> >> > >>>>>
>> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> >> > >>>>>
>> >> > >>>>> and the latest python-crush published today. A
>> >> comparison function was added that will help measure the data
>> >> movement. I'm hoping we can release an offline tool based on
>> >> your solution. Please let me know if I should wait before diving
>> >> into this, in case you have unpublished drafts or new ideas.
>> >> > >>>>>
>> >> > >>>>> Cheers
>> >> > >>>>>
>> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> >> > >>>>>> Great, thanks for the clarifications.
>> >> > >>>>>> I also think that the most natural way is to keep just
>> >> a set of
>> >> > >>>>>> weights in the CRUSH map and update them inside the
>> >> algorithm.
>> >> > >>>>>>
>> >> > >>>>>> I keep working on it.
>> >> > >>>>>>
>> >> > >>>>>>
>> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>> >> <sage@newdream.net>:
>> >> > >>>>>>> Hi Pedro,
>> >> > >>>>>>>
>> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
>> >> problem and we
>> >> > >>>>>>> haven't made much headway.
>> >> > >>>>>>>
>> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> >> > >>>>>>>> Hi,
>> >> > >>>>>>>>
>> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
>> >> much but I have
>> >> > >>>>>>>> been thinking about it. In order to adapt the
>> >> previous algorithm in
>> >> > >>>>>>>> the python notebook I need to substitute the
>> >> iteration over all
>> >> > >>>>>>>> possible devices permutations to iteration over all
>> >> the possible
>> >> > >>>>>>>> selections that crush would make. That is the main
>> >> thing I need to
>> >> > >>>>>>>> work on.
>> >> > >>>>>>>>
>> >> > >>>>>>>> The other thing is of course that weights change for
>> >> each replica.
>> >> > >>>>>>>> That is, they cannot be really fixed in the crush
>> >> map. So the
>> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
>> >> the map, need to be
>> >> > >>>>>>>> changed. The weights in the crush map should reflect
>> >> then, maybe, the
>> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
>> >> should have their own
>> >> > >>>>>>>> crush map, but then the information about the
>> >> previous selection
>> >> > >>>>>>>> should be passed to the next replica placement run so
>> >> it avoids
>> >> > >>>>>>>> selecting the same one again.
>> >> > >>>>>>>
>> >> > >>>>>>> My suspicion is that the best solution here (whatever
>> >> that means!)
>> >> > >>>>>>> leaves the CRUSH weights intact with the desired
>> >> distribution, and
>> >> > >>>>>>> then generates a set of derivative weights--probably
>> >> one set for each
>> >> > >>>>>>> round/replica/rank.
>> >> > >>>>>>>
>> >> > >>>>>>> One nice property of this is that once the support is
>> >> added to encode
>> >> > >>>>>>> multiple sets of weights, the algorithm used to
>> >> generate them is free to
>> >> > >>>>>>> change and evolve independently. (In most cases any
>> >> change is
>> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>> >> because all
>> >> > >>>>>>> parties participating in the cluster have to support
>> >> any new behavior
>> >> > >>>>>>> before it is enabled or used.)
>> >> > >>>>>>>
>> >> > >>>>>>>> I have a question also. Is there any significant
>> >> difference between
>> >> > >>>>>>>> the device selection algorithm description in the
>> >> paper and its final
>> >> > >>>>>>>> implementation?
>> >> > >>>>>>>
>> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
>> >> found to be a bad
>> >> > >>>>>>> idea; any collision or failed()/overload() case
>> >> triggers the
>> >> > >>>>>>> retry_descent.
>> >> > >>>>>>>
>> >> > >>>>>>> There are other changes, of course, but I don't think
>> >> they'll impact any
>> >> > >>>>>>> solution we come with here (or at least any solution
>> >> can be suitably
>> >> > >>>>>>> adapted)!
>> >> > >>>>>>>
>> >> > >>>>>>> sage
>> >> > >>>>>> --
>> >> > >>>>>> To unsubscribe from this list: send the line
>> >> "unsubscribe ceph-devel" in
>> >> > >>>>>> the body of a message to majordomo@vger.kernel.org
>> >> > >>>>>> More majordomo info at
>> >> http://vger.kernel.org/majordomo-info.html
>> >> > >>>>>>
>> >> > >>>>>
>> >> > >>>>
>> >> > >>>
>> >> > >>> --
>> >> > >>> Loïc Dachary, Artisan Logiciel Libre
>> >> > >>
>> >> > >
>> >> > > --
>> >> > > Loïc Dachary, Artisan Logiciel Libre
>> >> >
>> >> >
>> >>
>> >>
>> >>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-27 9:27 ` Adam Kupczyk
@ 2017-03-27 10:29 ` Loic Dachary
2017-03-27 10:37 ` Pedro López-Adeva
2017-03-27 13:39 ` Sage Weil
2 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-27 10:29 UTC (permalink / raw)
To: Adam Kupczyk; +Cc: Sage Weil, Ceph Development
Hi Adam,
On 03/27/2017 11:27 AM, Adam Kupczyk wrote:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines
> amount of samples
>
> Therefore tweaked weights should rather be a property of the instantiated
> pool, not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
This could be achieved by creating a bucket tree for each pool. There is a hack doing that at http://libcrush.org/main/python-crush/merge_requests/40/diffs and we can hopefully get something usable for the sysadmin (see http://libcrush.org/main/python-crush/issues/13). This is however different from fixing the crush multipick anomaly; it is primarily useful when there are not enough samples to get an even distribution.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate?
>
> Best regards,
> Adam
>
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>> Hi,
>>
>> My understanding is that optimal tweaked weights will depend on:
>> 1) pool_id, because of rjenkins(pool_id) in crush
>> 2) number of placement groups and replication factor, as it determines
>> amount of samples
>>
>> Therefore tweaked weights should rather be a property of the instantiated
>> pool, not of the crush placement definition.
>>
>> If tweaked weights are to be part of the crush definition, then for each
>> created pool we need to have a separate list of weights.
>> Is it possible to provide clients with different weights depending on
>> which pool they want to operate?
>>
>> Best regards,
>> Adam
>>
>>
>> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>
>>>
>>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>>> On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>>>> Hello Sage, Loic, Pedro,
>>>>>
>>>>>
>>>>> I am certain that an almost perfect mapping can be achieved by
>>>>> substituting the weights from the crush map with slightly modified
>>>>> weights.
>>>>> By a perfect mapping I mean that each OSD gets a number of PGs exactly
>>>>> proportional to the weights specified in the crush map.
>>>>>
>>>>> 1. Example
>>>>> Let's think of the PGs of a single object pool.
>>>>> We have OSDs with following weights:
>>>>> [10, 10, 10, 5, 5]
>>>>>
>>>>> Ideally, we would like following distribution of 200PG x 3 copies = 600
>>>>> PGcopies :
>>>>> [150, 150, 150, 75, 75]
>>>>>
>>>>> However, because crush simulates a random process, we have:
>>>>> [143, 152, 158, 71, 76]
>>>>>
>>>>> We could have obtained perfect distribution had we used weights like
>>>>> this:
>>>>> [10.2, 9.9, 9.6, 5.2, 4.9]
>>>>>
>>>>>
>>>>> 2. Obtaining perfect mapping weights from OSD capacity weights
>>>>>
>>>>> When we apply crush for the first time, the distribution of PGs comes
>>>>> out as random.
>>>>> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>>>>
>>>>> But CRUSH is not a random process at all; it behaves in a numerically
>>>>> stable way.
>>>>> Specifically, if we increase weight on one node, we will get more PGs
>>>>> on
>>>>> this node and less on every other node:
>>>>> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>>>>
>>>>> Now, finding ideal weights can be done by any numerical minimization
>>>>> method,
>>>>> for example NLMS.
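[Editor's note: the loop Adam sketches can be illustrated end to end. Here the placement simulator is a toy weighted sampler rather than CRUSH, and the update is a crude damped multiplicative step rather than actual NLMS, so every name and constant below is an assumption for illustration.]

```python
import random

def simulate(weights, pgs=200, replicas=3, seed=0):
    """Toy placement: weighted choice of `replicas` distinct OSDs per PG."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(pgs):
        chosen = set()
        while len(chosen) < replicas:
            i = rng.choices(range(len(weights)), weights=weights)[0]
            if i not in chosen:
                chosen.add(i)
                counts[i] += 1
    return counts

def tweak_weights(target_weights, iterations=50):
    """Nudge input weights until simulated PG counts track the targets."""
    total_target = sum(target_weights)
    weights = list(target_weights)
    for _ in range(iterations):
        counts = simulate(weights)
        total = sum(counts)
        for i, tw in enumerate(target_weights):
            desired = total * tw / total_target
            if counts[i] > 0:
                # damped multiplicative step toward the desired count
                weights[i] *= (desired / counts[i]) ** 0.5
    return weights

# [10, 10, 10, 5, 5] should ideally yield [150, 150, 150, 75, 75] copies
tweaked = tweak_weights([10, 10, 10, 5, 5])
```

The damping exponent (0.5) keeps the iteration from overshooting, which is the same stability property Adam relies on when he notes that increasing one weight shifts PGs smoothly.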
>>>>>
>>>>>
>>>>> 3. The proposal
>>>>> For each pool, perfect weights will be derived from the initial
>>>>> weights given in the crush map.
>>>>> These weights will be used to calculate the PG distribution, which of
>>>>> course will be close to perfect.
>>>>>
>>>>> 3a: Downside when OSD is out
>>>>> When an OSD is out, missing PG copies will be replicated elsewhere.
>>>>> Because the weights now deviate from OSD capacity, some OSDs will
>>>>> statistically get more copies than they should.
>>>>> This unevenness in the distribution is proportional to how far the
>>>>> calculated weights deviate from the capacity weights.
>>>>>
>>>>> 3b: Upside
>>>>> This all can be achieved without changes to crush.
>>>>
>>>> Yes!
>>>>
>>>> And no. You're totally right--we should use an offline optimization to
>>>> tweak the crush input weights to get a better balance. It won't be
>>>> robust
>>>> to changes to the cluster, but we can incrementally optimize after that
>>>> happens to converge on something better.
>>>>
>>>> The problem with doing this with current versions of Ceph is that we
>>>> lose
>>>> the original "input" or "target" weights (i.e., the actual size of
>>>> the OSD) that we want to converge on. This is one reason why we haven't
>>>> done something like this before.
>>>>
>>>> In luminous we *could* work around this by storing those canonical
>>>> weights outside of crush using something (probably?) ugly and
>>>> maintain backward compatibility with older clients using existing
>>>> CRUSH behavior.
>>>
>>> These canonical weights could be stored in crush by creating dedicated
>>> buckets. For instance the root-canonical bucket could be created to store
>>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> the difference and know to add a new device in the host01-canonical bucket
>>> instead of the host01 bucket. And to run an offline tool to keep the two
>>> buckets in sync and compute the weight to use for placement derived from the
>>> weights representing the device capacity.
>>>
>>> It is a little bit ugly ;-)
>>>
>>>> OR, (and this is my preferred route), if the multi-pick anomaly approach
>>>> that Pedro is working on works out, we'll want to extend the CRUSH map
>>>> to
>>>> include a set of derivative weights used for actual placement
>>>> calculations
>>>> instead of the canonical target weights, and we can do what you're
>>>> proposing *and* solve the multipick problem with one change in the crush
>>>> map and algorithm. (Actually choosing those derivative weights will
>>>> be an offline process that can both improve the balance for the inputs
>>>> we
>>>> care about *and* adjust them based on the position to fix the skew issue
>>>> for replicas.) This doesn't help pre-luminous clients, but I think the
>>>> end solution will be simpler and more elegant...
>>>>
>>>> What do you think?
>>>>
>>>> sage
>>>>
>>>>
>>>>> 4. Extra
>>>>> Some time ago I made such a change to perfectly balance a
>>>>> Thomson-Reuters cluster.
>>>>> It succeeded.
>>>>> The solution was not accepted, because the modifications of OSD
>>>>> weights were higher than 50%, which was caused by the fact that
>>>>> different placement rules operated on different sets of OSDs, and
>>>>> those sets were not disjoint.
>>>>
>>>>
>>>>>
>>>>> Best regards,
>>>>> Adam
>>>>>
>>>>>
>>>>> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>>>> Hi Pedro, Loic,
>>>>>
>>>>> For what it's worth, my intuition here (which has had a mixed
>>>>> record as
>>>>> far as CRUSH goes) is that this is the most promising path
>>>>> forward.
>>>>>
>>>>> Thinking ahead a few steps, and confirming that I'm following
>>>>> the discussion so far: if you're able to get black- (or white-) box
>>>>> gradient descent to work, then this will give us a set of weights
>>>>> for each item in the tree for each selection round, derived from
>>>>> the tree structure and original (target) weights. That would
>>>>> basically give us a map of item id (bucket id or leaf item id) to
>>>>> weight for each round. i.e.,
>>>>>
>>>>> map<int, map<int, float>> weight_by_position; // position ->
>>>>> item -> weight
>>>>>
>>>>> where the 0 round would (I think?) match the target weights, and
>>>>> each
>>>>> round after that would skew low-weighted items lower to some
>>>>> degree.
>>>>> Right?
>>>>>
>>>>> The next question I have is: does this generalize from the
>>>>> single-bucket
>>>>> case to the hierarchy? I.e., if I have a "tree" (single bucket)
>>>>> like
>>>>>
>>>>> 3.1
>>>>> |_____________
>>>>> | \ \ \
>>>>> 1.0 1.0 1.0 .1
>>>>>
>>>>> it clearly works, but when we have a multi-level tree like
>>>>>
>>>>>
>>>>> 8.4
>>>>> |____________________________________
>>>>> | \ \
>>>>> 3.1 3.1 2.2
>>>>> |_____________ |_____________ |_____________
>>>>> | \ \ \ | \ \ \ | \ \ \
>>>>> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>>>>>
>>>>> and the second round weights skew the small .1 leaves lower, can
>>>>> we
>>>>> continue to build the summed-weight hierarchy, such that the
>>>>> adjusted
>>>>> weights at the higher level are appropriately adjusted to give
>>>>> us the
>>>>> right probabilities of descending into those trees? I'm not
>>>>> sure if that
>>>>> logically follows from the above or if my intuition is
>>>>> oversimplifying
>>>>> things.
>>>>>
>>>>> If this *is* how we think this will shake out, then I'm wondering
>>>>> if we should go ahead and build this weight matrix into CRUSH
>>>>> sooner rather than later (i.e., for luminous). As with the
>>>>> explicit remappings, the hard part is all done offline, and the
>>>>> adjustments to the CRUSH mapping calculation itself (storing and
>>>>> making use of the adjusted weights for each round of placement)
>>>>> are relatively straightforward. And the sooner this is
>>>>> incorporated into a release, the sooner real users will be able to
>>>>> roll out code to all clients and start making use of it.
>>>>>
>>>>> Thanks again for looking at this problem! I'm excited that we
>>>>> may be
>>>>> closing in on a real solution!
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>>>>
>>>>> > There are a lot of gradient-free methods. I will first try to run
>>>>> > the ones available using just scipy
>>>>> >
>>>>>
>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>> > Some of them don't require the gradient and some of them can
>>>>> estimate
>>>>> > it. The reason to go without the gradient is to run the CRUSH
>>>>> > algorithm as a black box. In that case this would be the
>>>>> pseudo-code:
>>>>> >
>>>>> > - BEGIN CODE -
>>>>> > def build_target(desired_freqs):
>>>>> >     def target(weights):
>>>>> >         # run a simulation of CRUSH for a number of objects
>>>>> >         sim_freqs = run_crush(weights)
>>>>> >         # Kullback-Leibler divergence between desired frequencies
>>>>> >         # and current ones
>>>>> >         return loss(sim_freqs, desired_freqs)
>>>>> >     return target
>>>>> >
>>>>> > res = scipy.optimize.minimize(build_target(desired_freqs), initial_weights)
>>>>> > weights = res.x
>>>>> > - END CODE -
>>>>> >
>>>>> > The tricky thing here is that this procedure can be slow if the
>>>>> > simulation (run_crush) needs to place a lot of objects to get
>>>>> > accurate simulated frequencies. This is especially true if the
>>>>> > minimize method attempts to approximate the gradient using finite
>>>>> > differences, since it will evaluate the target function a number
>>>>> > of times proportional to the number of weights. Apart from the
>>>>> > ones in scipy I would also try optimization methods that try to
>>>>> > perform as few evaluations as possible, for example HyperOpt
>>>>> > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>>>> > into account that the target function can be noisy.
>>>>> >
>>>>> > This black box approximation is simple to implement and makes the
>>>>> > computer do all the work instead of us. I think that this black
>>>>> > box approximation is worth trying even if it's not the final one,
>>>>> > because if this approximation works then we know that a more
>>>>> > elaborate one that computes the gradient of the CRUSH algorithm
>>>>> > will work for sure.
>>>>> >
>>>>> > I can try this black box approximation this weekend not on the
>>>>> real
>>>>> > CRUSH algorithm but with the simple implementation I did in
>>>>> python. If
>>>>> > it works it's just a matter of substituting one simulation
>>>>> with
>>>>> > another and see what happens.
>>>>> >
>>>>> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> > > Hi Pedro,
>>>>> > >
>>>>> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>> > >> Hi Loic,
>>>>> > >>
>>>>> > >> From what I see everything seems OK.
>>>>> > >
>>>>> > > Cool. I'll keep going in this direction then !
>>>>> > >
>>>>> > >> The interesting thing would be to
>>>>> > >> test on some complex mapping. The reason is that
>>>>> "CrushPolicyFamily"
>>>>> > >> is right now modeling just a single straw bucket not the
>>>>> full CRUSH
>>>>> > >> algorithm.
>>>>> > >
>>>>> > > A number of use cases use a single straw bucket, maybe the
>>>>> > > majority of them. Even though it does not reflect the full range
>>>>> > > of what crush can offer, it could be useful. To be more specific,
>>>>> > > a crush map that states "place objects so that there is at most
>>>>> > > one replica per host" or "one replica per rack" is common. Such a
>>>>> > > crushmap can be reduced to a single straw bucket that contains all
>>>>> > > the hosts and, by using the CrushPolicyFamily, we can change the
>>>>> > > weights of each host to fix the probabilities. The hosts
>>>>> > > themselves contain disks with varying weights but I think we can
>>>>> > > ignore that because crush will only recurse to place one object
>>>>> > > within a given host.
>>>>> > >
>>>>> > >> That's the work that remains to be done. The only way that
>>>>> > >> would avoid reimplementing the CRUSH algorithm and
>>>>> computing the
>>>>> > >> gradient would be treating CRUSH as a black box and
>>>>> eliminating the
>>>>> > >> necessity of computing the gradient either by using a
>>>>> gradient-free
>>>>> > >> optimization method or making an estimation of the
>>>>> gradient.
>>>>> > >
>>>>> > > By gradient-free optimization you mean simulated annealing
>>>>> or Monte Carlo ?
>>>>> > >
>>>>> > > Cheers
>>>>> > >
>>>>> > >>
>>>>> > >>
>>>>> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> > >>> Hi,
>>>>> > >>>
>>>>> > >>> I modified the crush library to accept two weights (one
>>>>> for the first disk, the other for the remaining disks)[1]. This
>>>>> really is a hack for experimentation purposes only ;-) I was
>>>>> able to run a variation of your code[2] and got the following
>>>>> results which are encouraging. Do you think what I did is
>>>>> sensible ? Or is there a problem I don't see ?
>>>>> > >>>
>>>>> > >>> Thanks !
>>>>> > >>>
>>>>> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
>>>>> 6]
>>>>> > >>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> > >>> Before: All replicas on each hard drive
>>>>> > >>> Expected vs actual use (20000 samples)
>>>>> > >>> disk 0: 1.39e-01 1.12e-01
>>>>> > >>> disk 1: 1.11e-01 1.10e-01
>>>>> > >>> disk 2: 8.33e-02 1.13e-01
>>>>> > >>> disk 3: 1.39e-01 1.11e-01
>>>>> > >>> disk 4: 1.11e-01 1.11e-01
>>>>> > >>> disk 5: 8.33e-02 1.11e-01
>>>>> > >>> disk 6: 1.39e-01 1.12e-01
>>>>> > >>> disk 7: 1.11e-01 1.12e-01
>>>>> > >>> disk 8: 8.33e-02 1.10e-01
>>>>> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>> > >>> ...
>>>>> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>> > >>> Converged to desired accuracy :)
>>>>> > >>> After: All replicas on each hard drive
>>>>> > >>> Expected vs actual use (20000 samples)
>>>>> > >>> disk 0: 1.39e-01 1.42e-01
>>>>> > >>> disk 1: 1.11e-01 1.09e-01
>>>>> > >>> disk 2: 8.33e-02 8.37e-02
>>>>> > >>> disk 3: 1.39e-01 1.40e-01
>>>>> > >>> disk 4: 1.11e-01 1.13e-01
>>>>> > >>> disk 5: 8.33e-02 8.08e-02
>>>>> > >>> disk 6: 1.39e-01 1.38e-01
>>>>> > >>> disk 7: 1.11e-01 1.09e-01
>>>>> > >>> disk 8: 8.33e-02 8.48e-02
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>> > >>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> > >>> Before: All replicas on each hard drive
>>>>> > >>> Expected vs actual use (20000 samples)
>>>>> > >>> disk 0: 2.44e-01 2.36e-01
>>>>> > >>> disk 1: 2.44e-01 2.38e-01
>>>>> > >>> disk 2: 2.44e-01 2.34e-01
>>>>> > >>> disk 3: 2.44e-01 2.38e-01
>>>>> > >>> disk 4: 2.44e-02 5.37e-02
>>>>> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>> > >>> ...
>>>>> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>> > >>> Converged to desired accuracy :)
>>>>> > >>> After: All replicas on each hard drive
>>>>> > >>> Expected vs actual use (20000 samples)
>>>>> > >>> disk 0: 2.44e-01 2.46e-01
>>>>> > >>> disk 1: 2.44e-01 2.44e-01
>>>>> > >>> disk 2: 2.44e-01 2.41e-01
>>>>> > >>> disk 3: 2.44e-01 2.45e-01
>>>>> > >>> disk 4: 2.44e-02 2.33e-02
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>> > >>>
>>>>> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>> > >>>> Hi Pedro,
>>>>> > >>>>
>>>>> > >>>> It looks like trying to experiment with crush won't work
>>>>> as expected because crush does not distinguish the probability
>>>>> of selecting the first device from the probability of selecting
>>>>> the second or third device. Am I mistaken ?
>>>>> > >>>>
>>>>> > >>>> Cheers
>>>>> > >>>>
>>>>> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>> > >>>>> Hi Pedro,
>>>>> > >>>>>
>>>>> > >>>>> I'm going to experiment with what you did at
>>>>> > >>>>>
>>>>> > >>>>>
>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>> > >>>>>
>>>>> > >>>>> and the latest python-crush published today. A
>>>>> comparison function was added that will help measure the data
>>>>> movement. I'm hoping we can release an offline tool based on
>>>>> your solution. Please let me know if I should wait before diving
>>>>> into this, in case you have unpublished drafts or new ideas.
>>>>> > >>>>>
>>>>> > >>>>> Cheers
>>>>> > >>>>>
>>>>> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>> > >>>>>> Great, thanks for the clarifications.
>>>>> > >>>>>> I also think that the most natural way is to keep just
>>>>> a set of
>>>>> > >>>>>> weights in the CRUSH map and update them inside the
>>>>> algorithm.
>>>>> > >>>>>>
>>>>> > >>>>>> I keep working on it.
>>>>> > >>>>>>
>>>>> > >>>>>>
>>>>> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>>>> <sage@newdream.net>:
>>>>> > >>>>>>> Hi Pedro,
>>>>> > >>>>>>>
>>>>> > >>>>>>> Thanks for taking a look at this! It's a frustrating
>>>>> problem and we
>>>>> > >>>>>>> haven't made much headway.
>>>>> > >>>>>>>
>>>>> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>> > >>>>>>>> Hi,
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> I will have a look. BTW, I have not progressed that
>>>>> much but I have
>>>>> > >>>>>>>> been thinking about it. In order to adapt the
>>>>> previous algorithm in
>>>>> > >>>>>>>> the python notebook I need to replace the iteration over all
>>>>> > >>>>>>>> possible device permutations with iteration over all the possible
>>>>> > >>>>>>>> selections that crush would make. That is the main
>>>>> thing I need to
>>>>> > >>>>>>>> work on.
>>>>> > >>>>>>>>
>>>>> > >>>>>>>> The other thing is of course that weights change for
>>>>> each replica.
>>>>> > >>>>>>>> That is, they cannot be really fixed in the crush
>>>>> map. So the
>>>>> > >>>>>>>> algorithm inside libcrush, not only the weights in
>>>>> the map, need to be
>>>>> > >>>>>>>> changed. The weights in the crush map should reflect
>>>>> then, maybe, the
>>>>> > >>>>>>>> desired usage frequencies. Or maybe each replica
>>>>> should have their own
>>>>> > >>>>>>>> crush map, but then the information about the
>>>>> previous selection
>>>>> > >>>>>>>> should be passed to the next replica placement run so
>>>>> it avoids
>>>>> > >>>>>>>> selecting the same one again.
>>>>> > >>>>>>>
>>>>> > >>>>>>> My suspicion is that the best solution here (whatever
>>>>> that means!)
>>>>> > >>>>>>> leaves the CRUSH weights intact with the desired
>>>>> distribution, and
>>>>> > >>>>>>> then generates a set of derivative weights--probably
>>>>> one set for each
>>>>> > >>>>>>> round/replica/rank.
>>>>> > >>>>>>>
>>>>> > >>>>>>> One nice property of this is that once the support is
>>>>> added to encode
>>>>> > >>>>>>> multiple sets of weights, the algorithm used to
>>>>> generate them is free to
>>>>> > >>>>>>> change and evolve independently. (In most cases any
>>>>> change in
>>>>> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>>>> because all
>>>>> > >>>>>>> parties participating in the cluster have to support
>>>>> any new behavior
>>>>> > >>>>>>> before it is enabled or used.)
>>>>> > >>>>>>>
>>>>> > >>>>>>>> I have a question also. Is there any significant
>>>>> difference between
>>>>> > >>>>>>>> the device selection algorithm description in the
>>>>> paper and its final
>>>>> > >>>>>>>> implementation?
>>>>> > >>>>>>>
>>>>> > >>>>>>> The main difference is the "retry_bucket" behavior was
>>>>> found to be a bad
>>>>> > >>>>>>> idea; any collision or failed()/overload() case
>>>>> triggers the
>>>>> > >>>>>>> retry_descent.
>>>>> > >>>>>>>
>>>>> > >>>>>>> There are other changes, of course, but I don't think
>>>>> they'll impact any
>>>>> > >>>>>>> solution we come with here (or at least any solution
>>>>> can be suitably
>>>>> > >>>>>>> adapted)!
>>>>> > >>>>>>>
>>>>> > >>>>>>> sage
>>>>> > >>>>>> --
>>>>> > >>>>>> To unsubscribe from this list: send the line
>>>>> "unsubscribe ceph-devel" in
>>>>> > >>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> > >>>>>> More majordomo info at
>>>>> http://vger.kernel.org/majordomo-info.html
>>>>> > >>>>>>
>>>>> > >>>>>
>>>>> > >>>>
>>>>> > >>>
>>>>> > >>> --
>>>>> > >>> Loïc Dachary, Artisan Logiciel Libre
>>>>> > >>
>>>>> > >
>>>>> > > --
>>>>> > > Loïc Dachary, Artisan Logiciel Libre
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-27 9:27 ` Adam Kupczyk
2017-03-27 10:29 ` Loic Dachary
@ 2017-03-27 10:37 ` Pedro López-Adeva
2017-03-27 13:39 ` Sage Weil
2 siblings, 0 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-27 10:37 UTC (permalink / raw)
To: Adam Kupczyk; +Cc: Loic Dachary, Sage Weil, Ceph Development
I have performed some tests as I said using the black box method.
Remember that I have not used the real CRUSH algorithm.
The idea was to compare the results against the white box with
gradient information.
I have used two replicas and 9 disks with the following capacities:
10, 8, 6, 10, 8, 6, 10, 8, 6
Contrary to what I expected, scipy.optimize didn't give good results, I
think because the target function is noisy. This at least made the method I
tried (SLSQP) return non-success, so I switched to HyperOpt.
The first result is that, of course, the method is much slower. This was expected:
Time using jacobian: 0.36s
Time using simulation: 69.19s
But the results I think are OK. The following columns show the
parameters (weights for the second replica placement) estimated using
the jacobian (white box) vs the simulation method (black box).
jac sim
--------
0.17 0.16
0.11 0.12
0.06 0.06
0.17 0.15
0.11 0.09
0.06 0.08
0.17 0.16
0.11 0.13
0.06 0.06
As you can see the agreement is reasonable.
You can see here the changes I made:
https://github.com/plafl/snippets/commit/ea701d2cffbf3884eab866ce6e2388879e040894
Where to go from here (I think):
1. Perform this same test using the real CRUSH algorithm
2. Improve the method to run faster
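For intuition on what the second-replica weights above are approximating, here is a small sketch (not the real CRUSH algorithm, just two weighted picks without replacement) that checks the conditional adjustment w_i / (W - w_i) discussed earlier in this thread by exact enumeration rather than sampling, so there is no simulation noise:

```python
# Exact enumeration for R=2 picks without replacement (a simplified
# stand-in for CRUSH). We compare the naive second pick (same weights
# as the first) against the conditional adjustment w_i / (W - w_i).

def second_pick_dist(weights, second_weights):
    """Exact P(second pick = j) when the first pick uses `weights` and
    the second pick uses `second_weights`, renormalized over the
    devices that remain after the first pick."""
    total = sum(weights)
    n = len(weights)
    dist = [0.0] * n
    for i in range(n):                      # first pick
        p_first = weights[i] / total
        rem = sum(second_weights) - second_weights[i]
        for j in range(n):                  # second pick, j != i
            if j != i:
                dist[j] += p_first * second_weights[j] / rem
    return dist

weights = [10.0, 10.0, 10.0, 10.0, 1.0]
W = sum(weights)
target = [w / W for w in weights]

naive = second_pick_dist(weights, weights)
adjusted = second_pick_dist(weights, [w / (W - w) for w in weights])

err_naive = max(abs(a - b) for a, b in zip(naive, target))
err_adj = max(abs(a - b) for a, b in zip(adjusted, target))
print(err_naive, err_adj)
```

With the [10 10 10 10 1] capacities used above, the adjusted second-round distribution lands much closer to the target than the naive one, which matches the "almost perfectly" caveat: the adjustment is not exact for the general case, only very close.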
2017-03-27 11:27 GMT+02:00 Adam Kupczyk <akupczyk@mirantis.com>:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines
> amount of samples
>
> Therefore tweaked weights should rather be a property of the instantiated
> pool, not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate on?
>
> Best regards,
> Adam
>
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>>
>> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>
>>>
>>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>> >> Hello Sage, Loic, Pedro,
>>> >>
>>> >>
>>> >> I am certain that almost perfect mapping can be achieved by
>>> >> substituting weights from crush map with slightly modified weights.
>>> >> By perfect mapping I mean we get on each OSD a number of PGs exactly
>>> >> proportional to the weights specified in the crush map.
>>> >>
>>> >> 1. Example
>>> >> Let's think of PGs of a single object pool.
>>> >> We have OSDs with following weights:
>>> >> [10, 10, 10, 5, 5]
>>> >>
>>> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
>>> >> PGcopies :
>>> >> [150, 150, 150, 75, 75]
>>> >>
>>> >> However, because crush simulates a random process, we have:
>>> >> [143, 152, 158, 71, 76]
>>> >>
>>> >> We could have obtained perfect distribution had we used weights like
>>> >> this:
>>> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>>> >>
>>> >>
>>> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>>> >>
>>> >> When we apply crush for the first time, distribution of PGs comes as
>>> >> random.
>>> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>> >>
>>> >> But CRUSH is not a random process at all; it behaves in a numerically
>>> >> stable way.
>>> >> Specifically, if we increase weight on one node, we will get more PGs
>>> >> on
>>> >> this node and less on every other node:
>>> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>> >>
>>> >> Now, finding ideal weights can be done by any numerical minimization
>>> >> method,
>>> >> for example NLMS.
>>> >>
>>> >>
>>> >> 3. The proposal
>>> >> For each pool, from initial weights given in crush map perfect weights
>>> >> will
>>> >> be derived.
>>> >> These weights will be used to calculate PG distribution. This of course
>>> >> will
>>> >> be close to perfect.
>>> >>
>>> >> 3a: Downside when OSD is out
>>> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>>> >> Because now weights deviate from OSD capacity, some OSDs will
>>> >> statistically
>>> >> get more copies than they should.
>>> >> This unevenness in distribution is proportional to the scale of
>>> >> deviation of the calculated weights from the capacity weights.
>>> >>
>>> >> 3b: Upside
>>> >> This all can be achieved without changes to crush.
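Adam's procedure can be sketched end-to-end on a toy model. This is not real CRUSH and not NLMS; it substitutes a hypothetical exact-expectation placement model (weighted selection without replacement, enumerated so there is no sampling noise) and a simple damped proportional-feedback update in place of a proper numerical minimizer:

```python
from itertools import permutations

def expected_counts(weights, pgs=200, replicas=3):
    """Exact expected PG counts per OSD for weighted selection without
    replacement -- a toy stand-in for CRUSH, not the real algorithm."""
    n = len(weights)
    prob = [0.0] * n        # P(OSD selected among the `replicas` picks)
    for perm in permutations(range(n), replicas):
        p = 1.0
        remaining = sum(weights)
        for i in perm:      # probability of this ordered pick sequence
            p *= weights[i] / remaining
            remaining -= weights[i]
        for i in perm:
            prob[i] += p
    return [pgs * q for q in prob]

target = [10.0, 10.0, 10.0, 5.0, 5.0]
ideal = [w / sum(target) * 200 * 3 for w in target]  # [150,150,150,75,75]

weights = list(target)
before = max(abs(c - i) for c, i in zip(expected_counts(weights), ideal))
for _ in range(100):
    counts = expected_counts(weights)
    # damped proportional feedback: raise underfilled, lower overfilled
    weights = [w * (i / c) ** 0.5
               for w, c, i in zip(weights, counts, ideal)]
after = max(abs(c - i) for c, i in zip(expected_counts(weights), ideal))
print(before, after)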
>>> >
>>> > Yes!
>>> >
>>> > And no. You're totally right--we should use an offline optimization to
>>> > tweak the crush input weights to get a better balance. It won't be
>>> > robust
>>> > to changes to the cluster, but we can incrementally optimize after that
>>> > happens to converge on something better.
>>> >
>>> > The problem with doing this with current versions of Ceph is that we
>>> > lose
>>> > the original "input" or "target" weights (i.e., the actual size of
>>> > the OSD) that we want to converge on. This is one reason why we haven't
>>> > done something like this before.
>>> >
>>> > In luminous we *could* work around this by storing those canonical
>>> > weights outside of crush using something (probably?) ugly and
>>> > maintain backward compatibility with older clients using existing
>>> > CRUSH behavior.
>>>
>>> These canonical weights could be stored in crush by creating dedicated
>>> buckets. For instance the root-canonical bucket could be created to store
>>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> the difference and know to add a new device in the host01-canonical bucket
>>> instead of the host01 bucket. And to run an offline tool to keep the two
>>> buckets in sync and compute the weight to use for placement derived from the
>>> weights representing the device capacity.
>>>
>>> It is a little bit ugly ;-)
>>>
>>> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>>> > that Pedro is working on works out, we'll want to extend the CRUSH map
>>> > to
>>> > include a set of derivative weights used for actual placement
>>> > calculations
>>> > instead of the canonical target weights, and we can do what you're
>>> > proposing *and* solve the multipick problem with one change in the crush
>>> > map and algorithm. (Actually choosing those derivative weights will
>>> > be an offline process that can both improve the balance for the inputs
>>> > we
>>> > care about *and* adjust them based on the position to fix the skew issue
>>> > for replicas.) This doesn't help pre-luminous clients, but I think the
>>> > end solution will be simpler and more elegant...
>>> >
>>> > What do you think?
>>> >
>>> > sage
>>> >
>>> >
>>> >> 4. Extra
>>> >> Some time ago I made such change to perfectly balance Thomson-Reuters
>>> >> cluster.
>>> >> It succeeded.
>>> >> A solution was not accepted, because modifications of OSD weights were
>>> >> higher than 50%, which was caused by the fact that different placement
>>> >> rules operated on different sets of OSDs, and those sets were not
>>> >> disjoint.
>>> >
>>> >
>>> >>
>>> >> Best regards,
>>> >> Adam
>>> >>
>>> >>
>>> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>> >> Hi Pedro, Loic,
>>> >>
>>> >> For what it's worth, my intuition here (which has had a mixed
>>> >> record as
>>> >> far as CRUSH goes) is that this is the most promising path
>>> >> forward.
>>> >>
>>> >> Thinking ahead a few steps, and confirming that I'm following
>>> >> the
>>> >> discussion so far, if you're able to do get black (or white) box
>>> >> gradient
>>> >> descent to work, then this will give us a set of weights for
>>> >> each item in
>>> >> the tree for each selection round, derived from the tree
>>> >> structure and
>>> >> original (target) weights. That would basically give us a map
>>> >> of item id
>>> >> (bucket id or leaf item id) to weight for each round. i.e.,
>>> >>
>>> >> map<int, map<int, float>> weight_by_position; // position -> item -> weight
>>> >>
>>> >> where the 0 round would (I think?) match the target weights, and
>>> >> each
>>> >> round after that would skew low-weighted items lower to some
>>> >> degree.
>>> >> Right?
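A hypothetical Python analogue of the C++ map Sage sketches, with purely illustrative weights (the position-0 entries as targets, a skewed later round, and a fallback when there are more replicas than stored rounds -- all of this structure is an assumption, not an existing API):

```python
# position -> item -> weight; position 0 holds the target weights and
# later rounds hold the skewed derivative weights. Values are made up
# for illustration only.
weight_by_position = {
    0: {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.1},
    1: {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.034},  # small item skewed lower
}

def weight_for(item, position):
    """Look up the weight to use for `item` on selection round
    `position`, falling back to the last defined round when the map
    has fewer entries than there are replicas."""
    pos = min(position, max(weight_by_position))
    return weight_by_position[pos].get(item, 0.0)
```

The fallback choice (reuse the last round) is one possible design; another would be to require one set of weights per replica in the map.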
>>> >>
>>> >> The next question I have is: does this generalize from the
>>> >> single-bucket
>>> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
>>> >> like
>>> >>
>>> >> 3.1
>>> >> |_____________
>>> >> | \ \ \
>>> >> 1.0 1.0 1.0 .1
>>> >>
>>> >> it clearly works, but when we have a multi-level tree like
>>> >>
>>> >>
>>> >> 8.4
>>> >> |____________________________________
>>> >> | \ \
>>> >> 3.1 3.1 2.2
>>> >> |_____________ |_____________ |_____________
>>> >> | \ \ \ | \ \ \ | \ \ \
>>> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>>> >>
>>> >> and the second round weights skew the small .1 leaves lower, can
>>> >> we
>>> >> continue to build the summed-weight hierarchy, such that the
>>> >> adjusted
>>> >> weights at the higher level are appropriately adjusted to give
>>> >> us the
>>> >> right probabilities of descending into those trees? I'm not
>>> >> sure if that
>>> >> logically follows from the above or if my intuition is
>>> >> oversimplifying
>>> >> things.
>>> >>
>>> >> If this *is* how we think this will shake out, then I'm
>>> >> wondering if we
>>> >> should go ahead and build this weigh matrix into CRUSH sooner
>>> >> rather
>>> >> than later (i.e., for luminous). As with the explicit
>>> >> remappings, the
>>> >> hard part is all done offline, and the adjustments to the CRUSH
>>> >> mapping
>>> >> calculation itself (storing and making use of the adjusted
>>> >> weights for
>>> >> each round of placement) are relatively straightforward. And
>>> >> the sooner
>>> >> this is incorporated into a release the sooner real users will
>>> >> be able to
>>> >> roll out code to all clients and start making use of it.
>>> >>
>>> >> Thanks again for looking at this problem! I'm excited that we
>>> >> may be
>>> >> closing in on a real solution!
>>> >>
>>> >> sage
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>> >>
>>> >> > There are lot of gradient-free methods. I will try first to
>>> >> run the
>>> >> > ones available using just scipy
>>> >> >
>>> >>
>>> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> >> > Some of them don't require the gradient and some of them can
>>> >> estimate
>>> >> > it. The reason to go without the gradient is to run the CRUSH
>>> >> > algorithm as a black box. In that case this would be the
>>> >> pseudo-code:
>>> >> >
>>> >> > - BEGIN CODE -
>>> >> > def build_target(desired_freqs):
>>> >> > def target(weights):
>>> >> > # run a simulation of CRUSH for a number of objects
>>> >> > sim_freqs = run_crush(weights)
>>> >> > # Kullback-Leibler divergence between desired
>>> >> frequencies and
>>> >> > current ones
>>> >> > return loss(sim_freqs, desired_freqs)
>>> >> > return target
>>> >> >
>>> >> > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> >> > - END CODE -
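The quoted pseudo-code can be fleshed out into a runnable toy. Here `run_crush` is a hypothetical sampling simulator standing in for the real algorithm (the real thing would call libcrush), and the two evaluations at the same point illustrate exactly the noise problem described below:

```python
import math
import random

def run_crush(weights, samples=2000, rng=random.Random()):
    """Hypothetical stand-in for CRUSH: sample weighted single-device
    placements and return observed frequencies. Noisy by construction."""
    counts = [0] * len(weights)
    total = sum(weights)
    for _ in range(samples):
        r = rng.random() * total
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        counts[i] += 1
    return [c / samples for c in counts]

def build_target(desired_freqs):
    def target(weights):
        sim_freqs = run_crush(weights)
        # Kullback-Leibler divergence between desired and simulated
        # frequencies (terms with an empty simulated bin are skipped)
        return sum(p * math.log(p / q)
                   for p, q in zip(desired_freqs, sim_freqs) if q > 0)
    return target

loss = build_target([0.5, 0.3, 0.2])
l1 = loss([5.0, 3.0, 2.0])   # weights matching the desired frequencies
l2 = loss([9.0, 0.5, 0.5])   # badly skewed weights
# Two evaluations at the same point generally differ: the target is
# noisy, which is why finite-difference gradients struggle with it.
print(loss([5.0, 3.0, 2.0]), loss([5.0, 3.0, 2.0]))
```

A minimizer would then be applied to `loss` as in the pseudo-code, ideally one that tolerates noisy objectives.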
>>> >> >
>>> >> > The tricky thing here is that this procedure can be slow if
>>> >> the
>>> >> > simulation (run_crush) needs to place a lot of objects to get
>>> >> accurate
>>> >> > simulated frequencies. This is true especially if the minimize
>>> >> method
>>> >> > attempts to approximate the gradient using finite differences
>>> >> since it
>>> >> > will evaluate the target function a number of times
>>> >> proportional to
>>> >> > the number of weights. Apart from the ones in scipy I would
>>> >> try also
>>> >> > optimization methods that try to perform as few evaluations as
>>> >> > possible like for example HyperOpt
>>> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>> >> into
>>> >> > account that the target function can be noisy.
>>> >> >
>>> >> > This black box approximation is simple to implement and makes
>>> >> the
>>> >> > computer do all the work instead of us.
>>> >> > I think that this black box approximation is worthy to try
>>> >> even if
>>> >> > it's not the final one because if this approximation works
>>> >> then we
>>> >> > know that a more elaborate one that computes the gradient of
>>> >> the CRUSH
>>> >> > algorithm will work for sure.
>>> >> >
>>> >> > I can try this black box approximation this weekend not on the
>>> >> real
>>> >> > CRUSH algorithm but with the simple implementation I did in
>>> >> python. If
>>> >> > it works it's just a matter of substituting one simulation
>>> >> with
>>> >> > another and see what happens.
>>> >> >
>>> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >> > > Hi Pedro,
>>> >> > >
>>> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> >> > >> Hi Loic,
>>> >> > >>
>>> >> > >> From what I see everything seems OK.
>>> >> > >
>>> >> > > Cool. I'll keep going in this direction then !
>>> >> > >
>>> >> > >> The interesting thing would be to
>>> >> > >> test on some complex mapping. The reason is that
>>> >> "CrushPolicyFamily"
>>> >> > >> is right now modeling just a single straw bucket not the
>>> >> full CRUSH
>>> >> > >> algorithm.
>>> >> > >
>>> >> > > A number of use cases use a single straw bucket, maybe the
>>> >> majority of them. Even though it does not reflect the full range
>>> >> of what crush can offer, it could be useful. To be more
>>> >> specific, a crush map that states "place objects so that there
>>> >> is at most one replica per host" or "one replica per rack" is
>>> >> common. Such a crushmap can be reduced to a single straw bucket
>>> >> that contains all the hosts and by using the CrushPolicyFamily,
>>> >> we can change the weights of each host to fix the probabilities.
>>> >> The hosts themselves contain disks with varying weights but I
>>> >> think we can ignore that because crush will only recurse to
>>> >> place one object within a given host.
>>> >> > >
>>> >> > >> That's the work that remains to be done. The only way that
>>> >> > >> would avoid reimplementing the CRUSH algorithm and
>>> >> computing the
>>> >> > >> gradient would be treating CRUSH as a black box and
>>> >> eliminating the
>>> >> > >> necessity of computing the gradient either by using a
>>> >> gradient-free
>>> >> > >> optimization method or making an estimation of the
>>> >> gradient.
>>> >> > >
>>> >> > > By gradient-free optimization you mean simulated annealing
>>> >> or Monte Carlo ?
>>> >> > >
>>> >> > > Cheers
>>> >> > >
>>> >> > >>
>>> >> > >>
>>> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >> > >>> Hi,
>>> >> > >>>
>>> >> > >>> I modified the crush library to accept two weights (one
>>> >> for the first disk, the other for the remaining disks)[1]. This
>>> >> really is a hack for experimentation purposes only ;-) I was
>>> >> able to run a variation of your code[2] and got the following
>>> >> results which are encouraging. Do you think what I did is
>>> >> sensible ? Or is there a problem I don't see ?
>>> >> > >>>
>>> >> > >>> Thanks !
>>> >> > >>>
>>> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
>>> >> 6]
>>> >> > >>>
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >> > >>> Before: All replicas on each hard drive
>>> >> > >>> Expected vs actual use (20000 samples)
>>> >> > >>> disk 0: 1.39e-01 1.12e-01
>>> >> > >>> disk 1: 1.11e-01 1.10e-01
>>> >> > >>> disk 2: 8.33e-02 1.13e-01
>>> >> > >>> disk 3: 1.39e-01 1.11e-01
>>> >> > >>> disk 4: 1.11e-01 1.11e-01
>>> >> > >>> disk 5: 8.33e-02 1.11e-01
>>> >> > >>> disk 6: 1.39e-01 1.12e-01
>>> >> > >>> disk 7: 1.11e-01 1.12e-01
>>> >> > >>> disk 8: 8.33e-02 1.10e-01
>>> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>> >> > >>> ...
>>> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>> >> > >>> Converged to desired accuracy :)
>>> >> > >>> After: All replicas on each hard drive
>>> >> > >>> Expected vs actual use (20000 samples)
>>> >> > >>> disk 0: 1.39e-01 1.42e-01
>>> >> > >>> disk 1: 1.11e-01 1.09e-01
>>> >> > >>> disk 2: 8.33e-02 8.37e-02
>>> >> > >>> disk 3: 1.39e-01 1.40e-01
>>> >> > >>> disk 4: 1.11e-01 1.13e-01
>>> >> > >>> disk 5: 8.33e-02 8.08e-02
>>> >> > >>> disk 6: 1.39e-01 1.38e-01
>>> >> > >>> disk 7: 1.11e-01 1.09e-01
>>> >> > >>> disk 8: 8.33e-02 8.48e-02
>>> >> > >>>
>>> >> > >>>
>>> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>> >> > >>>
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >> > >>> Before: All replicas on each hard drive
>>> >> > >>> Expected vs actual use (20000 samples)
>>> >> > >>> disk 0: 2.44e-01 2.36e-01
>>> >> > >>> disk 1: 2.44e-01 2.38e-01
>>> >> > >>> disk 2: 2.44e-01 2.34e-01
>>> >> > >>> disk 3: 2.44e-01 2.38e-01
>>> >> > >>> disk 4: 2.44e-02 5.37e-02
>>> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>> >> > >>> ...
>>> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>> >> > >>> Converged to desired accuracy :)
>>> >> > >>> After: All replicas on each hard drive
>>> >> > >>> Expected vs actual use (20000 samples)
>>> >> > >>> disk 0: 2.44e-01 2.46e-01
>>> >> > >>> disk 1: 2.44e-01 2.44e-01
>>> >> > >>> disk 2: 2.44e-01 2.41e-01
>>> >> > >>> disk 3: 2.44e-01 2.45e-01
>>> >> > >>> disk 4: 2.44e-02 2.33e-02
>>> >> > >>>
>>> >> > >>>
>>> >> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>> >> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>> >> > >>>
>>> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> >> > >>>> Hi Pedro,
>>> >> > >>>>
>>> >> > >>>> It looks like trying to experiment with crush won't work
>>> >> as expected because crush does not distinguish the probability
>>> >> of selecting the first device from the probability of selecting
>>> >> the second or third device. Am I mistaken ?
>>> >> > >>>>
>>> >> > >>>> Cheers
>>> >> > >>>>
>>> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> >> > >>>>> Hi Pedro,
>>> >> > >>>>>
>>> >> > >>>>> I'm going to experiment with what you did at
>>> >> > >>>>>
>>> >> > >>>>>
>>> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> >> > >>>>>
>>> >> > >>>>> and the latest python-crush published today. A
>>> >> comparison function was added that will help measure the data
>>> >> movement. I'm hoping we can release an offline tool based on
>>> >> your solution. Please let me know if I should wait before diving
>>> >> into this, in case you have unpublished drafts or new ideas.
>>> >> > >>>>>
>>> >> > >>>>> Cheers
>>> >> > >>>>>
>>> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> >> > >>>>>> Great, thanks for the clarifications.
>>> >> > >>>>>> I also think that the most natural way is to keep just
>>> >> a set of
>>> >> > >>>>>> weights in the CRUSH map and update them inside the
>>> >> algorithm.
>>> >> > >>>>>>
>>> >> > >>>>>> I keep working on it.
>>> >> > >>>>>>
>>> >> > >>>>>>
>>> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>> >> <sage@newdream.net>:
>>> >> > >>>>>>> Hi Pedro,
>>> >> > >>>>>>>
>>> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
>>> >> problem and we
>>> >> > >>>>>>> haven't made much headway.
>>> >> > >>>>>>>
>>> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> >> > >>>>>>>> Hi,
>>> >> > >>>>>>>>
>>> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
>>> >> much but I have
>>> >> > >>>>>>>> been thinking about it. In order to adapt the
>>> >> previous algorithm in
>>> >> > >>>>>>>> the python notebook I need to replace the iteration over all
>>> >> > >>>>>>>> possible device permutations with iteration over all the possible
>>> >> > >>>>>>>> selections that crush would make. That is the main
>>> >> thing I need to
>>> >> > >>>>>>>> work on.
>>> >> > >>>>>>>>
>>> >> > >>>>>>>> The other thing is of course that weights change for
>>> >> each replica.
>>> >> > >>>>>>>> That is, they cannot be really fixed in the crush
>>> >> map. So the
>>> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
>>> >> the map, need to be
>>> >> > >>>>>>>> changed. The weights in the crush map should reflect
>>> >> then, maybe, the
>>> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
>>> >> should have their own
>>> >> > >>>>>>>> crush map, but then the information about the
>>> >> previous selection
>>> >> > >>>>>>>> should be passed to the next replica placement run so
>>> >> it avoids
>>> >> > >>>>>>>> selecting the same one again.
>>> >> > >>>>>>>
>>> >> > >>>>>>> My suspicion is that the best solution here (whatever
>>> >> that means!)
>>> >> > >>>>>>> leaves the CRUSH weights intact with the desired
>>> >> distribution, and
>>> >> > >>>>>>> then generates a set of derivative weights--probably
>>> >> one set for each
>>> >> > >>>>>>> round/replica/rank.
>>> >> > >>>>>>>
>>> >> > >>>>>>> One nice property of this is that once the support is
>>> >> added to encode
>>> >> > >>>>>>> multiple sets of weights, the algorithm used to
>>> >> generate them is free to
>>> >> > >>>>>>> change and evolve independently. (In most cases any
>>> >> change is
>>> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>> >> because all
>>> >> > >>>>>>> parties participating in the cluster have to support
>>> >> any new behavior
>>> >> > >>>>>>> before it is enabled or used.)
>>> >> > >>>>>>>
>>> >> > >>>>>>>> I have a question also. Is there any significant
>>> >> difference between
>>> >> > >>>>>>>> the device selection algorithm description in the
>>> >> paper and its final
>>> >> > >>>>>>>> implementation?
>>> >> > >>>>>>>
>>> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
>>> >> found to be a bad
>>> >> > >>>>>>> idea; any collision or failed()/overload() case
>>> >> triggers the
>>> >> > >>>>>>> retry_descent.
>>> >> > >>>>>>>
>>> >> > >>>>>>> There are other changes, of course, but I don't think
>>> >> they'll impact any
>>> >> > >>>>>>> solution we come with here (or at least any solution
>>> >> can be suitably
>>> >> > >>>>>>> adapted)!
>>> >> > >>>>>>>
>>> >> > >>>>>>> sage
>>> >> > >>>>>>
>>> >> > >>>>>
>>> >> > >>>>
>>> >> > >>>
>>> >> > >>> --
>>> >> > >>> Loïc Dachary, Artisan Logiciel Libre
>>> >> > >>
>>> >> > >
>>> >> > > --
>>> >> > > Loïc Dachary, Artisan Logiciel Libre
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-27 6:45 ` Loic Dachary
[not found] ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
@ 2017-03-27 13:24 ` Sage Weil
1 sibling, 0 replies; 70+ messages in thread
From: Sage Weil @ 2017-03-27 13:24 UTC (permalink / raw)
To: Loic Dachary; +Cc: Adam Kupczyk, Ceph Development
[-- Attachment #1: Type: TEXT/PLAIN, Size: 23328 bytes --]
On Mon, 27 Mar 2017, Loic Dachary wrote:
> On 03/27/2017 04:33 AM, Sage Weil wrote:
> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> >> Hello Sage, Loic, Pedro,
> >>
> >>
> >> I am certain that almost perfect mapping can be achieved by
> >> substituting weights from crush map with slightly modified weights.
> >> By perfect mapping I mean we get on each OSD a number of PGs exactly
> >> proportional to weights specified in crush map.
> >>
> >> 1. Example
> >> Let's think of the PGs of a single object pool.
> >> We have OSDs with following weights:
> >> [10, 10, 10, 5, 5]
> >>
> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
> >> PGcopies :
> >> [150, 150, 150, 75, 75]
> >>
> >> However, because crush simulates random process we have:
> >> [143, 152, 158, 71, 76]
> >>
> >> We could have obtained perfect distribution had we used weights like this:
> >> [10.2, 9.9, 9.6, 5.2, 4.9]
> >>
> >>
> >> 2. Obtaining perfect mapping weights from OSD capacity weights
> >>
> >> When we apply crush for the first time, distribution of PGs comes as random.
> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> >>
> >> But CRUSH is not a random process at all; it behaves in a numerically stable way.
> >> Specifically, if we increase weight on one node, we will get more PGs on
> >> this node and less on every other node:
> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> >>
> >> Now, finding ideal weights can be done by any numerical minimization method,
> >> for example NLMS.
> >>
> >>
> >> 3. The proposal
> >> For each pool, from initial weights given in crush map perfect weights will
> >> be derived.
> >> These weights will be used to calculate the PG distribution. This of course will
> >> be close to perfect.
> >>
> >> 3a: Downside when OSD is out
> >> When an OSD is out, missing PG copies will be replicated elsewhere.
> >> Because now weights deviate from OSD capacity, some OSDs will statistically
> >> get more copies then they should.
> >> This unevenness in distribution is proportional to scale of deviation of
> >> calculated weights to capacity weights.
> >>
> >> 3b: Upside
> >> This all can be achieved without changes to crush.
> >
> > Yes!
> >
> > And no. You're totally right--we should use an offline optimization to
> > tweak the crush input weights to get a better balance. It won't be robust
> > to changes to the cluster, but we can incrementally optimize after that
> > happens to converge on something better.
> >
> > The problem with doing this with current versions of Ceph is that we lose
> > the original "input" or "target" weights (i.e., the actual size of
> > the OSD) that we want to converge on. This is one reason why we haven't
> > done something like this before.
> >
> > In luminous we *could* work around this by storing those canonical
> > weights outside of crush using something (probably?) ugly and
> > maintain backward compatibility with older clients using existing
> > CRUSH behavior.
>
> These canonical weights could be stored in crush by creating dedicated buckets. For instance the root-canonical bucket could be created to store the canonical weights of the root bucket. The sysadmin needs to be aware of the difference and know to add a new device in the host01-canonical bucket instead of the host01 bucket. And to run an offline tool to keep the two buckets in sync and compute the weight to use for placement derived from the weights representing the device capacity.
Oh, right! I should have looked at the PR more closely.
> It is a little bit ugly ;-)
A bit, but it could be worse. And we can kludge ceph to hide the
derivative buckets in things like 'osd tree'. I'd probably flip it around
and keep the existing buckets as the 'canonical' ones, and create new
~adjusted buckets, or some similar naming like we are doing with the
device classes.
If there is an offline crush weight optimizer, it can either do the
somewhat ugly parallel hierarchy, or if the crush encoding is luminous+ it
can make use of the new (coming) weight matrix...
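To make the weight-matrix idea concrete, here is a minimal sketch (illustrative Python only, not the real libcrush code; the straw-style draw and all names are simplifications) of a weight_by_position table driving per-round selection, with the second-round weights adjusted by the w/(W-w) conditional-probability correction from the start of this thread:

```python
import hashlib

def second_round_weights(target):
    """Adjusted weights so P(pick i | first pick != i) matches target_i / W."""
    total = sum(target.values())
    return {item: w / (total - w) for item, w in target.items()}

def draw(weight_by_position, position, pg, excluded=()):
    """Straw-like weighted draw: each item gets a hash-derived uniform value
    keyed by (pg, position, item); the highest weight-scaled straw wins."""
    best, best_straw = None, -1.0
    for item, w in weight_by_position[position].items():
        if item in excluded or w <= 0:
            continue
        h = hashlib.sha256(f"{pg}:{position}:{item}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / 2**64   # uniform in (0, 1]
        straw = u ** (1.0 / w)   # weighted-random-sampling key
        if straw > best_straw:
            best, best_straw = item, straw
    return best

# round 0 uses the target weights; round 1 uses the adjusted ones
target = {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.1}
weight_by_position = {0: target, 1: second_round_weights(target)}

for pg in range(5):
    first = draw(weight_by_position, 0, pg)
    second = draw(weight_by_position, 1, pg, excluded={first})
    print(pg, "->", [first, second])
```

Extending this to more rounds would just add more rows to the table; how those rows are derived (gradient descent, black-box optimization, ...) stays an offline concern.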
sage
>
> > OR, (and this is my preferred route), if the multi-pick anomaly approach
> > that Pedro is working on works out, we'll want to extend the CRUSH map to
> > include a set of derivative weights used for actual placement calculations
> > instead of the canonical target weights, and we can do what you're
> > proposing *and* solve the multipick problem with one change in the crush
> > map and algorithm. (Actually choosing those derivative weights will
> > be an offline process that can both improve the balance for the inputs we
> > care about *and* adjust them based on the position to fix the skew issue
> > for replicas.) This doesn't help pre-luminous clients, but I think the
> > end solution will be simpler and more elegant...
> >
> > What do you think?
> >
> > sage
> >
> >
> >> 4. Extra
> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
> >> cluster.
> >> It succeeded.
> >> A solution was not accepted, because the modification of OSD weights was
> >> higher than 50%, which was caused by the fact that different placement
> >> rules operated on different sets of OSDs, and those sets were not disjoint.
> >
> >
> >>
> >> Best regards,
> >> Adam
> >>
> >>
> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> >> Hi Pedro, Loic,
> >>
> >> For what it's worth, my intuition here (which has had a mixed
> >> record as
> >> far as CRUSH goes) is that this is the most promising path
> >> forward.
> >>
> >> Thinking ahead a few steps, and confirming that I'm following
> >> the
> >> discussion so far, if you're able to get black (or white) box
> >> gradient
> >> descent to work, then this will give us a set of weights for
> >> each item in
> >> the tree for each selection round, derived from the tree
> >> structure and
> >> original (target) weights. That would basically give us a map
> >> of item id
> >> (bucket id or leaf item id) to weight for each round. i.e.,
> >>
> >> map<int, map<int, float>> weight_by_position;  // position -> item -> weight
> >>
> >> where the 0 round would (I think?) match the target weights, and
> >> each
> >> round after that would skew low-weighted items lower to some
> >> degree.
> >> Right?
> >>
> >> The next question I have is: does this generalize from the
> >> single-bucket
> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
> >> like
> >>
> >> 3.1
> >> |_____________
> >> | \ \ \
> >> 1.0 1.0 1.0 .1
> >>
> >> it clearly works, but when we have a multi-level tree like
> >>
> >>
> >> 8.4
> >> |____________________________________
> >> | \ \
> >> 3.1 3.1 2.2
> >> |_____________ |_____________ |_____________
> >> | \ \ \ | \ \ \ | \ \ \
> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
> >>
> >> and the second round weights skew the small .1 leaves lower, can
> >> we
> >> continue to build the summed-weight hierarchy, such that the
> >> adjusted
> >> weights at the higher level are appropriately adjusted to give
> >> us the
> >> right probabilities of descending into those trees? I'm not
> >> sure if that
> >> logically follows from the above or if my intuition is
> >> oversimplifying
> >> things.
> >>
> >> If this *is* how we think this will shake out, then I'm
> >> wondering if we
> >> should go ahead and build this weight matrix into CRUSH sooner
> >> rather
> >> than later (i.e., for luminous). As with the explicit
> >> remappings, the
> >> hard part is all done offline, and the adjustments to the CRUSH
> >> mapping
> >> calculation itself (storing and making use of the adjusted
> >> weights for
> >> each round of placement) are relatively straightforward. And
> >> the sooner
> >> this is incorporated into a release the sooner real users will
> >> be able to
> >> roll out code to all clients and start making use of it.
> >>
> >> Thanks again for looking at this problem! I'm excited that we
> >> may be
> >> closing in on a real solution!
> >>
> >> sage
> >>
> >>
> >>
> >>
> >>
> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> >>
> >> > There are lot of gradient-free methods. I will try first to
> >> run the
> >> > ones available using just scipy
> >> >
> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> >> > Some of them don't require the gradient and some of them can
> >> estimate
> >> > it. The reason to go without the gradient is to run the CRUSH
> >> > algorithm as a black box. In that case this would be the
> >> pseudo-code:
> >> >
> >> > - BEGIN CODE -
> >> > def build_target(desired_freqs):
> >> >     def target(weights):
> >> >         # run a simulation of CRUSH for a number of objects
> >> >         sim_freqs = run_crush(weights)
> >> >         # Kullback-Leibler divergence between desired frequencies
> >> >         # and current ones
> >> >         return loss(sim_freqs, desired_freqs)
> >> >     return target
> >> >
> >> > weights = scipy.optimize.minimize(build_target(desired_freqs))
> >> > - END CODE -
> >> >
> >> > The tricky thing here is that this procedure can be slow if
> >> the
> >> > simulation (run_crush) needs to place a lot of objects to get
> >> accurate
> >> > simulated frequencies. This is true especially if the minimize
> >> method
> >> > attempts to approximate the gradient using finite differences
> >> since it
> >> > will evaluate the target function a number of times
> >> proportional to
> >> > the number of weights. Apart from the ones in scipy I would
> >> try also
> >> > optimization methods that try to perform as few evaluations as
> >> > possible like for example HyperOpt
> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
> >> into
> >> > account that the target function can be noisy.
> >> >
> >> > This black box approximation is simple to implement and makes
> >> the
> >> > computer do all the work instead of us.
> >> > I think that this black box approximation is worth trying
> >> even if
> >> > it's not the final one because if this approximation works
> >> then we
> >> > know that a more elaborate one that computes the gradient of
> >> the CRUSH
> >> > algorithm will work for sure.
> >> >
> >> > I can try this black box approximation this weekend not on the
> >> real
> >> > CRUSH algorithm but with the simple implementation I did in
> >> python. If
> >> > it works it's just a matter of substituting one simulation
> >> with
> >> > another and see what happens.
> >> >
> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> > > Hi Pedro,
> >> > >
> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> > >> Hi Loic,
> >> > >>
> >> > >> From what I see everything seems OK.
> >> > >
> >> > > Cool. I'll keep going in this direction then !
> >> > >
> >> > >> The interesting thing would be to
> >> > >> test on some complex mapping. The reason is that
> >> "CrushPolicyFamily"
> >> > >> is right now modeling just a single straw bucket not the
> >> full CRUSH
> >> > >> algorithm.
> >> > >
> >> > > A number of use cases use a single straw bucket, maybe the
> >> majority of them. Even though it does not reflect the full range
> >> of what crush can offer, it could be useful. To be more
> >> specific, a crush map that states "place objects so that there
> >> is at most one replica per host" or "one replica per rack" is
> >> common. Such a crushmap can be reduced to a single straw bucket
> >> that contains all the hosts and by using the CrushPolicyFamily,
> >> we can change the weights of each host to fix the probabilities.
> >> The hosts themselves contain disks with varying weights but I
> >> think we can ignore that because crush will only recurse to
> >> place one object within a given host.
> >> > >
> >> > >> That's the work that remains to be done. The only way that
> >> > >> would avoid reimplementing the CRUSH algorithm and
> >> computing the
> >> > >> gradient would be treating CRUSH as a black box and
> >> eliminating the
> >> > >> necessity of computing the gradient either by using a
> >> gradient-free
> >> > >> optimization method or making an estimation of the
> >> gradient.
> >> > >
> >> > > By gradient-free optimization you mean simulated annealing
> >> or Monte Carlo ?
> >> > >
> >> > > Cheers
> >> > >
> >> > >>
> >> > >>
> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> > >>> Hi,
> >> > >>>
> >> > >>> I modified the crush library to accept two weights (one
> >> for the first disk, the other for the remaining disks)[1]. This
> >> really is a hack for experimentation purposes only ;-) I was
> >> able to run a variation of your code[2] and got the following
> >> results which are encouraging. Do you think what I did is
> >> sensible ? Or is there a problem I don't see ?
> >> > >>>
> >> > >>> Thanks !
> >> > >>>
> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
> >> 6]
> >> > >>>
> >> ------------------------------------------------------------------------
> >> > >>> Before: All replicas on each hard drive
> >> > >>> Expected vs actual use (20000 samples)
> >> > >>> disk 0: 1.39e-01 1.12e-01
> >> > >>> disk 1: 1.11e-01 1.10e-01
> >> > >>> disk 2: 8.33e-02 1.13e-01
> >> > >>> disk 3: 1.39e-01 1.11e-01
> >> > >>> disk 4: 1.11e-01 1.11e-01
> >> > >>> disk 5: 8.33e-02 1.11e-01
> >> > >>> disk 6: 1.39e-01 1.12e-01
> >> > >>> disk 7: 1.11e-01 1.12e-01
> >> > >>> disk 8: 8.33e-02 1.10e-01
> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
> >> > >>> ...
> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
> >> > >>> Converged to desired accuracy :)
> >> > >>> After: All replicas on each hard drive
> >> > >>> Expected vs actual use (20000 samples)
> >> > >>> disk 0: 1.39e-01 1.42e-01
> >> > >>> disk 1: 1.11e-01 1.09e-01
> >> > >>> disk 2: 8.33e-02 8.37e-02
> >> > >>> disk 3: 1.39e-01 1.40e-01
> >> > >>> disk 4: 1.11e-01 1.13e-01
> >> > >>> disk 5: 8.33e-02 8.08e-02
> >> > >>> disk 6: 1.39e-01 1.38e-01
> >> > >>> disk 7: 1.11e-01 1.09e-01
> >> > >>> disk 8: 8.33e-02 8.48e-02
> >> > >>>
> >> > >>>
> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
> >> > >>>
> >> ------------------------------------------------------------------------
> >> > >>> Before: All replicas on each hard drive
> >> > >>> Expected vs actual use (20000 samples)
> >> > >>> disk 0: 2.44e-01 2.36e-01
> >> > >>> disk 1: 2.44e-01 2.38e-01
> >> > >>> disk 2: 2.44e-01 2.34e-01
> >> > >>> disk 3: 2.44e-01 2.38e-01
> >> > >>> disk 4: 2.44e-02 5.37e-02
> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
> >> > >>> ...
> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
> >> > >>> Converged to desired accuracy :)
> >> > >>> After: All replicas on each hard drive
> >> > >>> Expected vs actual use (20000 samples)
> >> > >>> disk 0: 2.44e-01 2.46e-01
> >> > >>> disk 1: 2.44e-01 2.44e-01
> >> > >>> disk 2: 2.44e-01 2.41e-01
> >> > >>> disk 3: 2.44e-01 2.45e-01
> >> > >>> disk 4: 2.44e-02 2.33e-02
> >> > >>>
> >> > >>>
> >> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >> > >>>
> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >> > >>>> Hi Pedro,
> >> > >>>>
> >> > >>>> It looks like trying to experiment with crush won't work
> >> as expected because crush does not distinguish the probability
> >> of selecting the first device from the probability of selecting
> >> the second or third device. Am I mistaken ?
> >> > >>>>
> >> > >>>> Cheers
> >> > >>>>
> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >> > >>>>> Hi Pedro,
> >> > >>>>>
> >> > >>>>> I'm going to experiment with what you did at
> >> > >>>>>
> >> > >>>>>
> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >> > >>>>>
> >> > >>>>> and the latest python-crush published today. A
> >> comparison function was added that will help measure the data
> >> movement. I'm hoping we can release an offline tool based on
> >> your solution. Please let me know if I should wait before diving
> >> into this, in case you have unpublished drafts or new ideas.
> >> > >>>>>
> >> > >>>>> Cheers
> >> > >>>>>
> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >> > >>>>>> Great, thanks for the clarifications.
> >> > >>>>>> I also think that the most natural way is to keep just
> >> a set of
> >> > >>>>>> weights in the CRUSH map and update them inside the
> >> algorithm.
> >> > >>>>>>
> >> > >>>>>> I keep working on it.
> >> > >>>>>>
> >> > >>>>>>
> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> >> <sage@newdream.net>:
> >> > >>>>>>> Hi Pedro,
> >> > >>>>>>>
> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
> >> problem and we
> >> > >>>>>>> haven't made much headway.
> >> > >>>>>>>
> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >> > >>>>>>>> Hi,
> >> > >>>>>>>>
> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
> >> much but I have
> >> > >>>>>>>> been thinking about it. In order to adapt the
> >> previous algorithm in
> >> > >>>>>>>> the python notebook I need to substitute the
> >> iteration over all
> >> > >>>>>>>> possible devices permutations to iteration over all
> >> the possible
> >> > >>>>>>>> selections that crush would make. That is the main
> >> thing I need to
> >> > >>>>>>>> work on.
> >> > >>>>>>>>
> >> > >>>>>>>> The other thing is of course that weights change for
> >> each replica.
> >> > >>>>>>>> That is, they cannot be really fixed in the crush
> >> map. So the
> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
> >> the map, need to be
> >> > >>>>>>>> changed. The weights in the crush map should reflect
> >> then, maybe, the
> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
> >> should have their own
> >> > >>>>>>>> crush map, but then the information about the
> >> previous selection
> >> > >>>>>>>> should be passed to the next replica placement run so
> >> it avoids
> >> > >>>>>>>> selecting the same one again.
> >> > >>>>>>>
> >> > >>>>>>> My suspicion is that the best solution here (whatever
> >> that means!)
> >> > >>>>>>> leaves the CRUSH weights intact with the desired
> >> distribution, and
> >> > >>>>>>> then generates a set of derivative weights--probably
> >> one set for each
> >> > >>>>>>> round/replica/rank.
> >> > >>>>>>>
> >> > >>>>>>> One nice property of this is that once the support is
> >> added to encode
> >> > >>>>>>> multiple sets of weights, the algorithm used to
> >> generate them is free to
> >> > >>>>>>> change and evolve independently. (In most cases any
> >> > >>>>>>> change in
> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> >> because all
> >> > >>>>>>> parties participating in the cluster have to support
> >> any new behavior
> >> > >>>>>>> before it is enabled or used.)
> >> > >>>>>>>
> >> > >>>>>>>> I have a question also. Is there any significant
> >> difference between
> >> > >>>>>>>> the device selection algorithm description in the
> >> paper and its final
> >> > >>>>>>>> implementation?
> >> > >>>>>>>
> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
> >> found to be a bad
> >> > >>>>>>> idea; any collision or failed()/overload() case
> >> triggers the
> >> > >>>>>>> retry_descent.
> >> > >>>>>>>
> >> > >>>>>>> There are other changes, of course, but I don't think
> >> they'll impact any
> >> > >>>>>>> solution we come with here (or at least any solution
> >> can be suitably
> >> > >>>>>>> adapted)!
> >> > >>>>>>>
> >> > >>>>>>> sage
> >> > >>>>>>
> >> > >>>>>
> >> > >>>>
> >> > >>>
> >> > >>> --
> >> > >>> Loïc Dachary, Artisan Logiciel Libre
> >> > >>
> >> > >
> >> > > --
> >> > > Loïc Dachary, Artisan Logiciel Libre
> >> >
> >> >
> >>
> >>
> >>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>
* Re: crush multipick anomaly
2017-03-27 9:27 ` Adam Kupczyk
2017-03-27 10:29 ` Loic Dachary
2017-03-27 10:37 ` Pedro López-Adeva
@ 2017-03-27 13:39 ` Sage Weil
2017-03-28 6:52 ` Adam Kupczyk
2 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-27 13:39 UTC (permalink / raw)
To: Adam Kupczyk; +Cc: Loic Dachary, Ceph Development
[-- Attachment #1: Type: TEXT/PLAIN, Size: 26927 bytes --]
On Mon, 27 Mar 2017, Adam Kupczyk wrote:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines
> amount of samples
>
> Therefore tweaked weights should rather be a property of the instantiated
> pool, not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate on?
As Loic suggested, you can create as many derivative hierarchies in the
crush map as you like, potentially one per pool. Or you could treat the
sum total of all pgs as the interesting set, balance those, and get some
OSDs doing a bit more of one pool than another. The new post-CRUSH OSD
remap capability can always clean this up (and turn a "good" crush
distribution into a perfect distribution).
I guess the question is: when we add the explicit adjusted weight matrix
to crush, should we have multiple sets of weights (perhaps one for each
pool), or simply a single global set? It might make sense to allow N
sets of adjusted weights so that crush users can choose a particular
set of them for different pools (or whatever it is they're calculating the
mapping for).
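As a sketch of what "N sets of adjusted weights" could look like (all names here are hypothetical; the tweaked numbers are Adam's example from earlier in the thread), pools would reference a named weight set and fall back to the canonical target weights otherwise:

```python
# Canonical target weights: the actual device capacities.
canonical = {"osd.0": 10.0, "osd.1": 10.0, "osd.2": 10.0,
             "osd.3": 5.0, "osd.4": 5.0}

# N named sets of adjusted weights, derived offline; "default" is the
# canonical set, "rbd-balanced" uses the tweaked weights from Adam's example.
weight_sets = {
    "default": canonical,
    "rbd-balanced": {"osd.0": 10.2, "osd.1": 9.9, "osd.2": 9.6,
                     "osd.3": 5.2, "osd.4": 4.9},
}

# Each pool optionally names the weight set its mapping should use.
pool_weight_set = {"rbd": "rbd-balanced"}

def weights_for_pool(pool):
    """Weights the placement calculation would use for this pool."""
    return weight_sets[pool_weight_set.get(pool, "default")]

print(weights_for_pool("rbd")["osd.0"])          # 10.2 (adjusted)
print(weights_for_pool("cephfs_data")["osd.0"])  # 10.0 (canonical fallback)
```

Pre-luminous clients would simply never see the extra sets, which is the compatibility story anyway.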
sage
>
> Best regards,
> Adam
>
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> > Hi,
> >
> > My understanding is that optimal tweaked weights will depend on:
> > 1) pool_id, because of rjenkins(pool_id) in crush
> > 2) number of placement groups and replication factor, as it determines
> > amount of samples
> >
> > Therefore tweaked weights should rather be a property of the instantiated
> > pool, not of the crush placement definition.
> >
> > If tweaked weights are to be part of the crush definition, then for each
> > created pool we need to have a separate list of weights.
> > Is it possible to provide clients with different weights depending on
> > which pool they want to operate on?
> >
> > Best regards,
> > Adam
> >
> >
> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
> >>
> >>
> >>
> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> >> >> Hello Sage, Loic, Pedro,
> >> >>
> >> >>
> >> >> I am certain that almost perfect mapping can be achieved by
> >> >> substituting weights from crush map with slightly modified weights.
> >> >> By perfect mapping I mean we get on each OSD a number of PGs exactly
> >> >> proportional to weights specified in crush map.
> >> >>
> >> >> 1. Example
> >> >> Let's think of the PGs of a single object pool.
> >> >> We have OSDs with following weights:
> >> >> [10, 10, 10, 5, 5]
> >> >>
> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
> >> >> PGcopies :
> >> >> [150, 150, 150, 75, 75]
> >> >>
> >> >> However, because crush simulates random process we have:
> >> >> [143, 152, 158, 71, 76]
> >> >>
> >> >> We could have obtained perfect distribution had we used weights like
> >> >> this:
> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
> >> >>
> >> >>
> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
> >> >>
> >> >> When we apply crush for the first time, distribution of PGs comes as
> >> >> random.
> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> >> >>
> >> >> But CRUSH is not a random process at all; it behaves in a numerically
> >> >> stable way.
> >> >> Specifically, if we increase weight on one node, we will get more PGs
> >> >> on
> >> >> this node and less on every other node:
> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> >> >>
> >> >> Now, finding ideal weights can be done by any numerical minimization
> >> >> method,
> >> >> for example NLMS.
> >> >>
> >> >>
> >> >> 3. The proposal
> >> >> For each pool, from initial weights given in crush map perfect weights
> >> >> will
> >> >> be derived.
> >> >> These weights will be used to calculate the PG distribution. This of course
> >> >> will
> >> >> be close to perfect.
> >> >>
> >> >> 3a: Downside when OSD is out
> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
> >> >> Because now weights deviate from OSD capacity, some OSDs will
> >> >> statistically
> >> >> get more copies than they should.
> >> >> This unevenness in distribution is proportional to scale of deviation
> >> >> of
> >> >> calculated weights to capacity weights.
> >> >>
> >> >> 3b: Upside
> >> >> This all can be achieved without changes to crush.
> >> >
> >> > Yes!
> >> >
> >> > And no. You're totally right--we should use an offline optimization to
> >> > tweak the crush input weights to get a better balance. It won't be
> >> > robust
> >> > to changes to the cluster, but we can incrementally optimize after that
> >> > happens to converge on something better.
> >> >
> >> > The problem with doing this with current versions of Ceph is that we
> >> > lose
> >> > the original "input" or "target" weights (i.e., the actual size of
> >> > the OSD) that we want to converge on. This is one reason why we haven't
> >> > done something like this before.
> >> >
> >> > In luminous we *could* work around this by storing those canonical
> >> > weights outside of crush using something (probably?) ugly and
> >> > maintain backward compatibility with older clients using existing
> >> > CRUSH behavior.
> >>
> >> These canonical weights could be stored in crush by creating dedicated
> >> buckets. For instance the root-canonical bucket could be created to store
> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
> >> the difference and know to add a new device in the host01-canonical bucket
> >> instead of the host01 bucket. And to run an offline tool to keep the two
> >> buckets in sync and compute the weight to use for placement derived from the
> >> weights representing the device capacity.
> >>
> >> It is a little bit ugly ;-)
> >>
> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
> >> > to
> >> > include a set of derivative weights used for actual placement
> >> > calculations
> >> > instead of the canonical target weights, and we can do what you're
> >> > proposing *and* solve the multipick problem with one change in the crush
> >> > map and algorithm. (Actually choosing those derivative weights will
> >> > be an offline process that can both improve the balance for the inputs
> >> > we
> >> > care about *and* adjust them based on the position to fix the skew issue
> >> > for replicas.) This doesn't help pre-luminous clients, but I think the
> >> > end solution will be simpler and more elegant...
> >> >
> >> > What do you think?
> >> >
> >> > sage
> >> >
> >> >
> >> >> 4. Extra
> >> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
> >> >> cluster.
> >> >> It succeeded.
> >> >> A solution was not accepted, because the modification of OSD weights
> >> >> was higher than 50%, which was caused by the fact that different
> >> >> placement rules operated on different sets of OSDs, and those sets
> >> >> were not disjoint.
> >> >
> >> >
> >> >>
> >> >> Best regards,
> >> >> Adam
> >> >>
> >> >>
> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> >> >> Hi Pedro, Loic,
> >> >>
> >> >> For what it's worth, my intuition here (which has had a mixed
> >> >> record as
> >> >> far as CRUSH goes) is that this is the most promising path
> >> >> forward.
> >> >>
> >> >> Thinking ahead a few steps, and confirming that I'm following
> >> >> the
> >> >> discussion so far, if you're able to get black (or white) box
> >> >> gradient
> >> >> descent to work, then this will give us a set of weights for
> >> >> each item in
> >> >> the tree for each selection round, derived from the tree
> >> >> structure and
> >> >> original (target) weights. That would basically give us a map
> >> >> of item id
> >> >> (bucket id or leaf item id) to weight for each round. i.e.,
> >> >>
> >> >> map<int, map<int, float>> weight_by_position;  // position -> item -> weight
> >> >>
> >> >> where the 0 round would (I think?) match the target weights, and
> >> >> each
> >> >> round after that would skew low-weighted items lower to some
> >> >> degree.
> >> >> Right?
> >> >>
> >> >> The next question I have is: does this generalize from the
> >> >> single-bucket
> >> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
> >> >> like
> >> >>
> >> >> 3.1
> >> >> |_____________
> >> >> | \ \ \
> >> >> 1.0 1.0 1.0 .1
> >> >>
> >> >> it clearly works, but when we have a multi-level tree like
> >> >>
> >> >>
> >> >> 8.4
> >> >> |____________________________________
> >> >> | \ \
> >> >> 3.1 3.1 2.2
> >> >> |_____________ |_____________ |_____________
> >> >> | \ \ \ | \ \ \ | \ \ \
> >> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
> >> >>
> >> >> and the second round weights skew the small .1 leaves lower, can
> >> >> we
> >> >> continue to build the summed-weight hierarchy, such that the
> >> >> adjusted
> >> >> weights at the higher level are appropriately adjusted to give
> >> >> us the
> >> >> right probabilities of descending into those trees? I'm not
> >> >> sure if that
> >> >> logically follows from the above or if my intuition is
> >> >> oversimplifying
> >> >> things.
> >> >>
> >> >> If this *is* how we think this will shake out, then I'm
> >> >> wondering if we
> >> >> should go ahead and build this weight matrix into CRUSH sooner
> >> >> rather
> >> >> than later (i.e., for luminous). As with the explicit
> >> >> remappings, the
> >> >> hard part is all done offline, and the adjustments to the CRUSH
> >> >> mapping
> >> >> calculation itself (storing and making use of the adjusted
> >> >> weights for
> >> >> each round of placement) are relatively straightforward. And
> >> >> the sooner
> >> >> this is incorporated into a release the sooner real users will
> >> >> be able to
> >> >> roll out code to all clients and start making use of it.
> >> >>
> >> >> Thanks again for looking at this problem! I'm excited that we
> >> >> may be
> >> >> closing in on a real solution!
> >> >>
> >> >> sage
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> >> >>
> >> >> > There are lot of gradient-free methods. I will try first to
> >> >> run the
> >> >> > ones available using just scipy
> >> >> >
> >> >>
> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> >> >> > Some of them don't require the gradient and some of them can
> >> >> estimate
> >> >> > it. The reason to go without the gradient is to run the CRUSH
> >> >> > algorithm as a black box. In that case this would be the
> >> >> pseudo-code:
> >> >> >
> >> >> > - BEGIN CODE -
> >> >> > def build_target(desired_freqs):
> >> >> >     def target(weights):
> >> >> >         # run a simulation of CRUSH for a number of objects
> >> >> >         sim_freqs = run_crush(weights)
> >> >> >         # Kullback-Leibler divergence between desired frequencies
> >> >> >         # and current ones
> >> >> >         return loss(sim_freqs, desired_freqs)
> >> >> >     return target
> >> >> >
> >> >> > weights = scipy.optimize.minimize(build_target(desired_freqs))
> >> >> > - END CODE -
> >> >> >
> >> >> > The tricky thing here is that this procedure can be slow if
> >> >> the
> >> >> > simulation (run_crush) needs to place a lot of objects to get
> >> >> accurate
> >> >> > simulated frequencies. This is especially true if the minimize
> >> >> method
> >> >> > attempts to approximate the gradient using finite differences
> >> >> since it
> >> >> > will evaluate the target function a number of times
> >> >> proportional to
> >> >> > the number of weights. Apart from the ones in scipy I would
> >> >> also try
> >> >> > optimization methods that try to perform as few evaluations as
> >> >> > possible like for example HyperOpt
> >> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
> >> >> into
> >> >> > account that the target function can be noisy.
> >> >> >
> >> >> > This black box approximation is simple to implement and makes
> >> >> the
> >> >> > computer do all the work instead of us.
> >> >> > I think that this black box approximation is worth trying
> >> >> even if
> >> >> > it's not the final one because if this approximation works
> >> >> then we
> >> >> > know that a more elaborate one that computes the gradient of
> >> >> the CRUSH
> >> >> > algorithm will work for sure.
> >> >> >
> >> >> > I can try this black box approximation this weekend not on the
> >> >> real
> >> >> > CRUSH algorithm but with the simple implementation I did in
> >> >> python. If
> >> >> > it works it's just a matter of substituting one simulation
> >> >> with
> >> >> > another and see what happens.
> >> >> >
> >> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >> > > Hi Pedro,
> >> >> > >
> >> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> >> > >> Hi Loic,
> >> >> > >>
> >> >> > >> From what I see everything seems OK.
> >> >> > >
> >> >> > > Cool. I'll keep going in this direction then !
> >> >> > >
> >> >> > >> The interesting thing would be to
> >> >> > >> test on some complex mapping. The reason is that
> >> >> "CrushPolicyFamily"
> >> >> > >> is right now modeling just a single straw bucket not the
> >> >> full CRUSH
> >> >> > >> algorithm.
> >> >> > >
> >> >> > > A number of use cases use a single straw bucket, maybe the
> >> >> majority of them. Even though it does not reflect the full range
> >> >> of what crush can offer, it could be useful. To be more
> >> >> specific, a crush map that states "place objects so that there
> >> >> is at most one replica per host" or "one replica per rack" is
> >> >> common. Such a crushmap can be reduced to a single straw bucket
> >> >> that contains all the hosts and by using the CrushPolicyFamily,
> >> >> we can change the weights of each host to fix the probabilities.
> >> >> The hosts themselves contain disks with varying weights but I
> >> >> think we can ignore that because crush will only recurse to
> >> >> place one object within a given host.
> >> >> > >
> >> >> > >> That's the work that remains to be done. The only way that
> >> >> > >> would avoid reimplementing the CRUSH algorithm and
> >> >> computing the
> >> >> > >> gradient would be treating CRUSH as a black box and
> >> >> eliminating the
> >> >> > >> necessity of computing the gradient either by using a
> >> >> gradient-free
> >> >> > >> optimization method or making an estimation of the
> >> >> gradient.
> >> >> > >
> >> >> > > By gradient-free optimization you mean simulated annealing
> >> >> or Monte Carlo ?
> >> >> > >
> >> >> > > Cheers
> >> >> > >
> >> >> > >>
> >> >> > >>
> >> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >> > >>> Hi,
> >> >> > >>>
> >> >> > >>> I modified the crush library to accept two weights (one
> >> >> for the first disk, the other for the remaining disks)[1]. This
> >> >> really is a hack for experimentation purposes only ;-) I was
> >> >> able to run a variation of your code[2] and got the following
> >> >> results which are encouraging. Do you think what I did is
> >> >> sensible ? Or is there a problem I don't see ?
> >> >> > >>>
> >> >> > >>> Thanks !
> >> >> > >>>
> >> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
> >> >> > >>>
> >> >>
> >> >> ------------------------------------------------------------------------
> >> >> > >>> Before: All replicas on each hard drive
> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> > >>> disk 0: 1.39e-01 1.12e-01
> >> >> > >>> disk 1: 1.11e-01 1.10e-01
> >> >> > >>> disk 2: 8.33e-02 1.13e-01
> >> >> > >>> disk 3: 1.39e-01 1.11e-01
> >> >> > >>> disk 4: 1.11e-01 1.11e-01
> >> >> > >>> disk 5: 8.33e-02 1.11e-01
> >> >> > >>> disk 6: 1.39e-01 1.12e-01
> >> >> > >>> disk 7: 1.11e-01 1.12e-01
> >> >> > >>> disk 8: 8.33e-02 1.10e-01
> >> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
> >> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
> >> >> > >>> ...
> >> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
> >> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
> >> >> > >>> Converged to desired accuracy :)
> >> >> > >>> After: All replicas on each hard drive
> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> > >>> disk 0: 1.39e-01 1.42e-01
> >> >> > >>> disk 1: 1.11e-01 1.09e-01
> >> >> > >>> disk 2: 8.33e-02 8.37e-02
> >> >> > >>> disk 3: 1.39e-01 1.40e-01
> >> >> > >>> disk 4: 1.11e-01 1.13e-01
> >> >> > >>> disk 5: 8.33e-02 8.08e-02
> >> >> > >>> disk 6: 1.39e-01 1.38e-01
> >> >> > >>> disk 7: 1.11e-01 1.09e-01
> >> >> > >>> disk 8: 8.33e-02 8.48e-02
> >> >> > >>>
> >> >> > >>>
> >> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
> >> >> > >>>
> >> >>
> >> >> ------------------------------------------------------------------------
> >> >> > >>> Before: All replicas on each hard drive
> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> > >>> disk 0: 2.44e-01 2.36e-01
> >> >> > >>> disk 1: 2.44e-01 2.38e-01
> >> >> > >>> disk 2: 2.44e-01 2.34e-01
> >> >> > >>> disk 3: 2.44e-01 2.38e-01
> >> >> > >>> disk 4: 2.44e-02 5.37e-02
> >> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
> >> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
> >> >> > >>> ...
> >> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
> >> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
> >> >> > >>> Converged to desired accuracy :)
> >> >> > >>> After: All replicas on each hard drive
> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> > >>> disk 0: 2.44e-01 2.46e-01
> >> >> > >>> disk 1: 2.44e-01 2.44e-01
> >> >> > >>> disk 2: 2.44e-01 2.41e-01
> >> >> > >>> disk 3: 2.44e-01 2.45e-01
> >> >> > >>> disk 4: 2.44e-02 2.33e-02
> >> >> > >>>
> >> >> > >>>
> >> >> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >> >> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >> >> > >>>
> >> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >> >> > >>>> Hi Pedro,
> >> >> > >>>>
> >> >> > >>>> It looks like trying to experiment with crush won't work
> >> >> as expected because crush does not distinguish the probability
> >> >> of selecting the first device from the probability of selecting
> >> >> the second or third device. Am I mistaken ?
> >> >> > >>>>
> >> >> > >>>> Cheers
> >> >> > >>>>
> >> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >> >> > >>>>> Hi Pedro,
> >> >> > >>>>>
> >> >> > >>>>> I'm going to experiment with what you did at
> >> >> > >>>>>
> >> >> > >>>>>
> >> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >> >> > >>>>>
> >> >> > >>>>> and the latest python-crush published today. A
> >> >> comparison function was added that will help measure the data
> >> >> movement. I'm hoping we can release an offline tool based on
> >> >> your solution. Please let me know if I should wait before diving
> >> >> into this, in case you have unpublished drafts or new ideas.
> >> >> > >>>>>
> >> >> > >>>>> Cheers
> >> >> > >>>>>
> >> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >> >> > >>>>>> Great, thanks for the clarifications.
> >> >> > >>>>>> I also think that the most natural way is to keep just
> >> >> a set of
> >> >> > >>>>>> weights in the CRUSH map and update them inside the
> >> >> algorithm.
> >> >> > >>>>>>
> >> >> > >>>>>> I keep working on it.
> >> >> > >>>>>>
> >> >> > >>>>>>
> >> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> >> >> <sage@newdream.net>:
> >> >> > >>>>>>> Hi Pedro,
> >> >> > >>>>>>>
> >> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
> >> >> problem and we
> >> >> > >>>>>>> haven't made much headway.
> >> >> > >>>>>>>
> >> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >> >> > >>>>>>>> Hi,
> >> >> > >>>>>>>>
> >> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
> >> >> much but I have
> >> >> > >>>>>>>> been thinking about it. In order to adapt the
> >> >> previous algorithm in
> >> >> > >>>>>>>> the python notebook I need to substitute the
> >> >> iteration over all
> >> >> > >>>>>>>> possible devices permutations to iteration over all
> >> >> the possible
> >> >> > >>>>>>>> selections that crush would make. That is the main
> >> >> thing I need to
> >> >> > >>>>>>>> work on.
> >> >> > >>>>>>>>
> >> >> > >>>>>>>> The other thing is of course that weights change for
> >> >> each replica.
> >> >> > >>>>>>>> That is, they cannot be really fixed in the crush
> >> >> map. So the
> >> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
> >> >> the map, need to be
> >> >> > >>>>>>>> changed. The weights in the crush map should reflect
> >> >> then, maybe, the
> >> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
> >> >> should have their own
> >> >> > >>>>>>>> crush map, but then the information about the
> >> >> previous selection
> >> >> > >>>>>>>> should be passed to the next replica placement run so
> >> >> it avoids
> >> >> > >>>>>>>> selecting the same one again.
> >> >> > >>>>>>>
> >> >> > >>>>>>> My suspicion is that the best solution here (whatever
> >> >> that means!)
> >> >> > >>>>>>> leaves the CRUSH weights intact with the desired
> >> >> distribution, and
> >> >> > >>>>>>> then generates a set of derivative weights--probably
> >> >> one set for each
> >> >> > >>>>>>> round/replica/rank.
> >> >> > >>>>>>>
> >> >> > >>>>>>> One nice property of this is that once the support is
> >> >> added to encode
> >> >> > >>>>>>> multiple sets of weights, the algorithm used to
> >> >> generate them is free to
> >> >> > >>>>>>> change and evolve independently. (In most cases any
> >> >> change in
> >> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> >> >> because all
> >> >> > >>>>>>> parties participating in the cluster have to support
> >> >> any new behavior
> >> >> > >>>>>>> before it is enabled or used.)
> >> >> > >>>>>>>
> >> >> > >>>>>>>> I have a question also. Is there any significant
> >> >> difference between
> >> >> > >>>>>>>> the device selection algorithm description in the
> >> >> paper and its final
> >> >> > >>>>>>>> implementation?
> >> >> > >>>>>>>
> >> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
> >> >> found to be a bad
> >> >> > >>>>>>> idea; any collision or failed()/overload() case
> >> >> triggers the
> >> >> > >>>>>>> retry_descent.
> >> >> > >>>>>>>
> >> >> > >>>>>>> There are other changes, of course, but I don't think
> >> >> they'll impact any
> >> >> > >>>>>>> solution we come up with here (or at least any solution
> >> >> can be suitably
> >> >> > >>>>>>> adapted)!
> >> >> > >>>>>>>
> >> >> > >>>>>>> sage
> >> >> > >>>>>> --
> >> >> > >>>>>> To unsubscribe from this list: send the line
> >> >> "unsubscribe ceph-devel" in
> >> >> > >>>>>> the body of a message to majordomo@vger.kernel.org
> >> >> > >>>>>> More majordomo info at
> >> >> http://vger.kernel.org/majordomo-info.html
> >> >> > >>>>>>
> >> >> > >>>>>
> >> >> > >>>>
> >> >> > >>>
> >> >> > >>> --
> >> >> > >>> Loïc Dachary, Artisan Logiciel Libre
> >> >> > >>
> >> >> > >
> >> >> > > --
> >> >> > > Loïc Dachary, Artisan Logiciel Libre
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >
> >
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
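The second-round reweighting derived at the top of the thread — use w_i / (total_weight - w_i) when drawing the second replica — is easy to check numerically. Below is a minimal sketch: CRUSH's hashing is replaced by exact enumeration over first picks, and every name is illustrative rather than libcrush API.

```python
def adjusted_weights(weights):
    # Round-two weights from the derivation above:
    # P(pick i | i was not picked first) = w_i / (total - w_i)
    total = sum(weights)
    return [w / (total - w) for w in weights]

def second_replica_distribution(weights, round2_weights):
    # Exact marginal distribution of the second replica: round one draws
    # from `weights`; round two draws from `round2_weights`, restricted
    # to the items not already chosen in round one.
    total = sum(weights)
    n = len(weights)
    dist = [0.0] * n
    for first in range(n):
        p_first = weights[first] / total
        rest = sum(round2_weights[j] for j in range(n) if j != first)
        for i in range(n):
            if i != first:
                dist[i] += p_first * round2_weights[i] / rest
    return dist

weights = [10.0, 10.0, 10.0, 10.0, 1.0]     # the [10 10 10 10 1] example
target = [w / sum(weights) for w in weights]
naive = second_replica_distribution(weights, weights)
fixed = second_replica_distribution(weights, adjusted_weights(weights))
```

With these weights the small device's target share is 1/41 ≈ 0.0244; reusing the raw weights on round two gives it about 0.0315 of the second replicas, while the adjusted weights bring it to roughly 0.0246 — close, but as the first message says, "almost perfectly" rather than exactly.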
* Re: crush multipick anomaly
2017-03-27 13:39 ` Sage Weil
@ 2017-03-28 6:52 ` Adam Kupczyk
2017-03-28 9:49 ` Spandan Kumar Sahu
2017-03-28 13:35 ` Sage Weil
0 siblings, 2 replies; 70+ messages in thread
From: Adam Kupczyk @ 2017-03-28 6:52 UTC (permalink / raw)
To: Sage Weil, Pedro López-Adeva; +Cc: Loic Dachary, Ceph Development
"... or simply have a single global set"
No. Proof by example:
I once attempted to perfectly balance cluster X by modifying crush weights.
Pool A spanned over 352 OSDs (set A)
Pool B spanned over 176 OSDs (set B, half of A)
The result (simulated perfect balance) was that the obtained weights had
- small variance for B (5%),
- small variance for A-B (5%),
- huge variance for A (800%).
This was of course because crush had to be strongly discouraged from
picking from B when performing placement for A.
"...crush users can choose..."
For each pool there is only one vector of weights that will provide
perfect balance. (math note: there are actually multiple such vectors,
but they differ only by scale)
I cannot at the moment imagine any practical metric other than
balancing. But maybe that is just a failure of imagination.
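The per-pool weight search described in this thread ("finding ideal weights can be done by any numerical minimization method, for example NLMS") can be sketched with a toy stand-in: an exact R=2 expectation for a single straw-like bucket replaces a real CRUSH simulation, and a damped multiplicative correction replaces NLMS or a scipy minimizer. Every name below is illustrative:

```python
def two_replica_share(weights):
    # Exact expected share of PG copies per device for R=2 when both
    # picks draw from the same raw weights (i.e. the multipick skew):
    # averages P(first pick = i) and P(second pick = i).
    total = sum(weights)
    n = len(weights)
    share = []
    for i in range(n):
        p1 = weights[i] / total
        p2 = sum(weights[f] / total * weights[i] / (total - weights[f])
                 for f in range(n) if f != i)
        share.append((p1 + p2) / 2.0)
    return share

def tweak_weights(capacities, iters=500, alpha=0.5):
    # Damped multiplicative correction toward the capacity-proportional
    # target distribution; a crude stand-in for NLMS or scipy.optimize.
    total = sum(capacities)
    target = [c / total for c in capacities]
    w = list(capacities)
    for _ in range(iters):
        share = two_replica_share(w)
        w = [wi * (t / s) ** alpha for wi, t, s in zip(w, target, share)]
    return w

capacities = [10.0, 10.0, 10.0, 10.0, 1.0]
target = [c / sum(capacities) for c in capacities]
tweaked = tweak_weights(capacities)
final_share = two_replica_share(tweaked)
```

For these capacities the raw weights give the small device roughly 2.8% of the copies against a 2.44% target; after tweaking, the modelled shares match the targets to well under 0.1%. As noted above, the resulting weights are only optimal for this one pool and replica count.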
On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 27 Mar 2017, Adam Kupczyk wrote:
>> Hi,
>>
>> My understanding is that optimal tweaked weights will depend on:
>> 1) pool_id, because of rjenkins(pool_id) in crush
>> 2) number of placement groups and replication factor, as it determines
>> the number of samples
>>
>> Therefore tweaked weights should rather be a property of an instantiated pool,
>> not crush placement definition.
>>
>> If tweaked weights are to be part of the crush definition, then for each
>> created pool we need to have separate list of weights.
>> Is it possible to provide clients with different weights depending on
>> which pool they want to operate?
>
> As Loic suggested, you can create as many derivative hierarchies in the
> crush map as you like, potentially one per pool. Or you could treat the
> sum total of all pgs as the interesting set, balance those, and get some
> OSDs doing a bit more of one pool than another. The new post-CRUSH OSD
> remap capability can always clean this up (and turn a "good" crush
> distribution into a perfect distribution).
>
> I guess the question is: when we add the explicit adjusted weight matrix
> to crush should we have multiple sets of weights (perhaps one for each
> pool), or simply have a single global set. It might make sense to allow N
> sets of adjusted weights so that the crush users can choose a particular
> set of them for different pools (or whatever it is they're calculating the
> mapping for).
>
> sage
>
>
>>
>> Best regards,
>> Adam
>>
>> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>> > Hi,
>> >
>> > My understanding is that optimal tweaked weights will depend on:
>> > 1) pool_id, because of rjenkins(pool_id) in crush
>> > 2) number of placement groups and replication factor, as it determines
>> > the number of samples
>> >
>> > Therefore tweaked weights should rather be a property of an instantiated pool,
>> > not crush placement definition.
>> >
>> > If tweaked weights are to be part of the crush definition, then for each created
>> > pool we need to have separate list of weights.
>> > Is it possible to provide clients with different weights depending on
>> > which pool they want to operate?
>> >
>> > Best regards,
>> > Adam
>> >
>> >
>> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>> >>
>> >>
>> >>
>> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
>> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>> >> >> Hello Sage, Loic, Pedro,
>> >> >>
>> >> >>
>> >> >> I am certain that almost perfect mapping can be achieved by
>> >> >> substituting weights from crush map with slightly modified weights.
>> >> >> By perfect mapping I mean we get on each OSD a number of PGs exactly
>> >> >> proportional to the weights specified in the crush map.
>> >> >>
>> >> >> 1. Example
>> >> >> Lets think of PGs of single object pool.
>> >> >> We have OSDs with following weights:
>> >> >> [10, 10, 10, 5, 5]
>> >> >>
>> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
>> >> >> PGcopies :
>> >> >> [150, 150, 150, 75, 75]
>> >> >>
>> >> >> However, because crush simulates random process we have:
>> >> >> [143, 152, 158, 71, 76]
>> >> >>
>> >> >> We could have obtained perfect distribution had we used weights like
>> >> >> this:
>> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>> >> >>
>> >> >>
>> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>> >> >>
>> >> >> When we apply crush for the first time, the distribution of PGs appears
>> >> >> random.
>> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>> >> >>
>> >> >> But CRUSH is not a random process at all; it behaves in a numerically
>> >> >> stable way.
>> >> >> Specifically, if we increase weight on one node, we will get more PGs
>> >> >> on
>> >> >> this node and less on every other node:
>> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>> >> >>
>> >> >> Now, finding ideal weights can be done by any numerical minimization
>> >> >> method,
>> >> >> for example NLMS.
>> >> >>
>> >> >>
>> >> >> 3. The proposal
>> >> >> For each pool, perfect weights will be derived from the initial weights
>> >> >> given in the crush map.
>> >> >> These weights will be used to calculate the PG distribution. This of
>> >> >> course will be close to perfect.
>> >> >>
>> >> >> 3a: Downside when OSD is out
>> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>> >> >> Because now weights deviate from OSD capacity, some OSDs will
>> >> >> statistically
>> >> >> get more copies than they should.
>> >> >> This unevenness in distribution is proportional to the deviation of the
>> >> >> calculated weights from the capacity weights.
>> >> >>
>> >> >> 3b: Upside
>> >> >> This all can be achieved without changes to crush.
>> >> >
>> >> > Yes!
>> >> >
>> >> > And no. You're totally right--we should use an offline optimization to
>> >> > tweak the crush input weights to get a better balance. It won't be
>> >> > robust
>> >> > to changes to the cluster, but we can incrementally optimize after that
>> >> > happens to converge on something better.
>> >> >
>> >> > The problem with doing this with current versions of Ceph is that we
>> >> > lose
>> >> > the original "input" or "target" weights (i.e., the actual size of
>> >> > the OSD) that we want to converge on. This is one reason why we haven't
>> >> > done something like this before.
>> >> >
>> >> > In luminous we *could* work around this by storing those canonical
>> >> > weights outside of crush using something (probably?) ugly and
>> >> > maintain backward compatibility with older clients using existing
>> >> > CRUSH behavior.
>> >>
>> >> These canonical weights could be stored in crush by creating dedicated
>> >> buckets. For instance the root-canonical bucket could be created to store
>> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
>> >> the difference and know to add a new device in the host01-canonical bucket
>> >> instead of the host01 bucket. And to run an offline tool to keep the two
>> >> buckets in sync and compute the weight to use for placement derived from the
>> >> weights representing the device capacity.
>> >>
>> >> It is a little bit ugly ;-)
>> >>
>> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
>> >> > to
>> >> > include a set of derivative weights used for actual placement
>> >> > calculations
>> >> > instead of the canonical target weights, and we can do what you're
>> >> > proposing *and* solve the multipick problem with one change in the crush
>> >> > map and algorithm. (Actually choosing those derivative weights will
>> >> > be an offline process that can both improve the balance for the inputs
>> >> > we
>> >> > care about *and* adjust them based on the position to fix the skew issue
>> >> > for replicas.) This doesn't help pre-luminous clients, but I think the
>> >> > end solution will be simpler and more elegant...
>> >> >
>> >> > What do you think?
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >> 4. Extra
>> >> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
>> >> >> cluster.
>> >> >> It succeeded.
>> >> >> A solution was not accepted, because the modifications of OSD weights were
>> >> >> higher than 50%, which was caused by the fact that different placement
>> >> >> rules operated on different sets of OSDs, and those sets were not disjoint.
>> >> >
>> >> >
>> >> >>
>> >> >> Best regards,
>> >> >> Adam
>> >> >>
>> >> >>
>> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>> >> >> Hi Pedro, Loic,
>> >> >>
>> >> >> For what it's worth, my intuition here (which has had a mixed
>> >> >> record as
>> >> >> far as CRUSH goes) is that this is the most promising path
>> >> >> forward.
>> >> >>
>> >> >> Thinking ahead a few steps, and confirming that I'm following
>> >> >> the
>> >> >> discussion so far, if you're able to do get black (or white) box
>> >> >> gradient
>> >> >> descent to work, then this will give us a set of weights for
>> >> >> each item in
>> >> >> the tree for each selection round, derived from the tree
>> >> >> structure and
>> >> >> original (target) weights. That would basically give us a map
>> >> >> of item id
>> >> >> (bucket id or leaf item id) to weight for each round. i.e.,
>> >> >>
>> >> >> map<int, map<int, float>> weight_by_position; // position ->
>> >> >> item -> weight
>> >> >>
>> >> >> where the 0 round would (I think?) match the target weights, and
>> >> >> each
>> >> >> round after that would skew low-weighted items lower to some
>> >> >> degree.
>> >> >> Right?
>> >> >>
>> >> >> The next question I have is: does this generalize from the
>> >> >> single-bucket
>> >> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
>> >> >> like
>> >> >>
>> >> >> 3.1
>> >> >> |_____________
>> >> >> | \ \ \
>> >> >> 1.0 1.0 1.0 .1
>> >> >>
>> >> >> it clearly works, but when we have a multi-level tree like
>> >> >>
>> >> >>
>> >> >> 8.4
>> >> >> |____________________________________
>> >> >> | \ \
>> >> >> 3.1 3.1 2.2
>> >> >> |_____________ |_____________ |_____________
>> >> >> | \ \ \ | \ \ \ | \ \ \
>> >> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>> >> >>
>> >> >> and the second round weights skew the small .1 leaves lower, can
>> >> >> we
>> >> >> continue to build the summed-weight hierarchy, such that the
>> >> >> adjusted
>> >> >> weights at the higher level are appropriately adjusted to give
>> >> >> us the
>> >> >> right probabilities of descending into those trees? I'm not
>> >> >> sure if that
>> >> >> logically follows from the above or if my intuition is
>> >> >> oversimplifying
>> >> >> things.
>> >> >>
>> >> >> If this *is* how we think this will shake out, then I'm
>> >> >> wondering if we
>> >> >> should go ahead and build this weight matrix into CRUSH sooner
>> >> >> rather
>> >> >> than later (i.e., for luminous). As with the explicit
>> >> >> remappings, the
>> >> >> hard part is all done offline, and the adjustments to the CRUSH
>> >> >> mapping
>> >> >> calculation itself (storing and making use of the adjusted
>> >> >> weights for
>> >> >> each round of placement) are relatively straightforward. And
>> >> >> the sooner
>> >> >> this is incorporated into a release the sooner real users will
>> >> >> be able to
>> >> >> roll out code to all clients and start making use of it.
>> >> >>
>> >> >> Thanks again for looking at this problem! I'm excited that we
>> >> >> may be
>> >> >> closing in on a real solution!
>> >> >>
>> >> >> sage
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>> >> >>
>> >> >> > There are a lot of gradient-free methods. I will try first to
>> >> >> run the
>> >> >> > ones available using just scipy
>> >> >> >
>> >> >>
>> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>> >> >> > Some of them don't require the gradient and some of them can
>> >> >> estimate
>> >> >> > it. The reason to go without the gradient is to run the CRUSH
>> >> >> > algorithm as a black box. In that case this would be the
>> >> >> pseudo-code:
>> >> >> >
>> >> >> > - BEGIN CODE -
>> >> >> > def build_target(desired_freqs):
>> >> >> >     def target(weights):
>> >> >> >         # run a simulation of CRUSH for a number of objects
>> >> >> >         sim_freqs = run_crush(weights)
>> >> >> >         # Kullback-Leibler divergence between desired frequencies and current ones
>> >> >> >         return loss(sim_freqs, desired_freqs)
>> >> >> >     return target
>> >> >> >
>> >> >> > weights = scipy.optimize.minimize(build_target(desired_freqs))
>> >> >> > - END CODE -
>> >> >> >
>> >> >> > The tricky thing here is that this procedure can be slow if
>> >> >> the
>> >> >> > simulation (run_crush) needs to place a lot of objects to get
>> >> >> accurate
>> >> >> > simulated frequencies. This is especially true if the minimize
>> >> >> method
>> >> >> > attempts to approximate the gradient using finite differences
>> >> >> since it
>> >> >> > will evaluate the target function a number of times
>> >> >> proportional to
>> >> >> > the number of weights. Apart from the ones in scipy I would
>> >> >> also try
>> >> >> > optimization methods that try to perform as few evaluations as
>> >> >> > possible like for example HyperOpt
>> >> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
>> >> >> into
>> >> >> > account that the target function can be noisy.
>> >> >> >
>> >> >> > This black box approximation is simple to implement and makes
>> >> >> the
>> >> >> > computer do all the work instead of us.
>> >> >> > I think that this black box approximation is worth trying
>> >> >> even if
>> >> >> > it's not the final one because if this approximation works
>> >> >> then we
>> >> >> > know that a more elaborate one that computes the gradient of
>> >> >> the CRUSH
>> >> >> > algorithm will work for sure.
>> >> >> >
>> >> >> > I can try this black box approximation this weekend not on the
>> >> >> real
>> >> >> > CRUSH algorithm but with the simple implementation I did in
>> >> >> python. If
>> >> >> > it works it's just a matter of substituting one simulation
>> >> >> with
>> >> >> > another and see what happens.
>> >> >> >
>> >> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >> >> > > Hi Pedro,
>> >> >> > >
>> >> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>> >> >> > >> Hi Loic,
>> >> >> > >>
>> >> >> > >> From what I see everything seems OK.
>> >> >> > >
>> >> >> > > Cool. I'll keep going in this direction then !
>> >> >> > >
>> >> >> > >> The interesting thing would be to
>> >> >> > >> test on some complex mapping. The reason is that
>> >> >> "CrushPolicyFamily"
>> >> >> > >> is right now modeling just a single straw bucket not the
>> >> >> full CRUSH
>> >> >> > >> algorithm.
>> >> >> > >
>> >> >> > > A number of use cases use a single straw bucket, maybe the
>> >> >> majority of them. Even though it does not reflect the full range
>> >> >> of what crush can offer, it could be useful. To be more
>> >> >> specific, a crush map that states "place objects so that there
>> >> >> is at most one replica per host" or "one replica per rack" is
>> >> >> common. Such a crushmap can be reduced to a single straw bucket
>> >> >> that contains all the hosts and by using the CrushPolicyFamily,
>> >> >> we can change the weights of each host to fix the probabilities.
>> >> >> The hosts themselves contain disks with varying weights but I
>> >> >> think we can ignore that because crush will only recurse to
>> >> >> place one object within a given host.
>> >> >> > >
>> >> >> > >> That's the work that remains to be done. The only way that
>> >> >> > >> would avoid reimplementing the CRUSH algorithm and
>> >> >> computing the
>> >> >> > >> gradient would be treating CRUSH as a black box and
>> >> >> eliminating the
>> >> >> > >> necessity of computing the gradient either by using a
>> >> >> gradient-free
>> >> >> > >> optimization method or making an estimation of the
>> >> >> gradient.
>> >> >> > >
>> >> >> > > By gradient-free optimization you mean simulated annealing
>> >> >> or Monte Carlo ?
>> >> >> > >
>> >> >> > > Cheers
>> >> >> > >
>> >> >> > >>
>> >> >> > >>
>> >> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >> >> > >>> Hi,
>> >> >> > >>>
>> >> >> > >>> I modified the crush library to accept two weights (one
>> >> >> for the first disk, the other for the remaining disks)[1]. This
>> >> >> really is a hack for experimentation purposes only ;-) I was
>> >> >> able to run a variation of your code[2] and got the following
>> >> >> results which are encouraging. Do you think what I did is
>> >> >> sensible ? Or is there a problem I don't see ?
>> >> >> > >>>
>> >> >> > >>> Thanks !
>> >> >> > >>>
>> >> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>> >> >> > >>>
>> >> >>
>> >> >> ------------------------------------------------------------------------
>> >> >> > >>> Before: All replicas on each hard drive
>> >> >> > >>> Expected vs actual use (20000 samples)
>> >> >> > >>> disk 0: 1.39e-01 1.12e-01
>> >> >> > >>> disk 1: 1.11e-01 1.10e-01
>> >> >> > >>> disk 2: 8.33e-02 1.13e-01
>> >> >> > >>> disk 3: 1.39e-01 1.11e-01
>> >> >> > >>> disk 4: 1.11e-01 1.11e-01
>> >> >> > >>> disk 5: 8.33e-02 1.11e-01
>> >> >> > >>> disk 6: 1.39e-01 1.12e-01
>> >> >> > >>> disk 7: 1.11e-01 1.12e-01
>> >> >> > >>> disk 8: 8.33e-02 1.10e-01
>> >> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>> >> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>> >> >> > >>> ...
>> >> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>> >> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>> >> >> > >>> Converged to desired accuracy :)
>> >> >> > >>> After: All replicas on each hard drive
>> >> >> > >>> Expected vs actual use (20000 samples)
>> >> >> > >>> disk 0: 1.39e-01 1.42e-01
>> >> >> > >>> disk 1: 1.11e-01 1.09e-01
>> >> >> > >>> disk 2: 8.33e-02 8.37e-02
>> >> >> > >>> disk 3: 1.39e-01 1.40e-01
>> >> >> > >>> disk 4: 1.11e-01 1.13e-01
>> >> >> > >>> disk 5: 8.33e-02 8.08e-02
>> >> >> > >>> disk 6: 1.39e-01 1.38e-01
>> >> >> > >>> disk 7: 1.11e-01 1.09e-01
>> >> >> > >>> disk 8: 8.33e-02 8.48e-02
>> >> >> > >>>
>> >> >> > >>>
>> >> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>> >> >> > >>>
>> >> >>
>> >> >> ------------------------------------------------------------------------
>> >> >> > >>> Before: All replicas on each hard drive
>> >> >> > >>> Expected vs actual use (20000 samples)
>> >> >> > >>> disk 0: 2.44e-01 2.36e-01
>> >> >> > >>> disk 1: 2.44e-01 2.38e-01
>> >> >> > >>> disk 2: 2.44e-01 2.34e-01
>> >> >> > >>> disk 3: 2.44e-01 2.38e-01
>> >> >> > >>> disk 4: 2.44e-02 5.37e-02
>> >> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>> >> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>> >> >> > >>> ...
>> >> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>> >> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>> >> >> > >>> Converged to desired accuracy :)
>> >> >> > >>> After: All replicas on each hard drive
>> >> >> > >>> Expected vs actual use (20000 samples)
>> >> >> > >>> disk 0: 2.44e-01 2.46e-01
>> >> >> > >>> disk 1: 2.44e-01 2.44e-01
>> >> >> > >>> disk 2: 2.44e-01 2.41e-01
>> >> >> > >>> disk 3: 2.44e-01 2.45e-01
>> >> >> > >>> disk 4: 2.44e-02 2.33e-02
>> >> >> > >>>
>> >> >> > >>>
>> >> >> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>> >> >> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>> >> >> > >>>
>> >> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> >> >> > >>>> Hi Pedro,
>> >> >> > >>>>
>> >> >> > >>>> It looks like trying to experiment with crush won't work
>> >> >> as expected because crush does not distinguish the probability
>> >> >> of selecting the first device from the probability of selecting
>> >> >> the second or third device. Am I mistaken ?
>> >> >> > >>>>
>> >> >> > >>>> Cheers
>> >> >> > >>>>
>> >> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> >> >> > >>>>> Hi Pedro,
>> >> >> > >>>>>
>> >> >> > >>>>> I'm going to experiment with what you did at
>> >> >> > >>>>>
>> >> >> > >>>>>
>> >> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> >> >> > >>>>>
>> >> >> > >>>>> and the latest python-crush published today. A
>> >> >> comparison function was added that will help measure the data
>> >> >> movement. I'm hoping we can release an offline tool based on
>> >> >> your solution. Please let me know if I should wait before diving
>> >> >> into this, in case you have unpublished drafts or new ideas.
>> >> >> > >>>>>
>> >> >> > >>>>> Cheers
>> >> >> > >>>>>
>> >> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> >> >> > >>>>>> Great, thanks for the clarifications.
>> >> >> > >>>>>> I also think that the most natural way is to keep just
>> >> >> a set of
>> >> >> > >>>>>> weights in the CRUSH map and update them inside the
>> >> >> algorithm.
>> >> >> > >>>>>>
>> >> >> > >>>>>> I keep working on it.
>> >> >> > >>>>>>
>> >> >> > >>>>>>
>> >> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>> >> >> <sage@newdream.net>:
>> >> >> > >>>>>>> Hi Pedro,
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
>> >> >> problem and we
>> >> >> > >>>>>>> haven't made much headway.
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> >> >> > >>>>>>>> Hi,
>> >> >> > >>>>>>>>
>> >> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
>> >> >> much but I have
>> >> >> > >>>>>>>> been thinking about it. In order to adapt the
>> >> >> previous algorithm in
>> >> >> > >>>>>>>> the python notebook I need to replace the iteration over all
>> >> >> > >>>>>>>> possible device permutations with iteration over all the possible
>> >> >> > >>>>>>>> selections that crush would make. That is the main
>> >> >> thing I need to
>> >> >> > >>>>>>>> work on.
>> >> >> > >>>>>>>>
>> >> >> > >>>>>>>> The other thing is of course that weights change for
>> >> >> each replica.
>> >> >> > >>>>>>>> That is, they cannot be really fixed in the crush
>> >> >> map. So the
>> >> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
>> >> >> the map, need to be
>> >> >> > >>>>>>>> changed. The weights in the crush map should reflect
>> >> >> then, maybe, the
>> >> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
>> >> >> should have their own
>> >> >> > >>>>>>>> crush map, but then the information about the
>> >> >> previous selection
>> >> >> > >>>>>>>> should be passed to the next replica placement run so
>> >> >> it avoids
>> >> >> > >>>>>>>> selecting the same one again.
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> My suspicion is that the best solution here (whatever
>> >> >> that means!)
>> >> >> > >>>>>>> leaves the CRUSH weights intact with the desired
>> >> >> distribution, and
>> >> >> > >>>>>>> then generates a set of derivative weights--probably
>> >> >> one set for each
>> >> >> > >>>>>>> round/replica/rank.
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> One nice property of this is that once the support is
>> >> >> added to encode
>> >> >> > >>>>>>> multiple sets of weights, the algorithm used to
>> >> >> generate them is free to
>> >> >> > >>>>>>> change and evolve independently. (In most cases any
>> >> >> change in
>> >> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>> >> >> because all
>> >> >> > >>>>>>> parties participating in the cluster have to support
>> >> >> any new behavior
>> >> >> > >>>>>>> before it is enabled or used.)
>> >> >> > >>>>>>>
>> >> >> > >>>>>>>> I have a question also. Is there any significant
>> >> >> difference between
>> >> >> > >>>>>>>> the device selection algorithm description in the
>> >> >> paper and its final
>> >> >> > >>>>>>>> implementation?
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
>> >> >> found to be a bad
>> >> >> > >>>>>>> idea; any collision or failed()/overload() case
>> >> >> triggers the
>> >> >> > >>>>>>> retry_descent.
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> There are other changes, of course, but I don't think
>> >> >> they'll impact any
>> >> >> solution we come up with here (or at least any solution
>> >> >> can be suitably
>> >> >> > >>>>>>> adapted)!
>> >> >> > >>>>>>>
>> >> >> > >>>>>>> sage
>> >> >> > >>>>>> --
>> >> >> > >>>>>> To unsubscribe from this list: send the line
>> >> >> "unsubscribe ceph-devel" in
>> >> >> > >>>>>> the body of a message to majordomo@vger.kernel.org
>> >> >> > >>>>>> More majordomo info at
>> >> >> http://vger.kernel.org/majordomo-info.html
>> >> >> > >>>>>>
>> >> >> > >>>>>
>> >> >> > >>>>
>> >> >> > >>>
>> >> >> > >>> --
>> >> >> > >>> Loïc Dachary, Artisan Logiciel Libre
>> >> >> > >
>> >> >> > > --
>> >> >> > > Loïc Dachary, Artisan Logiciel Libre
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >> --
>> >> Loïc Dachary, Artisan Logiciel Libre
>> >
>> >
* Re: crush multipick anomaly
2017-03-28 6:52 ` Adam Kupczyk
@ 2017-03-28 9:49 ` Spandan Kumar Sahu
2017-03-28 13:35 ` Sage Weil
1 sibling, 0 replies; 70+ messages in thread
From: Spandan Kumar Sahu @ 2017-03-28 9:49 UTC (permalink / raw)
To: Adam Kupczyk
Cc: Sage Weil, Pedro López-Adeva, Loic Dachary, Ceph Development
Hi
I have a somewhat different reweighting algorithm for multi-pick. It was
a bit long, so rather than putting it in the mail thread I have
uploaded it to a GitHub repo [1].
I have explained it with examples, along with the reasons I think it can
solve the problem. I would really appreciate it if someone could go
through it and say whether it is viable.
[1] : https://github.com/SpandanKumarSahu/Ceph_Proposal
On Tue, Mar 28, 2017 at 12:22 PM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> "... or simply have a single global set"
>
> No. Proof by example:
>
> I once attempted to perfectly balance cluster X by modifying crush weights.
> Pool A spanned over 352 OSDs (set A)
> Pool B spanned over 176 OSDs (set B, half of A)
> The result (simulated perfect balance) was that obtained weights had
> - small variance for B (5%),
> - small variance for A-B (5%).
> - huge variance for A (800%)
> This was of course because crush had to be strongly discouraged to
> pick from B, when performing placement for A.
>
> "...crush users can choose..."
> For each pool there is only one vector of weights that will provide
> perfect balance. (math note: actually multiple of them, but different
> by scale)
> I cannot at the moment imagine any practical metric other than
> balancing. But maybe that is just a failure of imagination.
>
> On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
>> On Mon, 27 Mar 2017, Adam Kupczyk wrote:
>>> Hi,
>>>
>>> My understanding is that optimal tweaked weights will depend on:
>>> 1) pool_id, because of rjenkins(pool_id) in crush
>>> 2) number of placement groups and replication factor, as it determines
>>> amount of samples
>>>
>>> Therefore tweaked weights should rather be a property of the instantiated
>>> pool, not of the crush placement definition.
>>>
>>> If tweaked weights are to be part of the crush definition, then for each
>>> created pool we need to have a separate list of weights.
>>> Is it possible to provide clients with different weights depending on
>>> which pool they want to operate?
>>
>> As Loic suggested, you can create as many derivative hierarchies in the
>> crush map as you like, potentially one per pool. Or you could treat the
>> sum total of all pgs as the interesting set, balance those, and get some
>> OSDs doing a bit more of one pool than another. The new post-CRUSH OSD
>> remap capability can always clean this up (and turn a "good" crush
>> distribution into a perfect distribution).
>>
>> I guess the question is: when we add the explicit adjusted weight matrix
>> to crush should we have multiple sets of weights (perhaps one for each
>> pool), or simply have a single global set. It might make sense to allow N
>> sets of adjusted weights so that the crush users can choose a particular
>> set of them for different pools (or whatever it is they're calculating the
>> mapping for)..
>>
>> sage
>>
>>
>>>
>>> Best regards,
>>> Adam
>>>
>>> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>>> > Hi,
>>> >
>>> > My understanding is that optimal tweaked weights will depend on:
>>> > 1) pool_id, because of rjenkins(pool_id) in crush
>>> > 2) number of placement groups and replication factor, as it determines
>>> > amount of samples
>>> >
>>> > Therefore tweaked weights should rather be a property of the instantiated
>>> > pool, not of the crush placement definition.
>>> >
>>> > If tweaked weights are to be part of the crush definition, then for each
>>> > created pool we need to have a separate list of weights.
>>> > Is it possible to provide clients with different weights depending on
>>> > which pool they want to operate?
>>> >
>>> > Best regards,
>>> > Adam
>>> >
>>> >
>>> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>> >> >> Hello Sage, Loic, Pedro,
>>> >> >>
>>> >> >>
>>> >> >> I am certain that almost perfect mapping can be achieved by
>>> >> >> substituting weights from crush map with slightly modified weights.
>>> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly
>>> >> >> proportional to weights specified in crush map.
>>> >> >>
>>> >> >> 1. Example
>>> >> >> Lets think of PGs of single object pool.
>>> >> >> We have OSDs with following weights:
>>> >> >> [10, 10, 10, 5, 5]
>>> >> >>
>>> >> >> Ideally, we would like the following distribution of 200 PGs x 3 copies = 600
>>> >> >> PGcopies :
>>> >> >> [150, 150, 150, 75, 75]
>>> >> >>
>>> >> >> However, because crush simulates random process we have:
>>> >> >> [143, 152, 158, 71, 76]
>>> >> >>
>>> >> >> We could have obtained perfect distribution had we used weights like
>>> >> >> this:
>>> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>>> >> >>
>>> >> >>
>>> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>>> >> >>
>>> >> >> When we apply crush for the first time, distribution of PGs comes as
>>> >> >> random.
>>> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>> >> >>
>>> >> >> But CRUSH is not a random process at all; it behaves in a numerically
>>> >> >> stable way.
>>> >> >> Specifically, if we increase weight on one node, we will get more PGs
>>> >> >> on
>>> >> >> this node and less on every other node:
>>> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>> >> >>
>>> >> >> Now, finding ideal weights can be done by any numerical minimization
>>> >> >> method,
>>> >> >> for example NLMS.
>>> >> >>
>>> >> >>
>>> >> >> 3. The proposal
>>> >> >> For each pool, from initial weights given in crush map perfect weights
>>> >> >> will
>>> >> >> be derived.
>>> >> >> These weights will be used to calculate the PG distribution. This of course
>>> >> >> will
>>> >> >> be close to perfect.
>>> >> >>
>>> >> >> 3a: Downside when OSD is out
>>> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>>> >> >> Because now weights deviate from OSD capacity, some OSDs will
>>> >> >> statistically
>>> >> >> get more copies than they should.
>>> >> >> This unevenness in the distribution is proportional to the deviation
>>> >> >> of the calculated weights from the capacity weights.
>>> >> >>
>>> >> >> 3b: Upside
>>> >> >> This all can be achieved without changes to crush.
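Adam's proposal amounts to a small fixed-point iteration: simulate placement, compare per-OSD PG counts with the ideal counts, and nudge the weights. A minimal sketch, assuming a toy weighted-sampling stand-in for CRUSH (`run_crush` here is illustrative only, not the real libcrush, and the update rule and rate are arbitrary choices):

```python
import random

def run_crush(weights, pgs=200, replicas=3, seed=42):
    # Toy stand-in for CRUSH: for each PG, draw `replicas` distinct OSDs
    # by weighted sampling without replacement, and count placements.
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(pgs):
        w = list(weights)
        for _ in range(replicas):
            total = sum(w)
            r = rng.uniform(0.0, total)
            acc = 0.0
            for i, wi in enumerate(w):
                if wi <= 0.0:
                    continue
                acc += wi
                if r <= acc:
                    counts[i] += 1
                    w[i] = 0.0  # no two replicas on one OSD
                    break
    return counts

def optimize(target_weights, pgs=200, replicas=3, iterations=50, rate=0.01):
    # Fixed-point iteration in the spirit of the proposal: simulate,
    # compare per-OSD PG counts with the ideal counts, nudge the weights.
    total = float(sum(target_weights))
    expected = [w / total * pgs * replicas for w in target_weights]
    weights = list(target_weights)
    for _ in range(iterations):
        counts = run_crush(weights, pgs, replicas)
        weights = [w * (1.0 + rate * (e - c) / e)
                   for w, e, c in zip(weights, expected, counts)]
    return weights
```

A real version would of course call libcrush (or python-crush) instead of the toy sampler, and could use a proper minimizer such as NLMS as suggested above.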
>>> >> >
>>> >> > Yes!
>>> >> >
>>> >> > And no. You're totally right--we should use an offline optimization to
>>> >> > tweak the crush input weights to get a better balance. It won't be
>>> >> > robust
>>> >> > to changes to the cluster, but we can incrementally optimize after that
>>> >> > happens to converge on something better.
>>> >> >
>>> >> > The problem with doing this with current versions of Ceph is that we
>>> >> > lose
>>> >> > the original "input" or "target" weights (i.e., the actual size of
>>> >> > the OSD) that we want to converge on. This is one reason why we haven't
>>> >> > done something like this before.
>>> >> >
>>> >> > In luminous we *could* work around this by storing those canonical
>>> >> > weights outside of crush using something (probably?) ugly and
>>> >> >> > maintaining backward compatibility with older clients using existing
>>> >> > CRUSH behavior.
>>> >>
>>> >> These canonical weights could be stored in crush by creating dedicated
>>> >> buckets. For instance the root-canonical bucket could be created to store
>>> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> >> the difference and know to add a new device in the host01-canonical bucket
>>> >> instead of the host01 bucket. And to run an offline tool to keep the two
>>> >> buckets in sync and compute the weight to use for placement derived from the
>>> >> weights representing the device capacity.
>>> >>
>>> >> It is a little bit ugly ;-)
>>> >>
>>> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>>> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
>>> >> > to
>>> >> > include a set of derivative weights used for actual placement
>>> >> > calculations
>>> >> > instead of the canonical target weights, and we can do what you're
>>> >> > proposing *and* solve the multipick problem with one change in the crush
>>> >> > map and algorithm. (Actually choosing those derivative weights will
>>> >> > be an offline process that can both improve the balance for the inputs
>>> >> > we
>>> >> > care about *and* adjust them based on the position to fix the skew issue
>>> >> > for replicas.) This doesn't help pre-luminous clients, but I think the
>>> >> > end solution will be simpler and more elegant...
>>> >> >
>>> >> > What do you think?
>>> >> >
>>> >> > sage
>>> >> >
>>> >> >
>>> >> >> 4. Extra
>>> >> >> Some time ago I made such change to perfectly balance Thomson-Reuters
>>> >> >> cluster.
>>> >> >> It succeeded.
>>> >> >> The solution was not accepted, because the modifications to the OSD
>>> >> >> weights were higher than 50%, which was caused by the fact that
>>> >> >> different placement rules operated on different sets of OSDs, and
>>> >> >> those sets were not disjoint.
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> Best regards,
>>> >> >> Adam
>>> >> >>
>>> >> >>
>>> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>> >> >> Hi Pedro, Loic,
>>> >> >>
>>> >> >> For what it's worth, my intuition here (which has had a mixed
>>> >> >> record as
>>> >> >> far as CRUSH goes) is that this is the most promising path
>>> >> >> forward.
>>> >> >>
>>> >> >> Thinking ahead a few steps, and confirming that I'm following
>>> >> >> the
>>> >> >> discussion so far, if you're able to do get black (or white) box
>>> >> >> gradient
>>> >> >> descent to work, then this will give us a set of weights for
>>> >> >> each item in
>>> >> >> the tree for each selection round, derived from the tree
>>> >> >> structure and
>>> >> >> original (target) weights. That would basically give us a map
>>> >> >> of item id
>>> >> >> (bucket id or leaf item id) to weight for each round. i.e.,
>>> >> >>
>>> >> >> map<int, map<int, float>> weight_by_position; // position ->
>>> >> >> item -> weight
>>> >> >>
>>> >> >> where the 0 round would (I think?) match the target weights, and
>>> >> >> each
>>> >> >> round after that would skew low-weighted items lower to some
>>> >> >> degree.
>>> >> >> Right?
>>> >> >>
>>> >> >> The next question I have is: does this generalize from the
>>> >> >> single-bucket
>>> >> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
>>> >> >> like
>>> >> >>
>>> >> >> 3.1
>>> >> >> |_____________
>>> >> >> | \ \ \
>>> >> >> 1.0 1.0 1.0 .1
>>> >> >>
>>> >> >> it clearly works, but when we have a multi-level tree like
>>> >> >>
>>> >> >>
>>> >> >> 8.4
>>> >> >> |____________________________________
>>> >> >> | \ \
>>> >> >> 3.1 3.1 2.2
>>> >> >> |_____________ |_____________ |_____________
>>> >> >> | \ \ \ | \ \ \ | \ \ \
>>> >> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
>>> >> >>
>>> >> >> and the second round weights skew the small .1 leaves lower, can
>>> >> >> we
>>> >> >> continue to build the summed-weight hierarchy, such that the
>>> >> >> adjusted
>>> >> >> weights at the higher level are appropriately adjusted to give
>>> >> >> us the
>>> >> >> right probabilities of descending into those trees? I'm not
>>> >> >> sure if that
>>> >> >> logically follows from the above or if my intuition is
>>> >> >> oversimplifying
>>> >> >> things.
>>> >> >>
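One way to read the question above: apply the per-round adjustment within each bucket, then rebuild each bucket's weight as the sum of its adjusted children before adjusting at the parent level. A sketch under that assumption (whether this actually yields the right overall descent probabilities is exactly the open question):

```python
def adjust(weights):
    # The second-round rule from the start of the thread: w / (total - w).
    total = sum(weights)
    return [w / (total - w) for w in weights]

# Leaf weights per bucket, matching the multi-level tree above.
buckets = [[1.0, 1.0, 1.0, 0.1],
           [1.0, 1.0, 1.0, 0.1],
           [1.0, 1.0, 0.1, 0.1]]

# Adjust leaves within each bucket, then rebuild the bucket weights as
# sums of adjusted leaves and adjust that level the same way.
adjusted_leaves = [adjust(b) for b in buckets]
bucket_weights = [sum(a) for a in adjusted_leaves]
adjusted_buckets = adjust(bucket_weights)
```

Running this shows the naive summation skews the relative bucket weights, which suggests the hierarchy case needs more care than simply recursing with the same rule.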
>>> >> >> If this *is* how we think this will shake out, then I'm
>>> >> >> wondering if we
>>> >> >> should go ahead and build this weight matrix into CRUSH sooner
>>> >> >> rather
>>> >> >> than later (i.e., for luminous). As with the explicit
>>> >> >> remappings, the
>>> >> >> hard part is all done offline, and the adjustments to the CRUSH
>>> >> >> mapping
>>> >> >> calculation itself (storing and making use of the adjusted
>>> >> >> weights for
>>> >> >> each round of placement) are relatively straightforward. And
>>> >> >> the sooner
>>> >> >> this is incorporated into a release the sooner real users will
>>> >> >> be able to
>>> >> >> roll out code to all clients and start making use of it.
>>> >> >>
>>> >> >> Thanks again for looking at this problem! I'm excited that we
>>> >> >> may be
>>> >> >> closing in on a real solution!
>>> >> >>
>>> >> >> sage
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>> >> >>
>>> >> >> > There are lot of gradient-free methods. I will try first to
>>> >> >> run the
>>> >> >> > ones available using just scipy
>>> >> >> >
>>> >> >>
>>> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> >> >> > Some of them don't require the gradient and some of them can
>>> >> >> estimate
>>> >> >> > it. The reason to go without the gradient is to run the CRUSH
>>> >> >> > algorithm as a black box. In that case this would be the
>>> >> >> pseudo-code:
>>> >> >> >
>>> >> >> > - BEGIN CODE -
>>> >> >> > def build_target(desired_freqs):
>>> >> >> > def target(weights):
>>> >> >> > # run a simulation of CRUSH for a number of objects
>>> >> >> > sim_freqs = run_crush(weights)
>>> >> >> > # Kullback-Leibler divergence between desired
>>> >> >> frequencies and
>>> >> >> > current ones
>>> >> >> > return loss(sim_freqs, desired_freqs)
>>> >> >> > return target
>>> >> >> >
>>> >> >> > weights = scipy.optimize.minimize(build_target(desired_freqs), initial_weights)
>>> >> >> > - END CODE -
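The pseudo-code above can be made self-contained; as a sketch, with a toy single-pick `run_crush` and a crude gradient-free coordinate search standing in for scipy.optimize.minimize (all of the simulation details, step sizes, and frequencies here are illustrative assumptions):

```python
import math
import random

def run_crush(weights, samples=2000, seed=1):
    # Toy single-pick placement simulation: weighted choice per sample,
    # returning observed per-device frequencies.
    rng = random.Random(seed)
    total = sum(weights)
    hits = [0] * len(weights)
    for _ in range(samples):
        r = rng.uniform(0.0, total)
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                hits[i] += 1
                break
    return [h / samples for h in hits]

def loss(sim_freqs, desired_freqs):
    # Kullback-Leibler divergence between desired and simulated frequencies.
    return sum(p * math.log(p / q)
               for p, q in zip(desired_freqs, sim_freqs) if p > 0 and q > 0)

def build_target(desired_freqs):
    def target(weights):
        return loss(run_crush(weights), desired_freqs)
    return target

def minimize(target, x0, steps=25, delta=0.05):
    # Gradient-free coordinate search: try small moves along each axis
    # and keep any that lower the black-box loss.
    best_x, best = list(x0), target(x0)
    for _ in range(steps):
        for i in range(len(best_x)):
            for d in (delta, -delta):
                trial = list(best_x)
                trial[i] = max(1e-6, trial[i] + d)
                v = target(trial)
                if v < best:
                    best_x, best = trial, v
    return best_x

desired = [0.4, 0.3, 0.2, 0.1]
weights = minimize(build_target(desired), x0=[1.0, 1.0, 1.0, 1.0])
```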
>>> >> >> >
>>> >> >> > The tricky thing here is that this procedure can be slow if
>>> >> >> the
>>> >> >> > simulation (run_crush) needs to place a lot of objects to get
>>> >> >> accurate
>>> >> >> > simulated frequencies. This is true especially if the minimize
>>> >> >> method
>>> >> >> > attempts to approximate the gradient using finite differences
>>> >> >> since it
>>> >> >> > will evaluate the target function a number of times
>>> >> >> proportional to
>>> >> >> > the number of weights. Apart from the ones in scipy I would
>>> >> >> try also
>>> >> >> > optimization methods that try to perform as few evaluations as
>>> >> >> > possible like for example HyperOpt
>>> >> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>> >> >> into
>>> >> >> > account that the target function can be noisy.
>>> >> >> >
>>> >> >> > This black box approximation is simple to implement and makes
>>> >> >> the
>>> >> >> > computer do all the work instead of us.
>>> >> >> > I think that this black box approximation is worth trying
>>> >> >> even if
>>> >> >> > it's not the final one because if this approximation works
>>> >> >> then we
>>> >> >> > know that a more elaborate one that computes the gradient of
>>> >> >> the CRUSH
>>> >> >> > algorithm will work for sure.
>>> >> >> >
>>> >> >> > I can try this black box approximation this weekend not on the
>>> >> >> real
>>> >> >> > CRUSH algorithm but with the simple implementation I did in
>>> >> >> python. If
>>> >> >> > it works it's just a matter of substituting one simulation
>>> >> >> with
>>> >> >> > another and see what happens.
>>> >> >> >
>>> >> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >> >> > > Hi Pedro,
>>> >> >> > >
>>> >> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> >> >> > >> Hi Loic,
>>> >> >> > >>
>>> >> >> > >> From what I see everything seems OK.
>>> >> >> > >
>>> >> >> > > Cool. I'll keep going in this direction then !
>>> >> >> > >
>>> >> >> > >> The interesting thing would be to
>>> >> >> > >> test on some complex mapping. The reason is that
>>> >> >> "CrushPolicyFamily"
>>> >> >> > >> is right now modeling just a single straw bucket not the
>>> >> >> full CRUSH
>>> >> >> > >> algorithm.
>>> >> >> > >
>>> >> >> > > A number of use cases use a single straw bucket, maybe the
>>> >> >> majority of them. Even though it does not reflect the full range
>>> >> >> of what crush can offer, it could be useful. To be more
>>> >> >> specific, a crush map that states "place objects so that there
>>> >> >> is at most one replica per host" or "one replica per rack" is
>>> >> >> common. Such a crushmap can be reduced to a single straw bucket
>>> >> >> that contains all the hosts and by using the CrushPolicyFamily,
>>> >> >> we can change the weights of each host to fix the probabilities.
>>> >> >> The hosts themselves contain disks with varying weights but I
>>> >> >> think we can ignore that because crush will only recurse to
>>> >> >> place one object within a given host.
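For the single straw bucket of hosts described above, the second-pick adjustment derived at the top of the thread (reweight every item by w / (total - w)) is a one-liner; a sketch, using the host weights from the earlier simulation:

```python
def second_pick_weights(weights):
    # Reweight every item by w / (total - w): the conditional probability
    # weight for picking it second, given it was not picked first (the
    # adjustment derived at the start of this thread).
    total = sum(weights)
    return [w / (total - w) for w in weights]

# Hosts weighted [10, 10, 10, 10, 1], as in the earlier simulation.
adjusted = second_pick_weights([10.0, 10.0, 10.0, 10.0, 1.0])
```

This yields 10/31 for each big host and 1/40 for the small one, skewing the small host lower for the second round exactly as the derivation predicts.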
>>> >> >> > >
>>> >> >> > >> That's the work that remains to be done. The only way that
>>> >> >> > >> would avoid reimplementing the CRUSH algorithm and
>>> >> >> computing the
>>> >> >> > >> gradient would be treating CRUSH as a black box and
>>> >> >> eliminating the
>>> >> >> > >> necessity of computing the gradient either by using a
>>> >> >> gradient-free
>>> >> >> > >> optimization method or making an estimation of the
>>> >> >> gradient.
>>> >> >> > >
>>> >> >> > > By gradient-free optimization you mean simulated annealing
>>> >> >> or Monte Carlo ?
>>> >> >> > >
>>> >> >> > > Cheers
>>> >> >> > >
>>> >> >> > >>
>>> >> >> > >>
>>> >> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >> >> > >>> Hi,
>>> >> >> > >>>
>>> >> >> > >>> I modified the crush library to accept two weights (one
>>> >> >> for the first disk, the other for the remaining disks)[1]. This
>>> >> >> really is a hack for experimentation purposes only ;-) I was
>>> >> >> able to run a variation of your code[2] and got the following
>>> >> >> results which are encouraging. Do you think what I did is
>>> >> >> sensible ? Or is there a problem I don't see ?
>>> >> >> > >>>
>>> >> >> > >>> Thanks !
>>> >> >> > >>>
>>> >> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
>>> >> >> 6]
>>> >> >> > >>>
>>> >> >>
>>> >> >> ------------------------------------------------------------------------
>>> >> >> > >>> Before: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>> disk 0: 1.39e-01 1.12e-01
>>> >> >> > >>> disk 1: 1.11e-01 1.10e-01
>>> >> >> > >>> disk 2: 8.33e-02 1.13e-01
>>> >> >> > >>> disk 3: 1.39e-01 1.11e-01
>>> >> >> > >>> disk 4: 1.11e-01 1.11e-01
>>> >> >> > >>> disk 5: 8.33e-02 1.11e-01
>>> >> >> > >>> disk 6: 1.39e-01 1.12e-01
>>> >> >> > >>> disk 7: 1.11e-01 1.12e-01
>>> >> >> > >>> disk 8: 8.33e-02 1.10e-01
>>> >> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>> >> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>> >> >> > >>> ...
>>> >> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>> >> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>> >> >> > >>> Converged to desired accuracy :)
>>> >> >> > >>> After: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>> disk 0: 1.39e-01 1.42e-01
>>> >> >> > >>> disk 1: 1.11e-01 1.09e-01
>>> >> >> > >>> disk 2: 8.33e-02 8.37e-02
>>> >> >> > >>> disk 3: 1.39e-01 1.40e-01
>>> >> >> > >>> disk 4: 1.11e-01 1.13e-01
>>> >> >> > >>> disk 5: 8.33e-02 8.08e-02
>>> >> >> > >>> disk 6: 1.39e-01 1.38e-01
>>> >> >> > >>> disk 7: 1.11e-01 1.09e-01
>>> >> >> > >>> disk 8: 8.33e-02 8.48e-02
>>> >> >> > >>>
>>> >> >> > >>>
>>> >> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>> >> >> > >>>
>>> >> >>
>>> >> >> ------------------------------------------------------------------------
>>> >> >> > >>> Before: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>> disk 0: 2.44e-01 2.36e-01
>>> >> >> > >>> disk 1: 2.44e-01 2.38e-01
>>> >> >> > >>> disk 2: 2.44e-01 2.34e-01
>>> >> >> > >>> disk 3: 2.44e-01 2.38e-01
>>> >> >> > >>> disk 4: 2.44e-02 5.37e-02
>>> >> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>> >> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>> >> >> > >>> ...
>>> >> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>> >> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>> >> >> > >>> Converged to desired accuracy :)
>>> >> >> > >>> After: All replicas on each hard drive
>>> >> >> > >>> Expected vs actual use (20000 samples)
>>> >> >> > >>> disk 0: 2.44e-01 2.46e-01
>>> >> >> > >>> disk 1: 2.44e-01 2.44e-01
>>> >> >> > >>> disk 2: 2.44e-01 2.41e-01
>>> >> >> > >>> disk 3: 2.44e-01 2.45e-01
>>> >> >> > >>> disk 4: 2.44e-02 2.33e-02
>>> >> >> > >>>
>>> >> >> > >>>
>>> >> >> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>> >> >> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>> >> >> > >>>
>>> >> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> >> >> > >>>> Hi Pedro,
>>> >> >> > >>>>
>>> >> >> > >>>> It looks like trying to experiment with crush won't work
>>> >> >> as expected because crush does not distinguish the probability
>>> >> >> of selecting the first device from the probability of selecting
>>> >> >> the second or third device. Am I mistaken ?
>>> >> >> > >>>>
>>> >> >> > >>>> Cheers
>>> >> >> > >>>>
>>> >> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> >> >> > >>>>> Hi Pedro,
>>> >> >> > >>>>>
>>> >> >> > >>>>> I'm going to experiment with what you did at
>>> >> >> > >>>>>
>>> >> >> > >>>>>
>>> >> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> >> >> > >>>>>
>>> >> >> > >>>>> and the latest python-crush published today. A
>>> >> >> comparison function was added that will help measure the data
>>> >> >> movement. I'm hoping we can release an offline tool based on
>>> >> >> your solution. Please let me know if I should wait before diving
>>> >> >> into this, in case you have unpublished drafts or new ideas.
>>> >> >> > >>>>>
>>> >> >> > >>>>> Cheers
>>> >> >> > >>>>>
>>> >> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> >> >> > >>>>>> Great, thanks for the clarifications.
>>> >> >> > >>>>>> I also think that the most natural way is to keep just
>>> >> >> a set of
>>> >> >> > >>>>>> weights in the CRUSH map and update them inside the
>>> >> >> algorithm.
>>> >> >> > >>>>>>
>>> >> >> > >>>>>> I keep working on it.
>>> >> >> > >>>>>>
>>> >> >> > >>>>>>
>>> >> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>> >> >> <sage@newdream.net>:
>>> >> >> > >>>>>>> Hi Pedro,
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
>>> >> >> problem and we
>>> >> >> > >>>>>>> haven't made much headway.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> >> >> > >>>>>>>> Hi,
>>> >> >> > >>>>>>>>
>>> >> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
>>> >> >> much but I have
>>> >> >> > >>>>>>>> been thinking about it. In order to adapt the
>>> >> >> previous algorithm in
>>> >> >> > >>>>>>>> the python notebook I need to replace the iteration over all
>>> >> >> > >>>>>>>> possible device permutations with iteration over all the possible
>>> >> >> > >>>>>>>> selections that crush would make. That is the main
>>> >> >> thing I need to
>>> >> >> > >>>>>>>> work on.
>>> >> >> > >>>>>>>>
>>> >> >> > >>>>>>>> The other thing is of course that weights change for
>>> >> >> each replica.
>>> >> >> > >>>>>>>> That is, they cannot be really fixed in the crush
>>> >> >> map. So the
>>> >> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
>>> >> >> the map, need to be
>>> >> >> > >>>>>>>> changed. The weights in the crush map should reflect
>>> >> >> then, maybe, the
>>> >> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
>>> >> >> should have their own
>>> >> >> > >>>>>>>> crush map, but then the information about the
>>> >> >> previous selection
>>> >> >> > >>>>>>>> should be passed to the next replica placement run so
>>> >> >> it avoids
>>> >> >> > >>>>>>>> selecting the same one again.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> My suspicion is that the best solution here (whatever
>>> >> >> that means!)
>>> >> >> > >>>>>>> leaves the CRUSH weights intact with the desired
>>> >> >> distribution, and
>>> >> >> > >>>>>>> then generates a set of derivative weights--probably
>>> >> >> one set for each
>>> >> >> > >>>>>>> round/replica/rank.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> One nice property of this is that once the support is
>>> >> >> added to encode
>>> >> >> > >>>>>>> multiple sets of weights, the algorithm used to
>>> >> >> generate them is free to
>>> >> >> > >>>>>>> change and evolve independently. (In most cases any
>>> >> >> change is
>>> >> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>> >> >> because all
>>> >> >> > >>>>>>> parties participating in the cluster have to support
>>> >> >> any new behavior
>>> >> >> > >>>>>>> before it is enabled or used.)
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>>> I have a question also. Is there any significant
>>> >> >> difference between
>>> >> >> > >>>>>>>> the device selection algorithm description in the
>>> >> >> paper and its final
>>> >> >> > >>>>>>>> implementation?
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
>>> >> >> found to be a bad
>>> >> >> > >>>>>>> idea; any collision or failed()/overload() case
>>> >> >> triggers the
>>> >> >> > >>>>>>> retry_descent.
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> There are other changes, of course, but I don't think
>>> >> >> they'll impact any
>>> >> >> > >>>>>>> solution we come with here (or at least any solution
>>> >> >> can be suitably
>>> >> >> > >>>>>>> adapted)!
>>> >> >> > >>>>>>>
>>> >> >> > >>>>>>> sage
>>> >> >> > >>>>>> --
>>> >> >> > >>>>>> To unsubscribe from this list: send the line
>>> >> >> "unsubscribe ceph-devel" in
>>> >> >> > >>>>>> the body of a message to majordomo@vger.kernel.org
>>> >> >> > >>>>>> More majordomo info at
>>> >> >> http://vger.kernel.org/majordomo-info.html
>>> >> >> > >>>>>>
>>> >> >> > >>>>>
>>> >> >> > >>>>
>>> >> >> > >>>
>>> >> >> > >>> --
>>> >> >> > >>> Loïc Dachary, Artisan Logiciel Libre
>>> >> >> > >
>>> >> >> > > --
>>> >> >> > > Loïc Dachary, Artisan Logiciel Libre
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >> --
>>> >> Loïc Dachary, Artisan Logiciel Libre
>>> >
>>> >
>>>
>>>
--
Spandan Kumar Sahu
IIT Kharagpur
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-28 6:52 ` Adam Kupczyk
2017-03-28 9:49 ` Spandan Kumar Sahu
@ 2017-03-28 13:35 ` Sage Weil
1 sibling, 0 replies; 70+ messages in thread
From: Sage Weil @ 2017-03-28 13:35 UTC (permalink / raw)
To: Adam Kupczyk; +Cc: Pedro López-Adeva, Loic Dachary, Ceph Development
[-- Attachment #1: Type: TEXT/PLAIN, Size: 31263 bytes --]
On Tue, 28 Mar 2017, Adam Kupczyk wrote:
> "... or simply have a single global set"
>
> No. Proof by example:
>
> I once attempted to perfectly balance cluster X by modifying crush weights.
> Pool A spanned over 352 OSDs (set A)
> Pool B spanned over 176 OSDs (set B, half of A)
> The result (simulated perfect balance) was that obtained weights had
> - small variance for B (5%),
> - small variance for A-B (5%),
> - huge variance for A (800%).
> This was of course because crush had to be strongly discouraged to
> pick from B, when performing placement for A.
FWIW in this situation I think we should aim to have the B OSDs more
utilized than the A-B OSDs by exactly the amount of data in pool B divided
by 176. We should not try to make the A and A-B sets have equal
utilization because the rules do not suggest that we should. Does that
make sense? I.e., if we treat each pool's placement in isolation by
*only* considering the PGs from pool A, then we should aim for
perfect balance across A, and when we look only at B PGs we should see
perfect balance across B, and the result will be that A-B will have more
PGs.
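[Editorial note: the per-pool arithmetic above can be checked with a tiny sketch. This is not Ceph code, and the data sizes below are invented purely for illustration; only the OSD counts (352 and 176) come from the example.]

```python
def expected_load(data_a, data_b, n_a=352, n_b=176):
    """Per-OSD load when pool A spans n_a OSDs and pool B spans a
    subset of n_b of them, each pool balanced in isolation."""
    in_b = data_a / n_a + data_b / n_b   # B OSDs carry a share of both pools
    outside_b = data_a / n_a             # A-B OSDs carry only pool A data
    return in_b, outside_b

# Invented data sizes (TB) purely for illustration:
in_b, outside_b = expected_load(data_a=700.0, data_b=176.0)
# B OSDs end up more utilized than A-B OSDs by exactly data_b / 176:
assert abs((in_b - outside_b) - 176.0 / 176) < 1e-12
```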
> "...crush users can choose..."
> For each pool there is only one vector of weights that will provide
> perfect balance. (math note: actually multiple of them, but different
> by scale)
Yeah, although again CRUSH doesn't need to be perfect here, just better;
the new OSDMap remap can always fix up the loose ends to take the final
step to perfect.
> I cannot at the moment imagine any other practical metrics other than
> balancing. But maybe it is just failure of imagination.
I can see us looking at other dimensions (e.g., trying to maximize the
number of replica sets that span disk models) where there is no
correlation to the hierarchy, but I'm not sure that fiddling with weights
will really get us anywhere.
Also, the new device class hierarchies Loic just added could be expressed
as alternative sets of bucket weights instead of the shadow hierarchy.
Once the admin makes the leap to luminous compatibility as the baseline
the map could compile to that instead of generating the hidden buckets it
does now.
sage
>
> On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
> > On Mon, 27 Mar 2017, Adam Kupczyk wrote:
> >> Hi,
> >>
> >> My understanding is that optimal tweaked weights will depend on:
> >> 1) pool_id, because of rjenkins(pool_id) in crush
> >> 2) number of placement groups and replication factor, as it determines
> >> amount of samples
> >>
> >> Therefore tweaked weights should rather be a property of an instantiated pool,
> >> not of the crush placement definition.
> >>
> >> If tweaked weights are to be part of the crush definition, then for each
> >> created pool we need to have a separate list of weights.
> >> Is it possible to provide clients with different weights depending on
> >> which pool they want to operate on?
> >
> > As Loic suggested, you can create as many derivative hierarchies in the
> > crush map as you like, potentially one per pool. Or you could treat the
> > sum total of all pgs as the interesting set, balance those, and get some
> > OSDs doing a bit more of one pool than another. The new post-CRUSH OSD
> > remap capability can always clean this up (and turn a "good" crush
> > distribution into a perfect distribution).
> >
> > I guess the question is: when we add the explicit adjusted weight matrix
> > to crush should we have multiple sets of weights (perhaps one for each
> > pool), or simply have a single global set. It might make sense to allow N
> > sets of adjusted weights so that the crush users can choose a particular
> > set of them for different pools (or whatever it is they're calculating the
> > mapping for)..
> >
> > sage
> >
> >
> >>
> >> Best regards,
> >> Adam
> >>
> >> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> >> > Hi,
> >> >
> >> > My understanding is that optimal tweaked weights will depend on:
> >> > 1) pool_id, because of rjenkins(pool_id) in crush
> >> > 2) number of placement groups and replication factor, as it determines
> >> > amount of samples
> >> >
> >> > Therefore tweaked weights should rather be a property of an instantiated pool,
> >> > not of the crush placement definition.
> >> >
> >> > If tweaked weights are to be part of the crush definition, then for each created
> >> > pool we need to have a separate list of weights.
> >> > Is it possible to provide clients with different weights depending on
> >> > which pool they want to operate on?
> >> >
> >> > Best regards,
> >> > Adam
> >> >
> >> >
> >> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
> >> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> >> >> >> Hello Sage, Loic, Pedro,
> >> >> >>
> >> >> >>
> >> >> >> I am certain that almost perfect mapping can be achieved by
> >> >> >> substituting weights from crush map with slightly modified weights.
> >> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly
> >> >> >> proportional to weights specified in crush map.
> >> >> >>
> >> >> >> 1. Example
> >> >> >> Lets think of PGs of single object pool.
> >> >> >> We have OSDs with following weights:
> >> >> >> [10, 10, 10, 5, 5]
> >> >> >>
> >> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
> >> >> >> PGcopies :
> >> >> >> [150, 150, 150, 75, 75]
> >> >> >>
> >> >> >> However, because crush simulates random process we have:
> >> >> >> [143, 152, 158, 71, 76]
> >> >> >>
> >> >> >> We could have obtained perfect distribution had we used weights like
> >> >> >> this:
> >> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
> >> >> >>
> >> >> >>
> >> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
> >> >> >>
> >> >> >> When we apply crush for the first time, distribution of PGs comes as
> >> >> >> random.
> >> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> >> >> >>
> >> >> >> But CRUSH is not a random process at all, it behaves in a numerically stable
> >> >> >> way.
> >> >> >> Specifically, if we increase weight on one node, we will get more PGs
> >> >> >> on
> >> >> >> this node and less on every other node:
> >> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> >> >> >>
> >> >> >> Now, finding ideal weights can be done by any numerical minimization
> >> >> >> method,
> >> >> >> for example NLMS.
> >> >> >>
> >> >> >>
> >> >> >> 3. The proposal
> >> >> >> For each pool, from initial weights given in crush map perfect weights
> >> >> >> will
> >> >> >> be derived.
> >> >> >> These weights will be used to calculate PG distribution. This of course
> >> >> >> will
> >> >> >> be close to perfect.
> >> >> >>
> >> >> >> 3a: Downside when OSD is out
> >> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
> >> >> >> Because now weights deviate from OSD capacity, some OSDs will
> >> >> >> statistically
> >> >> >> get more copies than they should.
> >> >> >> This unevenness in distribution is proportional to the scale of the
> >> >> >> deviation of
> >> >> >> calculated weights from capacity weights.
> >> >> >>
> >> >> >> 3b: Upside
> >> >> >> This all can be achieved without changes to crush.
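[Editorial note: a hedged, runnable sketch of the idea above. This is an illustration only, not libcrush: drawing R replicas without replacement (successive sampling) stands in for CRUSH, and a simple fixed-point rescaling stands in for NLMS. The weights [10, 10, 10, 5, 5] come from the example above.]

```python
from itertools import permutations

def inclusion_probs(weights, r=3):
    """Exact P(device receives a replica) when r replicas are drawn from
    weighted devices without replacement (successive sampling)."""
    n = len(weights)
    probs = [0.0] * n
    for order in permutations(range(n), r):   # enumerate ordered draws
        p, remaining = 1.0, float(sum(weights))
        for i in order:
            p *= weights[i] / remaining
            remaining -= weights[i]
        for i in order:
            probs[i] += p
    return probs

target = [10.0, 10.0, 10.0, 5.0, 5.0]
want = [3.0 * w / sum(target) for w in target]  # desired replica share per device
weights = target[:]
for _ in range(200):                            # numerical fixed-point iteration
    got = inclusion_probs(weights)
    weights = [w * wnt / g for w, g, wnt in zip(weights, got, want)]
```

After the loop the adjusted weights reproduce the target distribution: the small devices' inputs end up below their capacity weights, compensating for the multipick anomaly.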
> >> >> >
> >> >> > Yes!
> >> >> >
> >> >> > And no. You're totally right--we should use an offline optimization to
> >> >> > tweak the crush input weights to get a better balance. It won't be
> >> >> > robust
> >> >> > to changes to the cluster, but we can incrementally optimize after that
> >> >> > happens to converge on something better.
> >> >> >
> >> >> > The problem with doing this with current versions of Ceph is that we
> >> >> > lose
> >> >> > the original "input" or "target" weights (i.e., the actual size of
> >> >> > the OSD) that we want to converge on. This is one reason why we haven't
> >> >> > done something like this before.
> >> >> >
> >> >> > In luminous we *could* work around this by storing those canonical
> >> >> > weights outside of crush using something (probably?) ugly and
> >> >> > maintain backward compatibility with older clients using existing
> >> >> > CRUSH behavior.
> >> >>
> >> >> These canonical weights could be stored in crush by creating dedicated
> >> >> buckets. For instance the root-canonical bucket could be created to store
> >> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
> >> >> the difference and know to add a new device in the host01-canonical bucket
> >> >> instead of the host01 bucket. And to run an offline tool to keep the two
> >> >> buckets in sync and compute the weight to use for placement derived from the
> >> >> weights representing the device capacity.
> >> >>
> >> >> It is a little bit ugly ;-)
> >> >>
> >> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
> >> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
> >> >> > to
> >> >> > include a set of derivative weights used for actual placement
> >> >> > calculations
> >> >> > instead of the canonical target weights, and we can do what you're
> >> >> > proposing *and* solve the multipick problem with one change in the crush
> >> >> > map and algorithm. (Actually choosing those derivative weights will
> >> >> > be an offline process that can both improve the balance for the inputs
> >> >> > we
> >> >> > care about *and* adjust them based on the position to fix the skew issue
> >> >> > for replicas.) This doesn't help pre-luminous clients, but I think the
> >> >> > end solution will be simpler and more elegant...
> >> >> >
> >> >> > What do you think?
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >> 4. Extra
> >> >> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
> >> >> >> cluster.
> >> >> >> It succeeded.
> >> >> >> A solution was not accepted, because modifications of OSD weights were
> >> >> >> higher
> >> >> >> than 50%, which was caused by the fact that different placement rules
> >> >> >> operated
> >> >> >> on different sets of OSDs, and those sets were not disjointed.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Best regards,
> >> >> >> Adam
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> >> >> >> Hi Pedro, Loic,
> >> >> >>
> >> >> >> For what it's worth, my intuition here (which has had a mixed
> >> >> >> record as
> >> >> >> far as CRUSH goes) is that this is the most promising path
> >> >> >> forward.
> >> >> >>
> >> >> >> Thinking ahead a few steps, and confirming that I'm following
> >> >> >> the
> >> >> >> discussion so far, if you're able to do get black (or white) box
> >> >> >> gradient
> >> >> >> descent to work, then this will give us a set of weights for
> >> >> >> each item in
> >> >> >> the tree for each selection round, derived from the tree
> >> >> >> structure and
> >> >> >> original (target) weights. That would basically give us a map
> >> >> >> of item id
> >> >> >> (bucket id or leaf item id) to weight for each round. i.e.,
> >> >> >>
> >> >> >> map<int, map<int, float>> weight_by_position; // position ->
> >> >> >> item -> weight
> >> >> >>
> >> >> >> where the 0 round would (I think?) match the target weights, and
> >> >> >> each
> >> >> >> round after that would skew low-weighted items lower to some
> >> >> >> degree.
> >> >> >> Right?
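[Editorial note: a minimal sketch of such a map for a toy bucket with target weights [1.0, 1.0, 1.0, 0.1]. The item ids are invented; the round-1 values apply the w / (total - w) conditional adjustment discussed earlier in this thread, with total = 3.1.]

```python
# position (replica rank) -> item id -> weight; round 0 holds the
# canonical target weights, later rounds skew low-weighted items lower.
weight_by_position = {
    0: {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.1},
    1: {1: 1.0 / 2.1, 2: 1.0 / 2.1, 3: 1.0 / 2.1, 4: 0.1 / 3.0},
}

def weight_for(position, item):
    """Adjusted weight for a replica rank, falling back to the round-0
    (target) weights when a round has no override."""
    return weight_by_position.get(position, weight_by_position[0])[item]
```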
> >> >> >>
> >> >> >> The next question I have is: does this generalize from the
> >> >> >> single-bucket
> >> >> >> case to the hierarchy? I.e., if I have a "tree" (single bucket)
> >> >> >> like
> >> >> >>
> >> >> >> 3.1
> >> >> >> |_____________
> >> >> >> | \ \ \
> >> >> >> 1.0 1.0 1.0 .1
> >> >> >>
> >> >> >> it clearly works, but when we have a multi-level tree like
> >> >> >>
> >> >> >>
> >> >> >> 8.4
> >> >> >> |____________________________________
> >> >> >> | \ \
> >> >> >> 3.1 3.1 2.2
> >> >> >> |_____________ |_____________ |_____________
> >> >> >> | \ \ \ | \ \ \ | \ \ \
> >> >> >> 1.0 1.0 1.0 .1 1.0 1.0 1.0 .1 1.0 1.0 .1 .1
> >> >> >>
> >> >> >> and the second round weights skew the small .1 leaves lower, can
> >> >> >> we
> >> >> >> continue to build the summed-weight hierarchy, such that the
> >> >> >> adjusted
> >> >> >> weights at the higher level are appropriately adjusted to give
> >> >> >> us the
> >> >> >> right probabilities of descending into those trees? I'm not
> >> >> >> sure if that
> >> >> >> logically follows from the above or if my intuition is
> >> >> >> oversimplifying
> >> >> >> things.
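[Editorial note: as a hedged starting point for the question above, one can apply the per-leaf second-round adjustment w / (total - w) from earlier in the thread to the two-level tree drawn above and re-sum the buckets. The bucket names are invented; whether descending by these summed weights reproduces the right conditional probabilities is exactly the open question.]

```python
# Leaves of the two-level tree drawn above (total weight 8.4).
tree = {"b1": [1.0, 1.0, 1.0, 0.1],
        "b2": [1.0, 1.0, 1.0, 0.1],
        "b3": [1.0, 1.0, 0.1, 0.1]}
total = sum(sum(leaves) for leaves in tree.values())

# Second-round conditional adjustment applied at each leaf...
adjusted = {b: [w / (total - w) for w in leaves] for b, leaves in tree.items()}
# ...then summed upward to give adjusted bucket weights for the descent.
bucket_weights = {b: sum(ws) for b, ws in adjusted.items()}
```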
> >> >> >>
> >> >> >> If this *is* how we think this will shake out, then I'm
> >> >> >> wondering if we
> >> >> >> should go ahead and build this weight matrix into CRUSH sooner
> >> >> >> rather
> >> >> >> than later (i.e., for luminous). As with the explicit
> >> >> >> remappings, the
> >> >> >> hard part is all done offline, and the adjustments to the CRUSH
> >> >> >> mapping
> >> >> >> calculation itself (storing and making use of the adjusted
> >> >> >> weights for
> >> >> >> each round of placement) are relatively straightforward. And
> >> >> >> the sooner
> >> >> >> this is incorporated into a release the sooner real users will
> >> >> >> be able to
> >> >> >> roll out code to all clients and start making use of it.
> >> >> >>
> >> >> >> Thanks again for looking at this problem! I'm excited that we
> >> >> >> may be
> >> >> >> closing in on a real solution!
> >> >> >>
> >> >> >> sage
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> >> >> >>
> >> >> >> > There are lot of gradient-free methods. I will try first to
> >> >> >> run the
> >> >> >> > ones available using just scipy
> >> >> >> >
> >> >> >>
> >> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> >> >> >> > Some of them don't require the gradient and some of them can
> >> >> >> estimate
> >> >> >> > it. The reason to go without the gradient is to run the CRUSH
> >> >> >> > algorithm as a black box. In that case this would be the
> >> >> >> pseudo-code:
> >> >> >> >
> >> >> >> > - BEGIN CODE -
> >> >> >> > def build_target(desired_freqs):
> >> >> >> > def target(weights):
> >> >> >> > # run a simulation of CRUSH for a number of objects
> >> >> >> > sim_freqs = run_crush(weights)
> >> >> >> > # Kullback-Leibler divergence between desired
> >> >> >> frequencies and
> >> >> >> > current ones
> >> >> >> > return loss(sim_freqs, desired_freqs)
> >> >> >> > return target
> >> >> >> >
> >> >> >> > weights = scipy.optimize.minimize(build_target(desired_freqs))
> >> >> >> > - END CODE -
> >> >> >> >
> >> >> >> > The tricky thing here is that this procedure can be slow if
> >> >> >> the
> >> >> >> > simulation (run_crush) needs to place a lot of objects to get
> >> >> >> accurate
> >> >> >> > simulated frequencies. This is true especially if the minimize
> >> >> >> method
> >> >> >> > attempts to approximate the gradient using finite differences
> >> >> >> since it
> >> >> >> > will evaluate the target function a number of times
> >> >> >> proportional to
> >> >> >> > the number of weights. Apart from the ones in scipy I would
> >> >> >> try also
> >> >> >> > optimization methods that try to perform as few evaluations as
> >> >> >> > possible like for example HyperOpt
> >> >> >> > (http://hyperopt.github.io/hyperopt/), which by the way takes
> >> >> >> into
> >> >> >> > account that the target function can be noisy.
> >> >> >> >
> >> >> >> > This black box approximation is simple to implement and makes
> >> >> >> the
> >> >> >> > computer do all the work instead of us.
> >> >> >> > I think that this black box approximation is worthy to try
> >> >> >> even if
> >> >> >> > it's not the final one because if this approximation works
> >> >> >> then we
> >> >> >> > know that a more elaborate one that computes the gradient of
> >> >> >> the CRUSH
> >> >> >> > algorithm will work for sure.
> >> >> >> >
> >> >> >> > I can try this black box approximation this weekend not on the
> >> >> >> real
> >> >> >> > CRUSH algorithm but with the simple implementation I did in
> >> >> >> python. If
> >> >> >> > it works it's just a matter of substituting one simulation
> >> >> >> with
> >> >> >> > another and see what happens.
> >> >> >> >
> >> >> >> > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >> >> > > Hi Pedro,
> >> >> >> > >
> >> >> >> > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> >> >> > >> Hi Loic,
> >> >> >> > >>
> >> >> >> > >>>From what I see everything seems OK.
> >> >> >> > >
> >> >> >> > > Cool. I'll keep going in this direction then !
> >> >> >> > >
> >> >> >> > >> The interesting thing would be to
> >> >> >> > >> test on some complex mapping. The reason is that
> >> >> >> "CrushPolicyFamily"
> >> >> >> > >> is right now modeling just a single straw bucket not the
> >> >> >> full CRUSH
> >> >> >> > >> algorithm.
> >> >> >> > >
> >> >> >> > > A number of use cases use a single straw bucket, maybe the
> >> >> >> majority of them. Even though it does not reflect the full range
> >> >> >> of what crush can offer, it could be useful. To be more
> >> >> >> specific, a crush map that states "place objects so that there
> >> >> >> is at most one replica per host" or "one replica per rack" is
> >> >> >> common. Such a crushmap can be reduced to a single straw bucket
> >> >> >> that contains all the hosts and by using the CrushPolicyFamily,
> >> >> >> we can change the weights of each host to fix the probabilities.
> >> >> >> The hosts themselves contain disks with varying weights but I
> >> >> >> think we can ignore that because crush will only recurse to
> >> >> >> place one object within a given host.
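[Editorial note: the reduction described above can be sketched in a few lines. The host names and disk weights are invented for illustration.]

```python
# Hosts with varying disk weights (invented values).
hosts = {"host01": [1.0, 1.0], "host02": [1.0, 0.5], "host03": [0.5, 0.5]}

# "At most one replica per host" means crush recurses into a host at most
# once, so for weight-tweaking purposes the map reduces to a single
# straw-like bucket whose items are the hosts, weighted by their disk sums.
bucket = {name: sum(disks) for name, disks in hosts.items()}
```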
> >> >> >> > >
> >> >> >> > >> That's the work that remains to be done. The only way that
> >> >> >> > >> would avoid reimplementing the CRUSH algorithm and
> >> >> >> computing the
> >> >> >> > >> gradient would be treating CRUSH as a black box and
> >> >> >> eliminating the
> >> >> >> > >> necessity of computing the gradient either by using a
> >> >> >> gradient-free
> >> >> >> > >> optimization method or making an estimation of the
> >> >> >> gradient.
> >> >> >> > >
> >> >> >> > > By gradient-free optimization you mean simulated annealing
> >> >> >> or Monte Carlo ?
> >> >> >> > >
> >> >> >> > > Cheers
> >> >> >> > >
> >> >> >> > >>
> >> >> >> > >>
> >> >> >> > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >> >> > >>> Hi,
> >> >> >> > >>>
> >> >> >> > >>> I modified the crush library to accept two weights (one
> >> >> >> for the first disk, the other for the remaining disks)[1]. This
> >> >> >> really is a hack for experimentation purposes only ;-) I was
> >> >> >> able to run a variation of your code[2] and got the following
> >> >> >> results which are encouraging. Do you think what I did is
> >> >> >> sensible ? Or is there a problem I don't see ?
> >> >> >> > >>>
> >> >> >> > >>> Thanks !
> >> >> >> > >>>
> >> >> >> > >>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8
> >> >> >> 6]
> >> >> >> > >>>
> >> >> >>
> >> >> >> ------------------------------------------------------------------------
> >> >> >> > >>> Before: All replicas on each hard drive
> >> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> >> > >>> disk 0: 1.39e-01 1.12e-01
> >> >> >> > >>> disk 1: 1.11e-01 1.10e-01
> >> >> >> > >>> disk 2: 8.33e-02 1.13e-01
> >> >> >> > >>> disk 3: 1.39e-01 1.11e-01
> >> >> >> > >>> disk 4: 1.11e-01 1.11e-01
> >> >> >> > >>> disk 5: 8.33e-02 1.11e-01
> >> >> >> > >>> disk 6: 1.39e-01 1.12e-01
> >> >> >> > >>> disk 7: 1.11e-01 1.12e-01
> >> >> >> > >>> disk 8: 8.33e-02 1.10e-01
> >> >> >> > >>> it= 1 jac norm=1.59e-01 loss=5.27e-03
> >> >> >> > >>> it= 2 jac norm=1.55e-01 loss=5.03e-03
> >> >> >> > >>> ...
> >> >> >> > >>> it= 212 jac norm=1.02e-03 loss=2.41e-07
> >> >> >> > >>> it= 213 jac norm=1.00e-03 loss=2.31e-07
> >> >> >> > >>> Converged to desired accuracy :)
> >> >> >> > >>> After: All replicas on each hard drive
> >> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> >> > >>> disk 0: 1.39e-01 1.42e-01
> >> >> >> > >>> disk 1: 1.11e-01 1.09e-01
> >> >> >> > >>> disk 2: 8.33e-02 8.37e-02
> >> >> >> > >>> disk 3: 1.39e-01 1.40e-01
> >> >> >> > >>> disk 4: 1.11e-01 1.13e-01
> >> >> >> > >>> disk 5: 8.33e-02 8.08e-02
> >> >> >> > >>> disk 6: 1.39e-01 1.38e-01
> >> >> >> > >>> disk 7: 1.11e-01 1.09e-01
> >> >> >> > >>> disk 8: 8.33e-02 8.48e-02
> >> >> >> > >>>
> >> >> >> > >>>
> >> >> >> > >>> Simulation: R=2 devices capacity [10 10 10 10 1]
> >> >> >> > >>>
> >> >> >>
> >> >> >> ------------------------------------------------------------------------
> >> >> >> > >>> Before: All replicas on each hard drive
> >> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> >> > >>> disk 0: 2.44e-01 2.36e-01
> >> >> >> > >>> disk 1: 2.44e-01 2.38e-01
> >> >> >> > >>> disk 2: 2.44e-01 2.34e-01
> >> >> >> > >>> disk 3: 2.44e-01 2.38e-01
> >> >> >> > >>> disk 4: 2.44e-02 5.37e-02
> >> >> >> > >>> it= 1 jac norm=2.43e-01 loss=2.98e-03
> >> >> >> > >>> it= 2 jac norm=2.28e-01 loss=2.47e-03
> >> >> >> > >>> ...
> >> >> >> > >>> it= 37 jac norm=1.28e-03 loss=3.48e-08
> >> >> >> > >>> it= 38 jac norm=1.07e-03 loss=2.42e-08
> >> >> >> > >>> Converged to desired accuracy :)
> >> >> >> > >>> After: All replicas on each hard drive
> >> >> >> > >>> Expected vs actual use (20000 samples)
> >> >> >> > >>> disk 0: 2.44e-01 2.46e-01
> >> >> >> > >>> disk 1: 2.44e-01 2.44e-01
> >> >> >> > >>> disk 2: 2.44e-01 2.41e-01
> >> >> >> > >>> disk 3: 2.44e-01 2.45e-01
> >> >> >> > >>> disk 4: 2.44e-02 2.33e-02
> >> >> >> > >>>
> >> >> >> > >>>
> >> >> >> > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >> >> >> > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >> >> >> > >>>
> >> >> >> > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >> >> >> > >>>> Hi Pedro,
> >> >> >> > >>>>
> >> >> >> > >>>> It looks like trying to experiment with crush won't work
> >> >> >> as expected because crush does not distinguish the probability
> >> >> >> of selecting the first device from the probability of selecting
> >> >> >> the second or third device. Am I mistaken ?
> >> >> >> > >>>>
> >> >> >> > >>>> Cheers
> >> >> >> > >>>>
> >> >> >> > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >> >> >> > >>>>> Hi Pedro,
> >> >> >> > >>>>>
> >> >> >> > >>>>> I'm going to experiment with what you did at
> >> >> >> > >>>>>
> >> >> >> > >>>>>
> >> >> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >> >> >> > >>>>>
> >> >> >> > >>>>> and the latest python-crush published today. A
> >> >> >> comparison function was added that will help measure the data
> >> >> >> movement. I'm hoping we can release an offline tool based on
> >> >> >> your solution. Please let me know if I should wait before diving
> >> >> >> into this, in case you have unpublished drafts or new ideas.
> >> >> >> > >>>>>
> >> >> >> > >>>>> Cheers
> >> >> >> > >>>>>
> >> >> >> > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >> >> >> > >>>>>> Great, thanks for the clarifications.
> >> >> >> > >>>>>> I also think that the most natural way is to keep just
> >> >> >> a set of
> >> >> >> > >>>>>> weights in the CRUSH map and update them inside the
> >> >> >> algorithm.
> >> >> >> > >>>>>>
> >> >> >> > >>>>>> I keep working on it.
> >> >> >> > >>>>>>
> >> >> >> > >>>>>>
> >> >> >> > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> >> >> >> <sage@newdream.net>:
> >> >> >> > >>>>>>> Hi Pedro,
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> Thanks for taking a look at this! It's a frustrating
> >> >> >> problem and we
> >> >> >> > >>>>>>> haven't made much headway.
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >> >> >> > >>>>>>>> Hi,
> >> >> >> > >>>>>>>>
> >> >> >> > >>>>>>>> I will have a look. BTW, I have not progressed that
> >> >> >> much but I have
> >> >> >> > >>>>>>>> been thinking about it. In order to adapt the
> >> >> >> previous algorithm in
> >> >> >> > >>>>>>>> the python notebook I need to substitute the
> >> >> >> iteration over all
> >> >> >> > >>>>>>>> possible device permutations with iteration over all
> >> >> >> the possible
> >> >> >> > >>>>>>>> selections that crush would make. That is the main
> >> >> >> thing I need to
> >> >> >> > >>>>>>>> work on.
> >> >> >> > >>>>>>>>
> >> >> >> > >>>>>>>> The other thing is of course that weights change for
> >> >> >> each replica.
> >> >> >> > >>>>>>>> That is, they cannot be really fixed in the crush
> >> >> >> map. So the
> >> >> >> > >>>>>>>> algorithm inside libcrush, not only the weights in
> >> >> >> the map, need to be
> >> >> >> > >>>>>>>> changed. The weights in the crush map should reflect
> >> >> >> then, maybe, the
> >> >> >> > >>>>>>>> desired usage frequencies. Or maybe each replica
> >> >> >> should have its own
> >> >> >> > >>>>>>>> crush map, but then the information about the
> >> >> >> previous selection
> >> >> >> > >>>>>>>> should be passed to the next replica placement run so
> >> >> >> it avoids
> >> >> >> > >>>>>>>> selecting the same one again.
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> My suspicion is that the best solution here (whatever
> >> >> >> that means!)
> >> >> >> > >>>>>>> leaves the CRUSH weights intact with the desired
> >> >> >> distribution, and
> >> >> >> > >>>>>>> then generates a set of derivative weights--probably
> >> >> >> one set for each
> >> >> >> > >>>>>>> round/replica/rank.
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> One nice property of this is that once the support is
> >> >> >> added to encode
> >> >> >> > >>>>>>> multiple sets of weights, the algorithm used to
> >> >> >> generate them is free to
> >> >> >> > >>>>>>> change and evolve independently. (In most cases any
> >> >> >> change in
> >> >> >> > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> >> >> >> because all
> >> >> >> > >>>>>>> parties participating in the cluster have to support
> >> >> >> any new behavior
> >> >> >> > >>>>>>> before it is enabled or used.)
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>>> I have a question also. Is there any significant
> >> >> >> difference between
> >> >> >> > >>>>>>>> the device selection algorithm description in the
> >> >> >> paper and its final
> >> >> >> > >>>>>>>> implementation?
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> The main difference is the "retry_bucket" behavior was
> >> >> >> found to be a bad
> >> >> >> > >>>>>>> idea; any collision or failed()/overload() case
> >> >> >> triggers the
> >> >> >> > >>>>>>> retry_descent.
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> There are other changes, of course, but I don't think
> >> >> >> they'll impact any
> >> >> >> > >>>>>>> solution we come up with here (or at least any solution
> >> >> >> can be suitably
> >> >> >> > >>>>>>> adapted)!
> >> >> >> > >>>>>>>
> >> >> >> > >>>>>>> sage
> >> >> >> > >>>>>
> >> >> >> > >>>>
> >> >> >> > >>>
> >> >> >> > >>> --
> >> >> >> > >>> Loïc Dachary, Artisan Logiciel Libre
> >> >> >> > >
> >> >> >> > > --
> >> >> >> > > Loïc Dachary, Artisan Logiciel Libre
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >> --
> >> >> Loïc Dachary, Artisan Logiciel Libre
> >> >
> >> >
> >>
> >>
>
>
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-23 15:32 ` Pedro López-Adeva
2017-03-23 16:18 ` Loic Dachary
2017-03-25 18:42 ` Sage Weil
@ 2017-04-11 15:22 ` Loic Dachary
2017-04-22 16:51 ` Loic Dachary
3 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-04-11 15:22 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
A short update to let you know that the changes to crush allowing multiple weights per item are well under way[1]. They should be merged next week and will make it possible to effectively use your optimization. A new version of Ceph is going to be published in the next few weeks and will also contain these modifications.
Cheers
[1] http://libcrush.org/main/libcrush/commit/49b6043d6b85197a49e70cbbcfe411d92983f501
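To illustrate why per-round weights matter, the adjustment derived at the top of this thread — using w / (total - w) as the weight for the second round — can be checked with a small toy simulation. This is plain Python, not libcrush; `pick` and `simulate` are ad-hoc helpers written for this sketch:

```python
import random

random.seed(42)  # deterministic toy run

def pick(weights, exclude=None):
    # Weighted random choice over device indices, optionally skipping
    # the device already chosen (ad-hoc helper, not libcrush).
    items = [(i, w) for i, w in enumerate(weights) if i != exclude]
    total = sum(w for _, w in items)
    r = random.uniform(0, total)
    for i, w in items:
        r -= w
        if r <= 0:
            return i
    return items[-1][0]

def simulate(first_weights, second_weights, samples=100000):
    # Place R=2 replicas per object; return per-device placement frequency.
    counts = [0] * len(first_weights)
    for _ in range(samples):
        first = pick(first_weights)
        counts[first] += 1
        second = pick(second_weights, exclude=first)
        counts[second] += 1
    return [c / (2.0 * samples) for c in counts]

weights = [10, 10, 10, 10, 1]
total = sum(weights)
target = [w / total for w in weights]  # 1/41 for the small device

# Same weights on both rounds: the small device gets too much data.
naive = simulate(weights, weights)

# Second round reweighted to P(pick i | i not already picked),
# i.e. w / (total - w), as derived at the top of the thread.
adjusted = [w / (total - w) for w in weights]
fixed = simulate(weights, adjusted)
```

With these numbers the naive run overshoots the 1/41 target frequency for the small device, while the adjusted second round lands within sampling noise of it.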
On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
>
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies
>         # and current ones
>         return loss(sim_freqs, desired_freqs)
>     return target
>
> # scipy.optimize.minimize needs an initial guess (x0)
> res = scipy.optimize.minimize(build_target(desired_freqs), x0=initial_weights)
> weights = res.x
> - END CODE -
>
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is especially true if the minimize method
> attempts to approximate the gradient using finite differences, since it
> will evaluate the target function a number of times proportional to
> the number of weights. Apart from the ones in scipy I would also try
> optimization methods that try to perform as few evaluations as
> possible, for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
>
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worth trying even if
> it's not the final one, because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
>
> I can try this black box approximation this weekend, not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation for
> another and seeing what happens.
>
> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> Hi Loic,
>>>
>>> From what I see everything seems OK.
>>
>> Cool. I'll keep going in this direction then !
>>
>>> The interesting thing would be to
>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>> is right now modeling just a single straw bucket not the full CRUSH
>>> algorithm.
>>
>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>
>>> That's the work that remains to be done. The only way that
>>> would avoid reimplementing the CRUSH algorithm and computing the
>>> gradient would be treating CRUSH as a black box and eliminating the
>>> necessity of computing the gradient either by using a gradient-free
>>> optimization method or making an estimation of the gradient.
>>
>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>
>> Cheers
>>
>>>
>>>
>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi,
>>>>
>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>
>>>> Thanks !
>>>>
>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>> disk 0: 1.39e-01 1.12e-01
>>>> disk 1: 1.11e-01 1.10e-01
>>>> disk 2: 8.33e-02 1.13e-01
>>>> disk 3: 1.39e-01 1.11e-01
>>>> disk 4: 1.11e-01 1.11e-01
>>>> disk 5: 8.33e-02 1.11e-01
>>>> disk 6: 1.39e-01 1.12e-01
>>>> disk 7: 1.11e-01 1.12e-01
>>>> disk 8: 8.33e-02 1.10e-01
>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>> ...
>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>> disk 0: 1.39e-01 1.42e-01
>>>> disk 1: 1.11e-01 1.09e-01
>>>> disk 2: 8.33e-02 8.37e-02
>>>> disk 3: 1.39e-01 1.40e-01
>>>> disk 4: 1.11e-01 1.13e-01
>>>> disk 5: 8.33e-02 8.08e-02
>>>> disk 6: 1.39e-01 1.38e-01
>>>> disk 7: 1.11e-01 1.09e-01
>>>> disk 8: 8.33e-02 8.48e-02
>>>>
>>>>
>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>> disk 0: 2.44e-01 2.36e-01
>>>> disk 1: 2.44e-01 2.38e-01
>>>> disk 2: 2.44e-01 2.34e-01
>>>> disk 3: 2.44e-01 2.38e-01
>>>> disk 4: 2.44e-02 5.37e-02
>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>> ...
>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>> disk 0: 2.44e-01 2.46e-01
>>>> disk 1: 2.44e-01 2.44e-01
>>>> disk 2: 2.44e-01 2.41e-01
>>>> disk 3: 2.44e-01 2.45e-01
>>>> disk 4: 2.44e-02 2.33e-02
>>>>
>>>>
>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>
>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>> Hi Pedro,
>>>>>
>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I'm going to experiment with what you did at
>>>>>>
>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>
>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>> Great, thanks for the clarifications.
>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>
>>>>>>> I keep working on it.
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>> haven't made much headway.
>>>>>>>>
>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>> the python notebook I need to replace the iteration over all
>>>>>>>>> possible device permutations with iteration over all the possible
>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>> work on.
>>>>>>>>>
>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>> selecting the same one again.
>>>>>>>>
>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>> round/replica/rank.
>>>>>>>>
>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>> before it is enabled or used.)
>>>>>>>>
>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>> implementation?
>>>>>>>>
>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>> retry_descent.
>>>>>>>>
>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>> adapted)!
>>>>>>>>
>>>>>>>> sage
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-03-23 15:32 ` Pedro López-Adeva
` (2 preceding siblings ...)
2017-04-11 15:22 ` Loic Dachary
@ 2017-04-22 16:51 ` Loic Dachary
2017-04-25 15:04 ` Pedro López-Adeva
3 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-22 16:51 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: ceph-devel
Hi Pedro,
I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
Before optimization the situation is:
~name~      ~expected~  ~objects~  ~delta~    ~delta%~
dc1               1024       1024        0    0.000000
host0              256        294       38   14.843750
device0            128        153       25   19.531250
device1            128        141       13   10.156250
host1              256        301       45   17.578125
device2            128        157       29   22.656250
device3            128        144       16   12.500000
host2              512        429      -83  -16.210938
device4            128         96      -32  -25.000000
device5            128        117      -11   -8.593750
device6            256        216      -40  -15.625000
and after optimization we have the following:
~name~      ~expected~  ~objects~  ~delta~    ~delta%~
dc1               1024       1024        0    0.000000
host0              256        259        3    1.171875
device0            128        129        1    0.781250
device1            128        130        2    1.562500
host1              256        258        2    0.781250
device2            128        129        1    0.781250
device3            128        129        1    0.781250
host2              512        507       -5   -0.976562
device4            128        126       -2   -1.562500
device5            128        127       -1   -0.781250
device6            256        254       -2   -0.781250
Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
Cheers
[1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
[2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
[3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
[4] https://github.com/ceph/ceph/pull/14486
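For concreteness, the setup described above looks roughly like this. This is only a sketch: `run_crush` is replaced by a plain multinomial draw standing in for the real simulation, and the device weights are invented for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical layout for illustration: the last device has twice
# the weight of the others.
target_weights = np.array([128.0, 128.0, 128.0, 128.0, 256.0])
num_objects = 1024
expected = num_objects * target_weights / target_weights.sum()

def run_crush(weights):
    # Stand-in for the real CRUSH simulation: multinomial placement
    # of num_objects over the devices, proportional to the weights.
    p = np.clip(weights, 1e-9, None)
    return rng.multinomial(num_objects, p / p.sum())

def loss(weights):
    # Standard deviation of (expected - actual) object counts per
    # device, the loss function described in the message above.
    return float(np.std(expected - run_crush(weights)))

# Initial guess = the target weights, method = Nelder-Mead.
res = minimize(loss, x0=target_weights, method="Nelder-Mead")
optimized_weights = res.x
```

Because the loss is stochastic (a fresh simulation on every call), Nelder-Mead can stall or wander here — which is exactly the reliability question this thread raises.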
On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
>
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies
>         # and current ones
>         return loss(sim_freqs, desired_freqs)
>     return target
>
> # scipy.optimize.minimize needs an initial guess (x0)
> res = scipy.optimize.minimize(build_target(desired_freqs), x0=initial_weights)
> weights = res.x
> - END CODE -
>
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is especially true if the minimize method
> attempts to approximate the gradient using finite differences, since it
> will evaluate the target function a number of times proportional to
> the number of weights. Apart from the ones in scipy I would also try
> optimization methods that try to perform as few evaluations as
> possible, for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
>
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worth trying even if
> it's not the final one, because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
>
> I can try this black box approximation this weekend, not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation for
> another and seeing what happens.
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-22 16:51 ` Loic Dachary
@ 2017-04-25 15:04 ` Pedro López-Adeva
2017-04-25 17:46 ` Loic Dachary
2017-04-26 21:08 ` Loic Dachary
0 siblings, 2 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-04-25 15:04 UTC (permalink / raw)
To: Loic Dachary; +Cc: Ceph Development
Hi Loic,
Well, the results are certainly better! Some comments:
- I'm glad Nelder-Mead worked. It's not the one I would have chosen,
but I'm not an expert in optimization either. I wonder how it
will scale with more weights[1]. My attempt at using scipy's optimize
didn't work because you are optimizing a stochastic function, and this
can make scipy decide that no further steps are possible. The
field that studies this kind of problem is stochastic optimization
[2]
- I used KL divergence for the loss function. My first attempt was
using, as you did, the standard deviation (more commonly known as L2 loss)
with gradient descent, but it didn't work very well.
- Sum of differences sounds like a bad idea: +100 and -100 errors will
cancel out. Worse still, -100 and -100 will score better than 0 and 0.
Maybe you were talking about the absolute value of the differences?
- Well, now that CRUSH can use multiple weights, the problem that
remains, I think, is seeing whether the optimization is: a) reliable
and b) fast enough
Cheers,
Pedro.
[1] http://www.benfrederickson.com/numerical-optimization/
[2] https://en.wikipedia.org/wiki/Stochastic_optimization
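The cancellation problem with the sum of differences is easy to demonstrate with a tiny example (illustrative numbers only):

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q): zero iff the two distributions match, positive otherwise.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

expected = np.array([400.0, 400.0, 224.0])
perfect = np.array([400.0, 400.0, 224.0])
skewed = np.array([500.0, 300.0, 224.0])  # +100 and -100 errors

# Sum of differences cannot tell these apart: both come out to 0.
sum_perfect = float(np.sum(expected - perfect))
sum_skewed = float(np.sum(expected - skewed))

# KL divergence on the normalized frequencies does distinguish them.
kl_perfect = kl_divergence(expected / expected.sum(), perfect / perfect.sum())
kl_skewed = kl_divergence(expected / expected.sum(), skewed / skewed.sum())
```

Here `sum_perfect` and `sum_skewed` are both 0 while `kl_skewed` is strictly positive; the absolute value (L1) or the standard deviation (L2) of the differences would also distinguish the two cases.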
2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>
> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>
> Before optimization the situation is:
>
> ~name~      ~expected~  ~objects~  ~delta~    ~delta%~
> dc1               1024       1024        0    0.000000
> host0              256        294       38   14.843750
> device0            128        153       25   19.531250
> device1            128        141       13   10.156250
> host1              256        301       45   17.578125
> device2            128        157       29   22.656250
> device3            128        144       16   12.500000
> host2              512        429      -83  -16.210938
> device4            128         96      -32  -25.000000
> device5            128        117      -11   -8.593750
> device6            256        216      -40  -15.625000
>
> and after optimization we have the following:
>
> ~name~      ~expected~  ~objects~  ~delta~    ~delta%~
> dc1               1024       1024        0    0.000000
> host0              256        259        3    1.171875
> device0            128        129        1    0.781250
> device1            128        130        2    1.562500
> host1              256        258        2    0.781250
> device2            128        129        1    0.781250
> device3            128        129        1    0.781250
> host2              512        507       -5   -0.976562
> device4            128        126       -2   -1.562500
> device5            128        127       -1   -0.781250
> device6            256        254       -2   -0.781250
>
> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>
> Cheers
>
> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
> [4] https://github.com/ceph/ceph/pull/14486
>
>>>>> disk 7: 1.11e-01 1.09e-01
>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>
>>>>>
>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>> ------------------------------------------------------------------------
>>>>> Before: All replicas on each hard drive
>>>>> Expected vs actual use (20000 samples)
>>>>> disk 0: 2.44e-01 2.36e-01
>>>>> disk 1: 2.44e-01 2.38e-01
>>>>> disk 2: 2.44e-01 2.34e-01
>>>>> disk 3: 2.44e-01 2.38e-01
>>>>> disk 4: 2.44e-02 5.37e-02
>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>> ...
>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>> Converged to desired accuracy :)
>>>>> After: All replicas on each hard drive
>>>>> Expected vs actual use (20000 samples)
>>>>> disk 0: 2.44e-01 2.46e-01
>>>>> disk 1: 2.44e-01 2.44e-01
>>>>> disk 2: 2.44e-01 2.41e-01
>>>>> disk 3: 2.44e-01 2.45e-01
>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>
>>>>>
>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>
>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> I'm going to experiment with what you did at
>>>>>>>
>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>
>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>> Great, thanks for the clarifications.
>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>
>>>>>>>> I keep working on it.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>> haven't made much headway.
>>>>>>>>>
>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>> possible device permutations with iteration over all the possible
>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>> work on.
>>>>>>>>>>
>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>> selecting the same one again.
>>>>>>>>>
>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>> round/replica/rank.
>>>>>>>>>
>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>> before it is enabled or used.)
>>>>>>>>>
>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>> implementation?
>>>>>>>>>
>>>>>>>>> The main difference is that the "retry_bucket" behavior was found to be a bad
>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>> retry_descent.
>>>>>>>>>
>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>> adapted)!
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-25 15:04 ` Pedro López-Adeva
@ 2017-04-25 17:46 ` Loic Dachary
2017-04-26 21:08 ` Loic Dachary
1 sibling, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-04-25 17:46 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Ceph Development
Hi Pedro,
On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
> Hi Loic,
>
> Well, the results are better certainly! Some comments:
>
> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
> but I'm not an expert in optimization either. I wonder how it
> will scale with more weights[1]. My attempt at using scipy's optimize
> didn't work because you are optimizing a stochastic function and this
> can make scipy decide that no further steps are possible.
Understood (I think). Do you have an opinion on which one of the following would be a better fit ?
minimize(method='Powell')
minimize(method='CG')
minimize(method='BFGS')
minimize(method='Newton-CG')
minimize(method='L-BFGS-B')
minimize(method='TNC')
minimize(method='COBYLA')
minimize(method='SLSQP')
minimize(method='dogleg')
minimize(method='trust-ncg')
> The
> field that studies this kind of problem is stochastic optimization
> [2]
Unless I'm mistaken there are no tools related to that kind of problem in scipy, right ? I'll keep using scipy anyway because, as you wrote in your previous mail, it will be helpful to know if it works or not. Even if it takes so much time that it's not practical to use, it will tell us if computing the gradient of the CRUSH algorithm is a lost cause or not :-)
> - I used KL divergence for the loss function. My first attempt was
> using, as you did, the standard deviation (more commonly known as L2 loss) with
> gradient descent, but it didn't work very well.
>
> - Sum of differences sounds like a bad idea, +100 and -100 errors will
> cancel out. Worse still -100 and -100 will be better than 0 and 0.
> Maybe you were talking about the absolute value of the differences?
I was not thinking straight, to be honest.
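A quick numpy check of the point about losses (standard definitions, not the exact code from the thread): the plain sum of differences cancels errors out, while L1, L2 and KL divergence do not:

```python
import numpy as np

# Toy expected vs simulated frequencies for 4 devices.
expected = np.array([0.4, 0.3, 0.2, 0.1])
actual   = np.array([0.3, 0.4, 0.2, 0.1])   # two errors that cancel

sum_diff = np.sum(actual - expected)                     # ~0 -- hides the error
l1       = np.sum(np.abs(actual - expected))             # total absolute error
l2       = np.std(actual - expected)                     # standard deviation of errors
kl       = np.sum(expected * np.log(expected / actual))  # KL divergence

print(sum_diff, l1, l2, kl)
```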
> - Well, now that CRUSH can use multiple weights the problem that
> remains I think is seeing if the optimization problem is: a) reliable
> and b) fast enough
Yep. I'll implement something and let you know how it goes.
Cheers
>
> Cheers,
> Pedro.
>
> [1] http://www.benfrederickson.com/numerical-optimization/
> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>
> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>
>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>
>> Before optimization the situation is:
>>
>> ~expected~ ~objects~ ~delta~ ~delta%~
>> ~name~
>> dc1 1024 1024 0 0.000000
>> host0 256 294 38 14.843750
>> device0 128 153 25 19.531250
>> device1 128 141 13 10.156250
>> host1 256 301 45 17.578125
>> device2 128 157 29 22.656250
>> device3 128 144 16 12.500000
>> host2 512 429 -83 -16.210938
>> device4 128 96 -32 -25.000000
>> device5 128 117 -11 -8.593750
>> device6 256 216 -40 -15.625000
>>
>> and after optimization we have the following:
>>
>> ~expected~ ~objects~ ~delta~ ~delta%~
>> ~name~
>> dc1 1024 1024 0 0.000000
>> host0 256 259 3 1.171875
>> device0 128 129 1 0.781250
>> device1 128 130 2 1.562500
>> host1 256 258 2 0.781250
>> device2 128 129 1 0.781250
>> device3 128 129 1 0.781250
>> host2 512 507 -5 -0.976562
>> device4 128 126 -2 -1.562500
>> device5 128 127 -1 -0.781250
>> device6 256 254 -2 -0.781250
>>
>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>
>> Cheers
>>
>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>> [4] https://github.com/ceph/ceph/pull/14486
>>
>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>> There are a lot of gradient-free methods. I will first try to run the
>>> ones available using just scipy
>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> Some of them don't require the gradient and some of them can estimate
>>> it. The reason to go without the gradient is to run the CRUSH
>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>
>>> - BEGIN CODE -
>>> def build_target(desired_freqs):
>>>     def target(weights):
>>>         # run a simulation of CRUSH for a number of objects
>>>         sim_freqs = run_crush(weights)
>>>         # Kullback-Leibler divergence between desired frequencies and
>>>         # current ones
>>>         return loss(sim_freqs, desired_freqs)
>>>     return target
>>>
>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> - END CODE -
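Fleshed out into a runnable sketch (assumptions: `run_crush` here is a trivial weighted multinomial draw rather than the real CRUSH algorithm, and the KL loss, epsilon, and solver options are illustrative choices, not from the thread):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
desired_freqs = np.array([0.4, 0.3, 0.2, 0.1])

def run_crush(weights, n_objects=100000):
    # Toy stand-in for the CRUSH simulation: place n_objects by a plain
    # weighted draw and return the observed per-device frequencies.
    w = np.abs(weights)
    counts = rng.multinomial(n_objects, w / w.sum())
    return counts / n_objects

def loss(sim_freqs, desired_freqs):
    # Kullback-Leibler divergence, with an epsilon so empty bins are safe.
    eps = 1e-9
    return float(np.sum(desired_freqs *
                        np.log((desired_freqs + eps) / (sim_freqs + eps))))

def build_target(desired_freqs):
    def target(weights):
        return loss(run_crush(weights), desired_freqs)
    return target

res = minimize(build_target(desired_freqs), x0=np.ones(4),
               method='nelder-mead',
               options={'xatol': 1e-4, 'fatol': 1e-5, 'maxiter': 2000})
p_hat = np.abs(res.x) / np.abs(res.x).sum()
print(p_hat)
```

Because the target is stochastic, the solver stops at a noise floor set by `n_objects`; the normalized weights should nevertheless land close to `desired_freqs`.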
>>>
>>> The tricky thing here is that this procedure can be slow if the
>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>> simulated frequencies. This is true especially if the minimize method
>>> attempts to approximate the gradient using finite differences since it
>>> will evaluate the target function a number of times proportional to
>>> the number of weights. Apart from the ones in scipy I would also try
>>> optimization methods that try to perform as few evaluations as
>>> possible like for example HyperOpt
>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>> account that the target function can be noisy.
>>>
>>> This black box approximation is simple to implement and makes the
>>> computer do all the work instead of us.
>>> I think that this black box approximation is worth trying even if
>>> it's not the final one because if this approximation works then we
>>> know that a more elaborate one that computes the gradient of the CRUSH
>>> algorithm will work for sure.
>>>
>>> I can try this black box approximation this weekend not on the real
>>> CRUSH algorithm but with the simple implementation I did in python. If
>>> it works it's just a matter of substituting one simulation with
>>> another and seeing what happens.
>>>
>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>> Hi Loic,
>>>>>
>>>>> From what I see everything seems OK.
>>>>
>>>> Cool. I'll keep going in this direction then !
>>>>
>>>>> The interesting thing would be to
>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>> algorithm.
>>>>
>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>
>>>>> That's the work that remains to be done. The only way that
>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>> necessity of computing the gradient either by using a gradient-free
>>>>> optimization method or making an estimation of the gradient.
>>>>
>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>
>>>> Cheers
>>>>
>>>>>
>>>>>
>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi,
>>>>>>
>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 1.39e-01 1.12e-01
>>>>>> disk 1: 1.11e-01 1.10e-01
>>>>>> disk 2: 8.33e-02 1.13e-01
>>>>>> disk 3: 1.39e-01 1.11e-01
>>>>>> disk 4: 1.11e-01 1.11e-01
>>>>>> disk 5: 8.33e-02 1.11e-01
>>>>>> disk 6: 1.39e-01 1.12e-01
>>>>>> disk 7: 1.11e-01 1.12e-01
>>>>>> disk 8: 8.33e-02 1.10e-01
>>>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>>> ...
>>>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 1.39e-01 1.42e-01
>>>>>> disk 1: 1.11e-01 1.09e-01
>>>>>> disk 2: 8.33e-02 8.37e-02
>>>>>> disk 3: 1.39e-01 1.40e-01
>>>>>> disk 4: 1.11e-01 1.13e-01
>>>>>> disk 5: 8.33e-02 8.08e-02
>>>>>> disk 6: 1.39e-01 1.38e-01
>>>>>> disk 7: 1.11e-01 1.09e-01
>>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>>
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 2.44e-01 2.36e-01
>>>>>> disk 1: 2.44e-01 2.38e-01
>>>>>> disk 2: 2.44e-01 2.34e-01
>>>>>> disk 3: 2.44e-01 2.38e-01
>>>>>> disk 4: 2.44e-02 5.37e-02
>>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>>> ...
>>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 2.44e-01 2.46e-01
>>>>>> disk 1: 2.44e-01 2.44e-01
>>>>>> disk 2: 2.44e-01 2.41e-01
>>>>>> disk 3: 2.44e-01 2.45e-01
>>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>>
>>>>>>
>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>
>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>
>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>
>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>
>>>>>>>>> I keep working on it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>>> haven't made much headway.
>>>>>>>>>>
>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>> possible device permutations with iteration over all the possible
>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>> work on.
>>>>>>>>>>>
>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>
>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>> round/replica/rank.
>>>>>>>>>>
>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>
>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>> implementation?
>>>>>>>>>>
>>>>>>>>>> The main difference is that the "retry_bucket" behavior was found to be a bad
>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>> retry_descent.
>>>>>>>>>>
>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>> adapted)!
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-25 15:04 ` Pedro López-Adeva
2017-04-25 17:46 ` Loic Dachary
@ 2017-04-26 21:08 ` Loic Dachary
2017-04-26 22:25 ` Loic Dachary
1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-26 21:08 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Ceph Development
On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
> Hi Loic,
>
> Well, the results are better certainly! Some comments:
>
> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
> but I'm not an expert in optimization either. I wonder how it
> will scale with more weights[1]. My attempt at using scipy's optimize
> didn't work because you are optimizing a stochastic function and this
> can make scipy decide that no further steps are possible. The
> field that studies this kind of problem is stochastic optimization
> [2]
You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
Even in a case as simple as 12 devices starting with:
~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
host1 2560.000000 2580 20.000000 0.781250 24
device12 106.666667 101 -5.666667 -5.312500 1
device13 213.333333 221 7.666667 3.593750 2
device14 320.000000 317 -3.000000 -0.937500 3
device15 106.666667 101 -5.666667 -5.312500 1
device16 213.333333 217 3.666667 1.718750 2
device17 320.000000 342 22.000000 6.875000 3
device18 106.666667 102 -4.666667 -4.375000 1
device19 213.333333 243 29.666667 13.906250 2
device20 320.000000 313 -7.000000 -2.187500 3
device21 106.666667 94 -12.666667 -11.875000 1
device22 213.333333 208 -5.333333 -2.500000 2
device23 320.000000 321 1.000000 0.312500 3
res = minimize(crush, weights, method='nelder-mead',
               options={'xtol': 1e-8, 'disp': True})
device weights [ 1. 3. 3. 2. 3. 2. 2. 1. 3. 1. 1. 2.]
device kl 0.00117274995028
...
device kl 0.00016530695476
Optimization terminated successfully.
Current function value: 0.000165
Iterations: 117
Function evaluations: 470
we still get a 5% difference on device 21:
~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
host1 2560.000000 2559 -1.000000 -0.039062 23.805183
device12 106.666667 103 -3.666667 -3.437500 1.016999
device13 213.333333 214 0.666667 0.312500 1.949328
device14 320.000000 325 5.000000 1.562500 3.008688
device15 106.666667 106 -0.666667 -0.625000 1.012565
device16 213.333333 214 0.666667 0.312500 1.976344
device17 320.000000 320 0.000000 0.000000 2.845135
device18 106.666667 102 -4.666667 -4.375000 1.039181
device19 213.333333 214 0.666667 0.312500 1.820435
device20 320.000000 324 4.000000 1.250000 3.062573
device21 106.666667 101 -5.666667 -5.312500 1.071341
device22 213.333333 212 -1.333333 -0.625000 2.039190
device23 320.000000 324 4.000000 1.250000 3.016468
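A rough sanity check of how large the sampling noise is here (my arithmetic, not from the thread): with only ~2560 objects on the host, the one-sigma binomial band for each device's count is around 5-10% of its expected value, so a 5% residual on device21 is within noise:

```python
import math

n = 2560  # objects placed on host1 in the example above
ratios = []
for expected in (106.666667, 213.333333, 320.0):
    p = expected / n
    sigma = math.sqrt(n * p * (1 - p))   # binomial standard deviation
    ratios.append(sigma / expected)
    print(f"expected {expected:7.1f}  one sigma {sigma:5.1f}  ({100 * sigma / expected:.1f}%)")
```

This suggests the optimizer is converging to roughly the best that can be measured with this sample size, and that judging convergence needs either more samples or a tolerance scaled to this noise floor.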
> - I used KL divergence for the loss function. My first attempt was
> using, as you did, the standard deviation (more commonly known as L2 loss) with
> gradient descent, but it didn't work very well.
>
> - Sum of differences sounds like a bad idea, +100 and -100 errors will
> cancel out. Worse still -100 and -100 will be better than 0 and 0.
> Maybe you were talking about the absolute value of the differences?
>
> - Well, now that CRUSH can use multiple weights the problem that
> remains I think is seeing if the optimization problem is: a) reliable
> and b) fast enough
>
> Cheers,
> Pedro.
>
> [1] http://www.benfrederickson.com/numerical-optimization/
> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>
> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>
>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>
>> Before optimization the situation is:
>>
>> ~expected~ ~objects~ ~delta~ ~delta%~
>> ~name~
>> dc1 1024 1024 0 0.000000
>> host0 256 294 38 14.843750
>> device0 128 153 25 19.531250
>> device1 128 141 13 10.156250
>> host1 256 301 45 17.578125
>> device2 128 157 29 22.656250
>> device3 128 144 16 12.500000
>> host2 512 429 -83 -16.210938
>> device4 128 96 -32 -25.000000
>> device5 128 117 -11 -8.593750
>> device6 256 216 -40 -15.625000
>>
>> and after optimization we have the following:
>>
>> ~expected~ ~objects~ ~delta~ ~delta%~
>> ~name~
>> dc1 1024 1024 0 0.000000
>> host0 256 259 3 1.171875
>> device0 128 129 1 0.781250
>> device1 128 130 2 1.562500
>> host1 256 258 2 0.781250
>> device2 128 129 1 0.781250
>> device3 128 129 1 0.781250
>> host2 512 507 -5 -0.976562
>> device4 128 126 -2 -1.562500
>> device5 128 127 -1 -0.781250
>> device6 256 254 -2 -0.781250
>>
>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>
>> Cheers
>>
>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>> [4] https://github.com/ceph/ceph/pull/14486
>>
>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>> There are a lot of gradient-free methods. I will first try to run the
>>> ones available using just scipy
>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> Some of them don't require the gradient and some of them can estimate
>>> it. The reason to go without the gradient is to run the CRUSH
>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>
>>> - BEGIN CODE -
>>> def build_target(desired_freqs):
>>>     def target(weights):
>>>         # run a simulation of CRUSH for a number of objects
>>>         sim_freqs = run_crush(weights)
>>>         # Kullback-Leibler divergence between desired frequencies and
>>>         # current ones
>>>         return loss(sim_freqs, desired_freqs)
>>>     return target
>>>
>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> - END CODE -
>>>
>>> The tricky thing here is that this procedure can be slow if the
>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>> simulated frequencies. This is true especially if the minimize method
>>> attempts to approximate the gradient using finite differences since it
>>> will evaluate the target function a number of times proportional to
>>> the number of weights. Apart from the ones in scipy I would also try
>>> optimization methods that try to perform as few evaluations as
>>> possible like for example HyperOpt
>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>> account that the target function can be noisy.
>>>
>>> This black box approximation is simple to implement and makes the
>>> computer do all the work instead of us.
>>> I think that this black box approximation is worth trying even if
>>> it's not the final one because if this approximation works then we
>>> know that a more elaborate one that computes the gradient of the CRUSH
>>> algorithm will work for sure.
>>>
>>> I can try this black box approximation this weekend not on the real
>>> CRUSH algorithm but with the simple implementation I did in python. If
>>> it works it's just a matter of substituting one simulation with
>>> another and see what happens.
>>>
>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>> Hi Loic,
>>>>>
>> From what I see everything seems OK.
>>>>
>>>> Cool. I'll keep going in this direction then !
>>>>
>>>>> The interesting thing would be to
>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>> algorithm.
>>>>
>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
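[Editor's note: the single straw bucket discussed above can be sketched in a few lines. This is a hedged illustration of the straw2 idea (each item draws a pseudo-random straw scaled by its weight and the longest straw wins), not libcrush's actual code; `straw2_pick` is an invented name, and Python's seeded PRNG stands in for CRUSH's deterministic hash of the object id.]

```python
import math
import random

def straw2_pick(weights, rng):
    """Pick one index: item i draws ln(u) / w_i and the largest draw wins.

    Since -ln(u) / w is exponentially distributed with rate w, the argmax
    selects item i with probability proportional to w_i.
    """
    best, best_straw = None, None
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()          # u in (0, 1], so log(u) is finite
        straw = math.log(u) / w
        if best_straw is None or straw > best_straw:
            best, best_straw = i, straw
    return best

if __name__ == "__main__":
    rng = random.Random(42)
    weights = [10, 10, 10, 10, 1]
    total = sum(weights)
    n = 20000
    counts = [0] * len(weights)
    for _ in range(n):
        counts[straw2_pick(weights, rng)] += 1
    for i, w in enumerate(weights):
        print("item %d: expected %.4f actual %.4f" % (i, w / total, counts[i] / n))
```

Recursing into a chosen host to place the replica on a disk would just repeat the same draw over the host's devices.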
>>>>
>>>>> That's the work that remains to be done. The only way that
>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>> necessity of computing the gradient either by using a gradient-free
>>>>> optimization method or making an estimation of the gradient.
>>>>
>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>
>>>> Cheers
>>>>
>>>>>
>>>>>
>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi,
>>>>>>
>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 1.39e-01 1.12e-01
>>>>>> disk 1: 1.11e-01 1.10e-01
>>>>>> disk 2: 8.33e-02 1.13e-01
>>>>>> disk 3: 1.39e-01 1.11e-01
>>>>>> disk 4: 1.11e-01 1.11e-01
>>>>>> disk 5: 8.33e-02 1.11e-01
>>>>>> disk 6: 1.39e-01 1.12e-01
>>>>>> disk 7: 1.11e-01 1.12e-01
>>>>>> disk 8: 8.33e-02 1.10e-01
>>>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>>> ...
>>>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 1.39e-01 1.42e-01
>>>>>> disk 1: 1.11e-01 1.09e-01
>>>>>> disk 2: 8.33e-02 8.37e-02
>>>>>> disk 3: 1.39e-01 1.40e-01
>>>>>> disk 4: 1.11e-01 1.13e-01
>>>>>> disk 5: 8.33e-02 8.08e-02
>>>>>> disk 6: 1.39e-01 1.38e-01
>>>>>> disk 7: 1.11e-01 1.09e-01
>>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>>
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 2.44e-01 2.36e-01
>>>>>> disk 1: 2.44e-01 2.38e-01
>>>>>> disk 2: 2.44e-01 2.34e-01
>>>>>> disk 3: 2.44e-01 2.38e-01
>>>>>> disk 4: 2.44e-02 5.37e-02
>>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>>> ...
>>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>> disk 0: 2.44e-01 2.46e-01
>>>>>> disk 1: 2.44e-01 2.44e-01
>>>>>> disk 2: 2.44e-01 2.41e-01
>>>>>> disk 3: 2.44e-01 2.45e-01
>>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>>
>>>>>>
>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>
>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>
>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>
>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>
>>>>>>>>> I keep working on it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>>> haven't made much headway.
>>>>>>>>>>
>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>> work on.
>>>>>>>>>>>
>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>
>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>> round/replica/rank.
>>>>>>>>>>
>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>
>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>> implementation?
>>>>>>>>>>
>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>> retry_descent.
>>>>>>>>>>
>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>> adapted)!
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-26 21:08 ` Loic Dachary
@ 2017-04-26 22:25 ` Loic Dachary
2017-04-27 6:12 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-26 22:25 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Ceph Development
It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.
We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
dc1 102400 102400 0 0.000000 1008
host0 2438 2390 -48 -1.968827 24
host1 2438 2370 -68 -2.789171 24
host2 2438 2493 55 2.255947 24
host3 2438 2396 -42 -1.722724 24
host4 2438 2497 59 2.420016 24
host5 2438 2520 82 3.363413 24
host6 2438 2500 62 2.543068 24
host7 2438 2380 -58 -2.378999 24
host8 2438 2488 50 2.050861 24
host9 2438 2435 -3 -0.123052 24
host10 2438 2440 2 0.082034 24
host11 2438 2472 34 1.394586 24
host12 2438 2346 -92 -3.773585 24
host13 2438 2411 -27 -1.107465 24
host14 2438 2513 75 3.076292 24
host15 2438 2421 -17 -0.697293 24
host16 2438 2469 31 1.271534 24
host17 2438 2419 -19 -0.779327 24
host18 2438 2424 -14 -0.574241 24
host19 2438 2451 13 0.533224 24
host20 2438 2486 48 1.968827 24
host21 2438 2439 1 0.041017 24
host22 2438 2482 44 1.804758 24
host23 2438 2415 -23 -0.943396 24
host24 2438 2389 -49 -2.009844 24
host25 2438 2265 -173 -7.095980 24
host26 2438 2374 -64 -2.625103 24
host27 2438 2529 91 3.732568 24
host28 2438 2495 57 2.337982 24
host29 2438 2433 -5 -0.205086 24
host30 2438 2485 47 1.927810 24
host31 2438 2377 -61 -2.502051 24
host32 2438 2441 3 0.123052 24
host33 2438 2421 -17 -0.697293 24
host34 2438 2359 -79 -3.240361 24
host35 2438 2509 71 2.912223 24
host36 2438 2425 -13 -0.533224 24
host37 2438 2419 -19 -0.779327 24
host38 2438 2403 -35 -1.435603 24
host39 2438 2458 20 0.820345 24
host40 2438 2458 20 0.820345 24
host41 2438 2503 65 2.666120 24
~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
dc1 102400 102400 0 0.000000 1008
host0 2438 2438 0 0.000000 24.559919
host1 2438 2438 0 0.000000 24.641221
host2 2438 2440 2 0.082034 23.486113
host3 2438 2437 -1 -0.041017 24.525875
host4 2438 2436 -2 -0.082034 23.644304
host5 2438 2440 2 0.082034 23.245287
host6 2438 2442 4 0.164069 23.617162
host7 2438 2439 1 0.041017 24.746174
host8 2438 2436 -2 -0.082034 23.584667
host9 2438 2439 1 0.041017 24.140637
host10 2438 2438 0 0.000000 24.060084
host11 2438 2441 3 0.123052 23.730349
host12 2438 2437 -1 -0.041017 24.948602
host13 2438 2437 -1 -0.041017 24.280851
host14 2438 2436 -2 -0.082034 23.402216
host15 2438 2436 -2 -0.082034 24.272037
host16 2438 2437 -1 -0.041017 23.747867
host17 2438 2436 -2 -0.082034 24.266271
host18 2438 2438 0 0.000000 24.158545
host19 2438 2440 2 0.082034 23.934788
host20 2438 2438 0 0.000000 23.630851
host21 2438 2435 -3 -0.123052 24.001950
host22 2438 2440 2 0.082034 23.623120
host23 2438 2437 -1 -0.041017 24.343138
host24 2438 2438 0 0.000000 24.595820
host25 2438 2439 1 0.041017 25.547510
host26 2438 2437 -1 -0.041017 24.753111
host27 2438 2437 -1 -0.041017 23.288606
host28 2438 2437 -1 -0.041017 23.425059
host29 2438 2438 0 0.000000 24.115941
host30 2438 2441 3 0.123052 23.560539
host31 2438 2438 0 0.000000 24.459911
host32 2438 2440 2 0.082034 24.096746
host33 2438 2437 -1 -0.041017 24.241316
host34 2438 2438 0 0.000000 24.715044
host35 2438 2436 -2 -0.082034 23.424601
host36 2438 2436 -2 -0.082034 24.123606
host37 2438 2439 1 0.041017 24.368997
host38 2438 2440 2 0.082034 24.331532
host39 2438 2439 1 0.041017 23.803561
host40 2438 2437 -1 -0.041017 23.861094
host41 2438 2442 4 0.164069 23.468473
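[Editor's note: the kl figures quoted in this message can be reproduced from per-host counts along these lines. This is a sketch with invented names (`kl_divergence`) and toy data; the exact loss used in the experiment may differ slightly.]

```python
import math

def kl_divergence(expected, actual):
    """KL(expected || actual) between two count vectors, after normalizing."""
    te, ta = float(sum(expected)), float(sum(actual))
    return sum((e / te) * math.log((e / te) / (a / ta))
               for e, a in zip(expected, actual) if e > 0)

if __name__ == "__main__":
    # Toy data in the spirit of the tables above: equal expected counts,
    # actual counts before and after optimization.
    expected = [2438] * 4
    actual_before = [2390, 2370, 2493, 2499]
    actual_after = [2438, 2438, 2440, 2436]
    print(kl_divergence(expected, actual_before))  # larger
    print(kl_divergence(expected, actual_after))   # much smaller
```

KL is zero only when the two normalized distributions match exactly, which is why it drops by several orders of magnitude as the actual counts converge on the expected ones.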
On 04/26/2017 11:08 PM, Loic Dachary wrote:
>
>
> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>> Hi Loic,
>>
>> Well, the results are better certainly! Some comments:
>>
>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
>> but I'm not an expert in optimization either. I wonder how it
>> will scale with more weights[1]. My attempt at using scipy's optimize
>> didn't work because you are optimizing a stochastic function and this
>> can make scipy decide that no further steps are possible. The
>> field that studies this kind of problem is stochastic optimization
>> [2]
>
> You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
>
> Even in a case as simple as 12 devices starting with:
>
> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
> host1 2560.000000 2580 20.000000 0.781250 24
> device12 106.666667 101 -5.666667 -5.312500 1
> device13 213.333333 221 7.666667 3.593750 2
> device14 320.000000 317 -3.000000 -0.937500 3
> device15 106.666667 101 -5.666667 -5.312500 1
> device16 213.333333 217 3.666667 1.718750 2
> device17 320.000000 342 22.000000 6.875000 3
> device18 106.666667 102 -4.666667 -4.375000 1
> device19 213.333333 243 29.666667 13.906250 2
> device20 320.000000 313 -7.000000 -2.187500 3
> device21 106.666667 94 -12.666667 -11.875000 1
> device22 213.333333 208 -5.333333 -2.500000 2
> device23 320.000000 321 1.000000 0.312500 3
>
> res = minimize(crush, weights, method='nelder-mead',
> options={'xtol': 1e-8, 'disp': True})
>
> device weights [ 1. 3. 3. 2. 3. 2. 2. 1. 3. 1. 1. 2.]
> device kl 0.00117274995028
> ...
> device kl 0.00016530695476
> Optimization terminated successfully.
> Current function value: 0.000165
> Iterations: 117
> Function evaluations: 470
>
> we still get a 5% difference on device 21:
>
> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
> host1 2560.000000 2559 -1.000000 -0.039062 23.805183
> device12 106.666667 103 -3.666667 -3.437500 1.016999
> device13 213.333333 214 0.666667 0.312500 1.949328
> device14 320.000000 325 5.000000 1.562500 3.008688
> device15 106.666667 106 -0.666667 -0.625000 1.012565
> device16 213.333333 214 0.666667 0.312500 1.976344
> device17 320.000000 320 0.000000 0.000000 2.845135
> device18 106.666667 102 -4.666667 -4.375000 1.039181
> device19 213.333333 214 0.666667 0.312500 1.820435
> device20 320.000000 324 4.000000 1.250000 3.062573
> device21 106.666667 101 -5.666667 -5.312500 1.071341
> device22 213.333333 212 -1.333333 -0.625000 2.039190
> device23 320.000000 324 4.000000 1.250000 3.016468
>
>
>> - I used KL divergence for the loss function. My first attempt was
>> using, as you did, the standard deviation (more commonly known as L2 loss) with
>> gradient descent, but it didn't work very well.
>>
>> - Sum of differences sounds like a bad idea, +100 and -100 errors will
>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>> Maybe you were talking about the absolute value of the differences?
>>
>> - Well, now that CRUSH can use multiple weight the problem that
>> remains I think is seeing if the optimization problem is: a) reliable
>> and b) fast enough
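[Editor's note: the cancellation effect Pedro describes is easy to see numerically. A tiny sketch (the loss names are mine) comparing a plain sum of differences with L1 and L2 losses:]

```python
def sum_loss(errors):
    # sum of signed differences: +100 and -100 cancel out
    return sum(errors)

def l1_loss(errors):
    # sum of absolute differences
    return sum(abs(e) for e in errors)

def l2_loss(errors):
    # Euclidean norm of the differences
    return sum(e * e for e in errors) ** 0.5

if __name__ == "__main__":
    perfect = [0, 0]          # no error at all
    cancelling = [100, -100]  # large errors that cancel
    both_low = [-100, -100]   # consistently under target
    for errors in (perfect, cancelling, both_low):
        print(errors, sum_loss(errors), l1_loss(errors), l2_loss(errors))
```

Under `sum_loss` the cancelling case scores the same as perfect and the doubly-under case scores *better*, exactly the pathology described above; L1 and L2 both rank the cases correctly.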
>>
>> Cheers,
>> Pedro.
>>
>> [1] http://www.benfrederickson.com/numerical-optimization/
>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>
>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>> Hi Pedro,
>>>
>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>
>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>>
>>> Before optimization the situation is:
>>>
>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>> ~name~
>>> dc1 1024 1024 0 0.000000
>>> host0 256 294 38 14.843750
>>> device0 128 153 25 19.531250
>>> device1 128 141 13 10.156250
>>> host1 256 301 45 17.578125
>>> device2 128 157 29 22.656250
>>> device3 128 144 16 12.500000
>>> host2 512 429 -83 -16.210938
>>> device4 128 96 -32 -25.000000
>>> device5 128 117 -11 -8.593750
>>> device6 256 216 -40 -15.625000
>>>
>>> and after optimization we have the following:
>>>
>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>> ~name~
>>> dc1 1024 1024 0 0.000000
>>> host0 256 259 3 1.171875
>>> device0 128 129 1 0.781250
>>> device1 128 130 2 1.562500
>>> host1 256 258 2 0.781250
>>> device2 128 129 1 0.781250
>>> device3 128 129 1 0.781250
>>> host2 512 507 -5 -0.976562
>>> device4 128 126 -2 -1.562500
>>> device5 128 127 -1 -0.781250
>>> device6 256 254 -2 -0.781250
>>>
>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>
>>> Cheers
>>>
>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>> [4] https://github.com/ceph/ceph/pull/14486
>>>
>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>> There are a lot of gradient-free methods. I will try first to run the
>>>> ones available using just scipy
>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>> Some of them don't require the gradient and some of them can estimate
>>>> it. The reason to go without the gradient is to run the CRUSH
>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>
>>>> - BEGIN CODE -
>>>> def build_target(desired_freqs):
>>>>     def target(weights):
>>>>         # run a simulation of CRUSH for a number of objects
>>>>         sim_freqs = run_crush(weights)
>>>>         # Kullback-Leibler divergence between desired frequencies and current ones
>>>>         return loss(sim_freqs, desired_freqs)
>>>>     return target
>>>>
>>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>>> - END CODE -
>>>>
>>>> The tricky thing here is that this procedure can be slow if the
>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>> simulated frequencies. This is especially true if the minimize method
>>>> attempts to approximate the gradient using finite differences, since it
>>>> will evaluate the target function a number of times proportional to
>>>> the number of weights. Apart from the ones in scipy I would also try
>>>> optimization methods that try to perform as few evaluations as
>>>> possible, for example HyperOpt
>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>> account that the target function can be noisy.
>>>>
>>>> This black box approximation is simple to implement and makes the
>>>> computer do all the work instead of us.
>>>> I think that this black box approximation is worth trying even if
>>>> it's not the final one because if this approximation works then we
>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>> algorithm will work for sure.
>>>>
>>>> I can try this black box approximation this weekend not on the real
>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>> it works it's just a matter of substituting one simulation with
>>>> another and see what happens.
>>>>
>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>> Hi Loic,
>>>>>>
>>>>>> From what I see everything seems OK.
>>>>>
>>>>> Cool. I'll keep going in this direction then !
>>>>>
>>>>>> The interesting thing would be to
>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>> algorithm.
>>>>>
>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>>
>>>>>> That's the work that remains to be done. The only way that
>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>> optimization method or making an estimation of the gradient.
>>>>>
>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>
>>>>> Cheers
>>>>>
>>>>>>
>>>>>>
>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>
>>>>>>> Thanks !
>>>>>>>
>>>>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>>>>> ------------------------------------------------------------------------
>>>>>>> Before: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>> disk 0: 1.39e-01 1.12e-01
>>>>>>> disk 1: 1.11e-01 1.10e-01
>>>>>>> disk 2: 8.33e-02 1.13e-01
>>>>>>> disk 3: 1.39e-01 1.11e-01
>>>>>>> disk 4: 1.11e-01 1.11e-01
>>>>>>> disk 5: 8.33e-02 1.11e-01
>>>>>>> disk 6: 1.39e-01 1.12e-01
>>>>>>> disk 7: 1.11e-01 1.12e-01
>>>>>>> disk 8: 8.33e-02 1.10e-01
>>>>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>> ...
>>>>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>> Converged to desired accuracy :)
>>>>>>> After: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>> disk 0: 1.39e-01 1.42e-01
>>>>>>> disk 1: 1.11e-01 1.09e-01
>>>>>>> disk 2: 8.33e-02 8.37e-02
>>>>>>> disk 3: 1.39e-01 1.40e-01
>>>>>>> disk 4: 1.11e-01 1.13e-01
>>>>>>> disk 5: 8.33e-02 8.08e-02
>>>>>>> disk 6: 1.39e-01 1.38e-01
>>>>>>> disk 7: 1.11e-01 1.09e-01
>>>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>>>
>>>>>>>
>>>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>>>> ------------------------------------------------------------------------
>>>>>>> Before: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>> disk 0: 2.44e-01 2.36e-01
>>>>>>> disk 1: 2.44e-01 2.38e-01
>>>>>>> disk 2: 2.44e-01 2.34e-01
>>>>>>> disk 3: 2.44e-01 2.38e-01
>>>>>>> disk 4: 2.44e-02 5.37e-02
>>>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>> ...
>>>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>> Converged to desired accuracy :)
>>>>>>> After: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>> disk 0: 2.44e-01 2.46e-01
>>>>>>> disk 1: 2.44e-01 2.44e-01
>>>>>>> disk 2: 2.44e-01 2.41e-01
>>>>>>> disk 3: 2.44e-01 2.45e-01
>>>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>>>
>>>>>>>
>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>
>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>
>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>
>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>
>>>>>>>>>> I keep working on it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>> work on.
>>>>>>>>>>>>
>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>
>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>
>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>
>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>> implementation?
>>>>>>>>>>>
>>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>> retry_descent.
>>>>>>>>>>>
>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>> adapted)!
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-26 22:25 ` Loic Dachary
@ 2017-04-27 6:12 ` Loic Dachary
2017-04-27 16:47 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-27 6:12 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Ceph Development
With 63 hosts instead of 41 we get the same results: from kl 1.9169485575e-04 to kl 3.0384231953e-07, with the maximum difference going from ~8% to ~0.5%. What is interesting (at least to me ;-) is that the weights don't change that much: they all stay in the range ]23,25].
Note that all this optimization is done by changing a single weight per host. It is worth trying again with two different weights (which is what you did in https://github.com/plafl/notebooks/blob/master/replication.ipynb). The weight for the first draw is immutable as it is (i.e. 24) and the weight for the second draw is allowed to change.
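The conditional adjustment derived at the start of this thread (use weight / (total_weight - weight) for the second draw) can be sketched as follows; `second_draw_weights` is an illustrative name, not a function from python-crush:

```python
def second_draw_weights(weights):
    """Adjust weights for the second draw: the probability that an item
    is picked second, given that it was not picked first, is proportional
    to w / (total - w) rather than w / total."""
    total = sum(weights)
    return [w / (total - w) for w in weights]

# Four large devices and one small one: the small device's relative
# weight shrinks on the second draw, which is exactly the correction
# the multipick anomaly calls for.
adjusted = second_draw_weights([10, 10, 10, 10, 1])
print(adjusted)
```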
Before optimization
host0 2400 2345 -55 -2.291667 24
host1 2400 2434 34 1.416667 24
host2 2400 2387 -13 -0.541667 24
host3 2400 2351 -49 -2.041667 24
host4 2400 2423 23 0.958333 24
host5 2400 2456 56 2.333333 24
host6 2400 2450 50 2.083333 24
host7 2400 2307 -93 -3.875000 24
host8 2400 2434 34 1.416667 24
host9 2400 2358 -42 -1.750000 24
host10 2400 2452 52 2.166667 24
host11 2400 2398 -2 -0.083333 24
host12 2400 2359 -41 -1.708333 24
host13 2400 2403 3 0.125000 24
host14 2400 2484 84 3.500000 24
host15 2400 2348 -52 -2.166667 24
host16 2400 2489 89 3.708333 24
host17 2400 2412 12 0.500000 24
host18 2400 2416 16 0.666667 24
host19 2400 2453 53 2.208333 24
host20 2400 2475 75 3.125000 24
host21 2400 2413 13 0.541667 24
host22 2400 2450 50 2.083333 24
host23 2400 2348 -52 -2.166667 24
host24 2400 2355 -45 -1.875000 24
host25 2400 2348 -52 -2.166667 24
host26 2400 2373 -27 -1.125000 24
host27 2400 2470 70 2.916667 24
host28 2400 2449 49 2.041667 24
host29 2400 2420 20 0.833333 24
host30 2400 2406 6 0.250000 24
host31 2400 2376 -24 -1.000000 24
host32 2400 2371 -29 -1.208333 24
host33 2400 2395 -5 -0.208333 24
host34 2400 2351 -49 -2.041667 24
host35 2400 2453 53 2.208333 24
host36 2400 2421 21 0.875000 24
host37 2400 2393 -7 -0.291667 24
host38 2400 2394 -6 -0.250000 24
host39 2400 2322 -78 -3.250000 24
host40 2400 2409 9 0.375000 24
host41 2400 2486 86 3.583333 24
host42 2400 2466 66 2.750000 24
host43 2400 2409 9 0.375000 24
host44 2400 2276 -124 -5.166667 24
host45 2400 2379 -21 -0.875000 24
host46 2400 2394 -6 -0.250000 24
host47 2400 2401 1 0.041667 24
host48 2400 2446 46 1.916667 24
host49 2400 2349 -51 -2.125000 24
host50 2400 2413 13 0.541667 24
host51 2400 2333 -67 -2.791667 24
host52 2400 2387 -13 -0.541667 24
host53 2400 2407 7 0.291667 24
host54 2400 2377 -23 -0.958333 24
host55 2400 2441 41 1.708333 24
host56 2400 2420 20 0.833333 24
host57 2400 2388 -12 -0.500000 24
host58 2400 2460 60 2.500000 24
host59 2400 2394 -6 -0.250000 24
host60 2400 2316 -84 -3.500000 24
host61 2400 2373 -27 -1.125000 24
host62 2400 2362 -38 -1.583333 24
host63 2400 2372 -28 -1.166667 24
After optimization
host0 2400 2403 3 0.125000 24.575153
host1 2400 2401 1 0.041667 23.723316
host2 2400 2402 2 0.083333 24.168746
host3 2400 2399 -1 -0.041667 24.520240
host4 2400 2399 -1 -0.041667 23.911445
host5 2400 2400 0 0.000000 23.606956
host6 2400 2401 1 0.041667 23.714102
host7 2400 2400 0 0.000000 25.008463
host8 2400 2399 -1 -0.041667 23.557143
host9 2400 2399 -1 -0.041667 24.431548
host10 2400 2400 0 0.000000 23.494153
host11 2400 2401 1 0.041667 23.976621
host12 2400 2400 0 0.000000 24.512622
host13 2400 2397 -3 -0.125000 24.010814
host14 2400 2398 -2 -0.083333 23.229791
host15 2400 2402 2 0.083333 24.510854
host16 2400 2401 1 0.041667 23.188161
host17 2400 2397 -3 -0.125000 23.931915
host18 2400 2400 0 0.000000 23.886135
host19 2400 2398 -2 -0.083333 23.442129
host20 2400 2401 1 0.041667 23.393092
host21 2400 2398 -2 -0.083333 23.940452
host22 2400 2401 1 0.041667 23.643843
host23 2400 2403 3 0.125000 24.592113
host24 2400 2402 2 0.083333 24.561842
host25 2400 2401 1 0.041667 24.598754
host26 2400 2398 -2 -0.083333 24.350951
host27 2400 2399 -1 -0.041667 23.336478
host28 2400 2401 1 0.041667 23.549652
host29 2400 2401 1 0.041667 23.840408
host30 2400 2400 0 0.000000 23.932423
host31 2400 2397 -3 -0.125000 24.295621
host32 2400 2402 2 0.083333 24.298228
host33 2400 2403 3 0.125000 24.068700
host34 2400 2399 -1 -0.041667 24.395416
host35 2400 2398 -2 -0.083333 23.522074
host36 2400 2395 -5 -0.208333 23.746354
host37 2400 2402 2 0.083333 24.120875
host38 2400 2401 1 0.041667 24.034644
host39 2400 2400 0 0.000000 24.665110
host40 2400 2400 0 0.000000 23.856618
host41 2400 2400 0 0.000000 23.265386
host42 2400 2398 -2 -0.083333 23.334984
host43 2400 2400 0 0.000000 23.950316
host44 2400 2404 4 0.166667 25.276133
host45 2400 2399 -1 -0.041667 24.272922
host46 2400 2399 -1 -0.041667 24.013644
host47 2400 2402 2 0.083333 24.113955
host48 2400 2404 4 0.166667 23.582616
host49 2400 2400 0 0.000000 24.531067
host50 2400 2400 0 0.000000 23.784893
host51 2400 2401 1 0.041667 24.793213
host52 2400 2400 0 0.000000 24.170809
host53 2400 2400 0 0.000000 23.783899
host54 2400 2399 -1 -0.041667 24.365295
host55 2400 2398 -2 -0.083333 23.645767
host56 2400 2401 1 0.041667 23.858433
host57 2400 2399 -1 -0.041667 24.159351
host58 2400 2396 -4 -0.166667 23.430493
host59 2400 2402 2 0.083333 24.107154
host60 2400 2403 3 0.125000 24.784382
host61 2400 2397 -3 -0.125000 24.292784
host62 2400 2399 -1 -0.041667 24.404311
host63 2400 2400 0 0.000000 24.219422
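The kl figures above are Kullback-Leibler divergences between the expected and actual distributions; a minimal sketch of that computation, assuming both are given as raw object counts like the ~expected~ and ~actual~ columns:

```python
import math

def kl_divergence(expected, actual):
    """KL divergence D(p || q) between two count vectors, after
    normalizing each to a probability distribution."""
    p_total, q_total = sum(expected), sum(actual)
    return sum((e / p_total) * math.log((e / p_total) / (a / q_total))
               for e, a in zip(expected, actual))

# Identical distributions diverge by zero; a small imbalance gives a
# small positive value.
print(kl_divergence([2400, 2400], [2400, 2400]))
print(kl_divergence([2400, 2400], [2345, 2455]))
```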
On 04/27/2017 12:25 AM, Loic Dachary wrote:
> It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.
>
> We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
>
> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
> dc1 102400 102400 0 0.000000 1008
> host0 2438 2390 -48 -1.968827 24
> host1 2438 2370 -68 -2.789171 24
> host2 2438 2493 55 2.255947 24
> host3 2438 2396 -42 -1.722724 24
> host4 2438 2497 59 2.420016 24
> host5 2438 2520 82 3.363413 24
> host6 2438 2500 62 2.543068 24
> host7 2438 2380 -58 -2.378999 24
> host8 2438 2488 50 2.050861 24
> host9 2438 2435 -3 -0.123052 24
> host10 2438 2440 2 0.082034 24
> host11 2438 2472 34 1.394586 24
> host12 2438 2346 -92 -3.773585 24
> host13 2438 2411 -27 -1.107465 24
> host14 2438 2513 75 3.076292 24
> host15 2438 2421 -17 -0.697293 24
> host16 2438 2469 31 1.271534 24
> host17 2438 2419 -19 -0.779327 24
> host18 2438 2424 -14 -0.574241 24
> host19 2438 2451 13 0.533224 24
> host20 2438 2486 48 1.968827 24
> host21 2438 2439 1 0.041017 24
> host22 2438 2482 44 1.804758 24
> host23 2438 2415 -23 -0.943396 24
> host24 2438 2389 -49 -2.009844 24
> host25 2438 2265 -173 -7.095980 24
> host26 2438 2374 -64 -2.625103 24
> host27 2438 2529 91 3.732568 24
> host28 2438 2495 57 2.337982 24
> host29 2438 2433 -5 -0.205086 24
> host30 2438 2485 47 1.927810 24
> host31 2438 2377 -61 -2.502051 24
> host32 2438 2441 3 0.123052 24
> host33 2438 2421 -17 -0.697293 24
> host34 2438 2359 -79 -3.240361 24
> host35 2438 2509 71 2.912223 24
> host36 2438 2425 -13 -0.533224 24
> host37 2438 2419 -19 -0.779327 24
> host38 2438 2403 -35 -1.435603 24
> host39 2438 2458 20 0.820345 24
> host40 2438 2458 20 0.820345 24
> host41 2438 2503 65 2.666120 24
>
> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
> dc1 102400 102400 0 0.000000 1008
> host0 2438 2438 0 0.000000 24.559919
> host1 2438 2438 0 0.000000 24.641221
> host2 2438 2440 2 0.082034 23.486113
> host3 2438 2437 -1 -0.041017 24.525875
> host4 2438 2436 -2 -0.082034 23.644304
> host5 2438 2440 2 0.082034 23.245287
> host6 2438 2442 4 0.164069 23.617162
> host7 2438 2439 1 0.041017 24.746174
> host8 2438 2436 -2 -0.082034 23.584667
> host9 2438 2439 1 0.041017 24.140637
> host10 2438 2438 0 0.000000 24.060084
> host11 2438 2441 3 0.123052 23.730349
> host12 2438 2437 -1 -0.041017 24.948602
> host13 2438 2437 -1 -0.041017 24.280851
> host14 2438 2436 -2 -0.082034 23.402216
> host15 2438 2436 -2 -0.082034 24.272037
> host16 2438 2437 -1 -0.041017 23.747867
> host17 2438 2436 -2 -0.082034 24.266271
> host18 2438 2438 0 0.000000 24.158545
> host19 2438 2440 2 0.082034 23.934788
> host20 2438 2438 0 0.000000 23.630851
> host21 2438 2435 -3 -0.123052 24.001950
> host22 2438 2440 2 0.082034 23.623120
> host23 2438 2437 -1 -0.041017 24.343138
> host24 2438 2438 0 0.000000 24.595820
> host25 2438 2439 1 0.041017 25.547510
> host26 2438 2437 -1 -0.041017 24.753111
> host27 2438 2437 -1 -0.041017 23.288606
> host28 2438 2437 -1 -0.041017 23.425059
> host29 2438 2438 0 0.000000 24.115941
> host30 2438 2441 3 0.123052 23.560539
> host31 2438 2438 0 0.000000 24.459911
> host32 2438 2440 2 0.082034 24.096746
> host33 2438 2437 -1 -0.041017 24.241316
> host34 2438 2438 0 0.000000 24.715044
> host35 2438 2436 -2 -0.082034 23.424601
> host36 2438 2436 -2 -0.082034 24.123606
> host37 2438 2439 1 0.041017 24.368997
> host38 2438 2440 2 0.082034 24.331532
> host39 2438 2439 1 0.041017 23.803561
> host40 2438 2437 -1 -0.041017 23.861094
> host41 2438 2442 4 0.164069 23.468473
>
>
> On 04/26/2017 11:08 PM, Loic Dachary wrote:
>>
>>
>> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>>> Hi Loic,
>>>
>>> Well, the results are better certainly! Some comments:
>>>
>>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
>>> but I'm not an expert in optimization either. I wonder how it
>>> will scale with more weights[1]. My attempt at using scipy's optimize
>>> didn't work because you are optimizing a stochastic function and this
>>> can make scipy decide that no further steps are possible. The
>>> field that studies this kind of problem is stochastic optimization
>>> [2]
>>
>> You were right, it does not always work. Note that this is *not* about the conditional probability bias: it is about the uneven distribution due to the low number of samples. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem; it's what happens most of the time.
>>
>> Even in a case as simple as 12 devices starting with:
>>
>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>> host1 2560.000000 2580 20.000000 0.781250 24
>> device12 106.666667 101 -5.666667 -5.312500 1
>> device13 213.333333 221 7.666667 3.593750 2
>> device14 320.000000 317 -3.000000 -0.937500 3
>> device15 106.666667 101 -5.666667 -5.312500 1
>> device16 213.333333 217 3.666667 1.718750 2
>> device17 320.000000 342 22.000000 6.875000 3
>> device18 106.666667 102 -4.666667 -4.375000 1
>> device19 213.333333 243 29.666667 13.906250 2
>> device20 320.000000 313 -7.000000 -2.187500 3
>> device21 106.666667 94 -12.666667 -11.875000 1
>> device22 213.333333 208 -5.333333 -2.500000 2
>> device23 320.000000 321 1.000000 0.312500 3
>>
>> res = minimize(crush, weights, method='nelder-mead',
>>                options={'xtol': 1e-8, 'disp': True})
>>
>> device weights [ 1. 3. 3. 2. 3. 2. 2. 1. 3. 1. 1. 2.]
>> device kl 0.00117274995028
>> ...
>> device kl 0.00016530695476
>> Optimization terminated successfully.
>> Current function value: 0.000165
>> Iterations: 117
>> Function evaluations: 470
>>
>> we still get a 5% difference on device 21:
>>
>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>> host1 2560.000000 2559 -1.000000 -0.039062 23.805183
>> device12 106.666667 103 -3.666667 -3.437500 1.016999
>> device13 213.333333 214 0.666667 0.312500 1.949328
>> device14 320.000000 325 5.000000 1.562500 3.008688
>> device15 106.666667 106 -0.666667 -0.625000 1.012565
>> device16 213.333333 214 0.666667 0.312500 1.976344
>> device17 320.000000 320 0.000000 0.000000 2.845135
>> device18 106.666667 102 -4.666667 -4.375000 1.039181
>> device19 213.333333 214 0.666667 0.312500 1.820435
>> device20 320.000000 324 4.000000 1.250000 3.062573
>> device21 106.666667 101 -5.666667 -5.312500 1.071341
>> device22 213.333333 212 -1.333333 -0.625000 2.039190
>> device23 320.000000 324 4.000000 1.250000 3.016468
>>
>>
>>> - I used KL divergence for the loss function. My first attempt was
>>> using as you standard deviation (more commonly known as L2 loss) with
>>> gradient descent, but it didn't work very well.
>>>
>>> - Sum of differences sounds like a bad idea, +100 and -100 errors will
>>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>>> Maybe you were talking about the absolute value of the differences?
>>>
>>> - Well, now that CRUSH can use multiple weights, the problem that
>>> remains, I think, is seeing whether the optimization is: a) reliable
>>> and b) fast enough
>>>
>>> Cheers,
>>> Pedro.
>>>
>>> [1] http://www.benfrederickson.com/numerical-optimization/
>>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>>
>>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>>
>>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient based optimization. But I'm not sure and maybe this is completely wrong...
>>>>
>>>> Before optimization the situation is:
>>>>
>>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>>> ~name~
>>>> dc1 1024 1024 0 0.000000
>>>> host0 256 294 38 14.843750
>>>> device0 128 153 25 19.531250
>>>> device1 128 141 13 10.156250
>>>> host1 256 301 45 17.578125
>>>> device2 128 157 29 22.656250
>>>> device3 128 144 16 12.500000
>>>> host2 512 429 -83 -16.210938
>>>> device4 128 96 -32 -25.000000
>>>> device5 128 117 -11 -8.593750
>>>> device6 256 216 -40 -15.625000
>>>>
>>>> and after optimization we have the following:
>>>>
>>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>>> ~name~
>>>> dc1 1024 1024 0 0.000000
>>>> host0 256 259 3 1.171875
>>>> device0 128 129 1 0.781250
>>>> device1 128 130 2 1.562500
>>>> host1 256 258 2 0.781250
>>>> device2 128 129 1 0.781250
>>>> device3 128 129 1 0.781250
>>>> host2 512 507 -5 -0.976562
>>>> device4 128 126 -2 -1.562500
>>>> device5 128 127 -1 -0.781250
>>>> device6 256 254 -2 -0.781250
>>>>
>>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>>
>>>> Cheers
>>>>
>>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>>> [4] https://github.com/ceph/ceph/pull/14486
>>>>
>>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>>> There are a lot of gradient-free methods. I will try first to run the
>>>>> ones available using just scipy
>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>> Some of them don't require the gradient and some of them can estimate
>>>>> it. The reason to go without the gradient is to run the CRUSH
>>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>>
>>>>> - BEGIN CODE -
>>>>> def build_target(desired_freqs):
>>>>>     def target(weights):
>>>>>         # run a simulation of CRUSH for a number of objects
>>>>>         sim_freqs = run_crush(weights)
>>>>>         # Kullback-Leibler divergence between desired and simulated frequencies
>>>>>         return loss(sim_freqs, desired_freqs)
>>>>>     return target
>>>>>
>>>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>>>> - END CODE -
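A runnable toy version of the loop quoted above, with `run_crush` replaced by exact first-pick frequencies for a single straw bucket and `loss` as a KL divergence; both stand-ins are illustrative, not the real CRUSH simulation:

```python
import math

def run_crush(weights):
    """Stand-in for a CRUSH simulation: exact first-pick frequencies
    for a single straw bucket are just the normalized weights."""
    total = sum(weights)
    return [w / total for w in weights]

def loss(sim_freqs, desired_freqs):
    """Kullback-Leibler divergence D(desired || simulated)."""
    return sum(d * math.log(d / s)
               for d, s in zip(desired_freqs, sim_freqs) if d > 0)

def build_target(desired_freqs):
    def target(weights):
        sim_freqs = run_crush(weights)
        return loss(sim_freqs, desired_freqs)
    return target

# The target is zero when the weights already produce the desired
# frequencies, and positive otherwise.
target = build_target([0.5, 0.3, 0.2])
print(target([5, 3, 2]))
print(target([1, 1, 1]))
```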
>>>>>
>>>>> The tricky thing here is that this procedure can be slow if the
>>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>>> simulated frequencies. This is true especially if the minimize method
>>>>> attempts to approximate the gradient using finite differences, since it
>>>>> will evaluate the target function a number of times proportional to
>>>>> the number of weights. Apart from the ones in scipy I would try also
>>>>> optimization methods that try to perform as few evaluations as
>>>>> possible like for example HyperOpt
>>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>>> account that the target function can be noisy.
>>>>>
>>>>> This black box approximation is simple to implement and makes the
>>>>> computer do all the work instead of us.
>>>>> I think that this black box approximation is worth trying even if
>>>>> it's not the final one because if this approximation works then we
>>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>>> algorithm will work for sure.
>>>>>
>>>>> I can try this black box approximation this weekend not on the real
>>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>>> it works it's just a matter of substituting one simulation with
>>>>> another and see what happens.
>>>>>
>>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>>> Hi Loic,
>>>>>>>
>>>>>>> From what I see everything seems OK.
>>>>>>
>>>>>> Cool. I'll keep going in this direction then !
>>>>>>
>>>>>>> The interesting thing would be to
>>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>>> algorithm.
>>>>>>
>>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
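For readers unfamiliar with straw buckets, the selection they perform can be sketched as below. This mimics the straw2 idea (draw ln(u)/weight per item and keep the maximum) but uses sha256 as a stand-in for CRUSH's actual hash, so it is illustrative only:

```python
import hashlib
import math

def straw2_like_choose(items, weights, x, r):
    """Pick one item: hash (x, r, item) to a uniform u in (0, 1],
    draw ln(u) / weight, keep the item with the largest draw."""
    best_item, best_draw = None, -math.inf
    for item, weight in zip(items, weights):
        h = int(hashlib.sha256(f"{x}:{r}:{item}".encode()).hexdigest(), 16)
        u = ((h % 2**32) + 1) / 2**32  # uniform in (0, 1]
        draw = math.log(u) / weight
        if draw > best_draw:
            best_item, best_draw = item, draw
    return best_item

# Over many inputs, items are chosen roughly in proportion to their
# weights: here "b" (weight 2) should win about twice as often as "a".
counts = {"a": 0, "b": 0}
for x in range(3000):
    counts[straw2_like_choose(["a", "b"], [1.0, 2.0], x, 0)] += 1
print(counts)
```

A nice property visible even in this sketch: changing one item's weight only affects draws involving that item, which is why straw buckets minimize data movement on weight changes.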
>>>>>>
>>>>>>> That's the work that remains to be done. The only way that
>>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>>> optimization method or making an estimation of the gradient.
>>>>>>
>>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>>
>>>>>>>> Thanks !
>>>>>>>>
>>>>>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> Before: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>> disk 0: 1.39e-01 1.12e-01
>>>>>>>> disk 1: 1.11e-01 1.10e-01
>>>>>>>> disk 2: 8.33e-02 1.13e-01
>>>>>>>> disk 3: 1.39e-01 1.11e-01
>>>>>>>> disk 4: 1.11e-01 1.11e-01
>>>>>>>> disk 5: 8.33e-02 1.11e-01
>>>>>>>> disk 6: 1.39e-01 1.12e-01
>>>>>>>> disk 7: 1.11e-01 1.12e-01
>>>>>>>> disk 8: 8.33e-02 1.10e-01
>>>>>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>>> ...
>>>>>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>>> Converged to desired accuracy :)
>>>>>>>> After: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>> disk 0: 1.39e-01 1.42e-01
>>>>>>>> disk 1: 1.11e-01 1.09e-01
>>>>>>>> disk 2: 8.33e-02 8.37e-02
>>>>>>>> disk 3: 1.39e-01 1.40e-01
>>>>>>>> disk 4: 1.11e-01 1.13e-01
>>>>>>>> disk 5: 8.33e-02 8.08e-02
>>>>>>>> disk 6: 1.39e-01 1.38e-01
>>>>>>>> disk 7: 1.11e-01 1.09e-01
>>>>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>>>>
>>>>>>>>
>>>>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> Before: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>> disk 0: 2.44e-01 2.36e-01
>>>>>>>> disk 1: 2.44e-01 2.38e-01
>>>>>>>> disk 2: 2.44e-01 2.34e-01
>>>>>>>> disk 3: 2.44e-01 2.38e-01
>>>>>>>> disk 4: 2.44e-02 5.37e-02
>>>>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>>> ...
>>>>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>>> Converged to desired accuracy :)
>>>>>>>> After: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>> disk 0: 2.44e-01 2.46e-01
>>>>>>>> disk 1: 2.44e-01 2.44e-01
>>>>>>>> disk 2: 2.44e-01 2.41e-01
>>>>>>>> disk 3: 2.44e-01 2.45e-01
>>>>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>>
>>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>>
>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>
>>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>>
>>>>>>>>>>> I keep working on it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>>> work on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>>
>>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>>
>>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>>> change and evolve independently. (In most cases any change to
>>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>>
>>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>
>>>>>>>>>>>> The main difference is that the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>>> retry_descent.
>>>>>>>>>>>>
>>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>>> adapted)!
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-27 6:12 ` Loic Dachary
@ 2017-04-27 16:47 ` Loic Dachary
2017-04-27 22:14 ` Loic Dachary
0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-27 16:47 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Ceph Development
Hi Pedro,
After I suspected uniform weights could be a border case, I tried with varying weights and did not get good results. Nelder-Mead also tried negative values for the weights (and why not), which are invalid for CRUSH. And since there is no way to specify value bounds for Nelder-Mead, that makes it a bad candidate for the job.
Next in line seems to be L-BFGS-B [1], which
a) projects the gradient and is likely to run faster
b) allows a minimum bound to be defined for each variable, so we won't get negative weights
I'll go in this direction unless you tell me "Noooooo this is a baaaaad idea" ;-)
Cheers
[1] https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
On 04/27/2017 08:12 AM, Loic Dachary wrote:
> With 63 hosts instead of 41 we get the same results: from kl 1.9169485575e-04 to kl 3.0384231953e-07, with the maximum difference going from ~8% to ~0.5%. What is interesting (at least to me ;-) is that the weights don't change that much: they all stay in the range ]23,25].
>
> Note that all this optimization is done by changing a single weight per host. It is worth trying again with two different weights (which is what you did in https://github.com/plafl/notebooks/blob/master/replication.ipynb). The weight for the first draw is immutable as it is (i.e. 24) and the weight for the second draw is allowed to change.
>
> Before optimization
>
> host0 2400 2345 -55 -2.291667 24
> host1 2400 2434 34 1.416667 24
> host2 2400 2387 -13 -0.541667 24
> host3 2400 2351 -49 -2.041667 24
> host4 2400 2423 23 0.958333 24
> host5 2400 2456 56 2.333333 24
> host6 2400 2450 50 2.083333 24
> host7 2400 2307 -93 -3.875000 24
> host8 2400 2434 34 1.416667 24
> host9 2400 2358 -42 -1.750000 24
> host10 2400 2452 52 2.166667 24
> host11 2400 2398 -2 -0.083333 24
> host12 2400 2359 -41 -1.708333 24
> host13 2400 2403 3 0.125000 24
> host14 2400 2484 84 3.500000 24
> host15 2400 2348 -52 -2.166667 24
> host16 2400 2489 89 3.708333 24
> host17 2400 2412 12 0.500000 24
> host18 2400 2416 16 0.666667 24
> host19 2400 2453 53 2.208333 24
> host20 2400 2475 75 3.125000 24
> host21 2400 2413 13 0.541667 24
> host22 2400 2450 50 2.083333 24
> host23 2400 2348 -52 -2.166667 24
> host24 2400 2355 -45 -1.875000 24
> host25 2400 2348 -52 -2.166667 24
> host26 2400 2373 -27 -1.125000 24
> host27 2400 2470 70 2.916667 24
> host28 2400 2449 49 2.041667 24
> host29 2400 2420 20 0.833333 24
> host30 2400 2406 6 0.250000 24
> host31 2400 2376 -24 -1.000000 24
> host32 2400 2371 -29 -1.208333 24
> host33 2400 2395 -5 -0.208333 24
> host34 2400 2351 -49 -2.041667 24
> host35 2400 2453 53 2.208333 24
> host36 2400 2421 21 0.875000 24
> host37 2400 2393 -7 -0.291667 24
> host38 2400 2394 -6 -0.250000 24
> host39 2400 2322 -78 -3.250000 24
> host40 2400 2409 9 0.375000 24
> host41 2400 2486 86 3.583333 24
> host42 2400 2466 66 2.750000 24
> host43 2400 2409 9 0.375000 24
> host44 2400 2276 -124 -5.166667 24
> host45 2400 2379 -21 -0.875000 24
> host46 2400 2394 -6 -0.250000 24
> host47 2400 2401 1 0.041667 24
> host48 2400 2446 46 1.916667 24
> host49 2400 2349 -51 -2.125000 24
> host50 2400 2413 13 0.541667 24
> host51 2400 2333 -67 -2.791667 24
> host52 2400 2387 -13 -0.541667 24
> host53 2400 2407 7 0.291667 24
> host54 2400 2377 -23 -0.958333 24
> host55 2400 2441 41 1.708333 24
> host56 2400 2420 20 0.833333 24
> host57 2400 2388 -12 -0.500000 24
> host58 2400 2460 60 2.500000 24
> host59 2400 2394 -6 -0.250000 24
> host60 2400 2316 -84 -3.500000 24
> host61 2400 2373 -27 -1.125000 24
> host62 2400 2362 -38 -1.583333 24
> host63 2400 2372 -28 -1.166667 24
>
> After optimization
>
> host0 2400 2403 3 0.125000 24.575153
> host1 2400 2401 1 0.041667 23.723316
> host2 2400 2402 2 0.083333 24.168746
> host3 2400 2399 -1 -0.041667 24.520240
> host4 2400 2399 -1 -0.041667 23.911445
> host5 2400 2400 0 0.000000 23.606956
> host6 2400 2401 1 0.041667 23.714102
> host7 2400 2400 0 0.000000 25.008463
> host8 2400 2399 -1 -0.041667 23.557143
> host9 2400 2399 -1 -0.041667 24.431548
> host10 2400 2400 0 0.000000 23.494153
> host11 2400 2401 1 0.041667 23.976621
> host12 2400 2400 0 0.000000 24.512622
> host13 2400 2397 -3 -0.125000 24.010814
> host14 2400 2398 -2 -0.083333 23.229791
> host15 2400 2402 2 0.083333 24.510854
> host16 2400 2401 1 0.041667 23.188161
> host17 2400 2397 -3 -0.125000 23.931915
> host18 2400 2400 0 0.000000 23.886135
> host19 2400 2398 -2 -0.083333 23.442129
> host20 2400 2401 1 0.041667 23.393092
> host21 2400 2398 -2 -0.083333 23.940452
> host22 2400 2401 1 0.041667 23.643843
> host23 2400 2403 3 0.125000 24.592113
> host24 2400 2402 2 0.083333 24.561842
> host25 2400 2401 1 0.041667 24.598754
> host26 2400 2398 -2 -0.083333 24.350951
> host27 2400 2399 -1 -0.041667 23.336478
> host28 2400 2401 1 0.041667 23.549652
> host29 2400 2401 1 0.041667 23.840408
> host30 2400 2400 0 0.000000 23.932423
> host31 2400 2397 -3 -0.125000 24.295621
> host32 2400 2402 2 0.083333 24.298228
> host33 2400 2403 3 0.125000 24.068700
> host34 2400 2399 -1 -0.041667 24.395416
> host35 2400 2398 -2 -0.083333 23.522074
> host36 2400 2395 -5 -0.208333 23.746354
> host37 2400 2402 2 0.083333 24.120875
> host38 2400 2401 1 0.041667 24.034644
> host39 2400 2400 0 0.000000 24.665110
> host40 2400 2400 0 0.000000 23.856618
> host41 2400 2400 0 0.000000 23.265386
> host42 2400 2398 -2 -0.083333 23.334984
> host43 2400 2400 0 0.000000 23.950316
> host44 2400 2404 4 0.166667 25.276133
> host45 2400 2399 -1 -0.041667 24.272922
> host46 2400 2399 -1 -0.041667 24.013644
> host47 2400 2402 2 0.083333 24.113955
> host48 2400 2404 4 0.166667 23.582616
> host49 2400 2400 0 0.000000 24.531067
> host50 2400 2400 0 0.000000 23.784893
> host51 2400 2401 1 0.041667 24.793213
> host52 2400 2400 0 0.000000 24.170809
> host53 2400 2400 0 0.000000 23.783899
> host54 2400 2399 -1 -0.041667 24.365295
> host55 2400 2398 -2 -0.083333 23.645767
> host56 2400 2401 1 0.041667 23.858433
> host57 2400 2399 -1 -0.041667 24.159351
> host58 2400 2396 -4 -0.166667 23.430493
> host59 2400 2402 2 0.083333 24.107154
> host60 2400 2403 3 0.125000 24.784382
> host61 2400 2397 -3 -0.125000 24.292784
> host62 2400 2399 -1 -0.041667 24.404311
> host63 2400 2400 0 0.000000 24.219422
>
>
> On 04/27/2017 12:25 AM, Loic Dachary wrote:
>> It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.
>>
>> We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
>>
>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>> dc1 102400 102400 0 0.000000 1008
>> host0 2438 2390 -48 -1.968827 24
>> host1 2438 2370 -68 -2.789171 24
>> host2 2438 2493 55 2.255947 24
>> host3 2438 2396 -42 -1.722724 24
>> host4 2438 2497 59 2.420016 24
>> host5 2438 2520 82 3.363413 24
>> host6 2438 2500 62 2.543068 24
>> host7 2438 2380 -58 -2.378999 24
>> host8 2438 2488 50 2.050861 24
>> host9 2438 2435 -3 -0.123052 24
>> host10 2438 2440 2 0.082034 24
>> host11 2438 2472 34 1.394586 24
>> host12 2438 2346 -92 -3.773585 24
>> host13 2438 2411 -27 -1.107465 24
>> host14 2438 2513 75 3.076292 24
>> host15 2438 2421 -17 -0.697293 24
>> host16 2438 2469 31 1.271534 24
>> host17 2438 2419 -19 -0.779327 24
>> host18 2438 2424 -14 -0.574241 24
>> host19 2438 2451 13 0.533224 24
>> host20 2438 2486 48 1.968827 24
>> host21 2438 2439 1 0.041017 24
>> host22 2438 2482 44 1.804758 24
>> host23 2438 2415 -23 -0.943396 24
>> host24 2438 2389 -49 -2.009844 24
>> host25 2438 2265 -173 -7.095980 24
>> host26 2438 2374 -64 -2.625103 24
>> host27 2438 2529 91 3.732568 24
>> host28 2438 2495 57 2.337982 24
>> host29 2438 2433 -5 -0.205086 24
>> host30 2438 2485 47 1.927810 24
>> host31 2438 2377 -61 -2.502051 24
>> host32 2438 2441 3 0.123052 24
>> host33 2438 2421 -17 -0.697293 24
>> host34 2438 2359 -79 -3.240361 24
>> host35 2438 2509 71 2.912223 24
>> host36 2438 2425 -13 -0.533224 24
>> host37 2438 2419 -19 -0.779327 24
>> host38 2438 2403 -35 -1.435603 24
>> host39 2438 2458 20 0.820345 24
>> host40 2438 2458 20 0.820345 24
>> host41 2438 2503 65 2.666120 24
>>
>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>> dc1 102400 102400 0 0.000000 1008
>> host0 2438 2438 0 0.000000 24.559919
>> host1 2438 2438 0 0.000000 24.641221
>> host2 2438 2440 2 0.082034 23.486113
>> host3 2438 2437 -1 -0.041017 24.525875
>> host4 2438 2436 -2 -0.082034 23.644304
>> host5 2438 2440 2 0.082034 23.245287
>> host6 2438 2442 4 0.164069 23.617162
>> host7 2438 2439 1 0.041017 24.746174
>> host8 2438 2436 -2 -0.082034 23.584667
>> host9 2438 2439 1 0.041017 24.140637
>> host10 2438 2438 0 0.000000 24.060084
>> host11 2438 2441 3 0.123052 23.730349
>> host12 2438 2437 -1 -0.041017 24.948602
>> host13 2438 2437 -1 -0.041017 24.280851
>> host14 2438 2436 -2 -0.082034 23.402216
>> host15 2438 2436 -2 -0.082034 24.272037
>> host16 2438 2437 -1 -0.041017 23.747867
>> host17 2438 2436 -2 -0.082034 24.266271
>> host18 2438 2438 0 0.000000 24.158545
>> host19 2438 2440 2 0.082034 23.934788
>> host20 2438 2438 0 0.000000 23.630851
>> host21 2438 2435 -3 -0.123052 24.001950
>> host22 2438 2440 2 0.082034 23.623120
>> host23 2438 2437 -1 -0.041017 24.343138
>> host24 2438 2438 0 0.000000 24.595820
>> host25 2438 2439 1 0.041017 25.547510
>> host26 2438 2437 -1 -0.041017 24.753111
>> host27 2438 2437 -1 -0.041017 23.288606
>> host28 2438 2437 -1 -0.041017 23.425059
>> host29 2438 2438 0 0.000000 24.115941
>> host30 2438 2441 3 0.123052 23.560539
>> host31 2438 2438 0 0.000000 24.459911
>> host32 2438 2440 2 0.082034 24.096746
>> host33 2438 2437 -1 -0.041017 24.241316
>> host34 2438 2438 0 0.000000 24.715044
>> host35 2438 2436 -2 -0.082034 23.424601
>> host36 2438 2436 -2 -0.082034 24.123606
>> host37 2438 2439 1 0.041017 24.368997
>> host38 2438 2440 2 0.082034 24.331532
>> host39 2438 2439 1 0.041017 23.803561
>> host40 2438 2437 -1 -0.041017 23.861094
>> host41 2438 2442 4 0.164069 23.468473
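[Editor's note: the kl figures quoted throughout this thread are Kullback-Leibler divergences between the expected and observed placement distributions. A minimal sketch of that computation; the count vectors below are illustrative, not taken from the tables above:

```python
import numpy as np

def kl_divergence(expected, actual):
    """KL divergence D(expected || actual) between two count vectors."""
    p = np.asarray(expected, dtype=float)
    q = np.asarray(actual, dtype=float)
    p /= p.sum()  # normalize counts into probability distributions
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Two hosts slightly off their expected share: kl is small but non-zero,
# on the order of the 1e-4 values quoted in the thread.
print(kl_divergence([2438, 2438, 2438], [2390, 2486, 2438]))
```

Identical distributions give a divergence of exactly zero, which is why kl shrinking from 1e-4 to 1e-7 indicates the optimized weights fit the target much more closely.]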
>>
>>
>> On 04/26/2017 11:08 PM, Loic Dachary wrote:
>>>
>>>
>>> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>>>> Hi Loic,
>>>>
>>>> Well, the results are better certainly! Some comments:
>>>>
>>>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
>>>> but I'm not an expert in optimization either. I wonder how it
>>>> will scale with more weights[1]. My attempt at using scipy's optimize
>>>> didn't work because you are optimizing a stochastic function, and this
>>>> can make scipy decide that no further steps are possible. The
>>>> field that studies this kind of problem is stochastic optimization
>>>> [2]
>>>
>>> You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem; it is what happens most of the time.
>>>
>>> Even in a case as simple as 12 devices starting with:
>>>
>>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>>> host1 2560.000000 2580 20.000000 0.781250 24
>>> device12 106.666667 101 -5.666667 -5.312500 1
>>> device13 213.333333 221 7.666667 3.593750 2
>>> device14 320.000000 317 -3.000000 -0.937500 3
>>> device15 106.666667 101 -5.666667 -5.312500 1
>>> device16 213.333333 217 3.666667 1.718750 2
>>> device17 320.000000 342 22.000000 6.875000 3
>>> device18 106.666667 102 -4.666667 -4.375000 1
>>> device19 213.333333 243 29.666667 13.906250 2
>>> device20 320.000000 313 -7.000000 -2.187500 3
>>> device21 106.666667 94 -12.666667 -11.875000 1
>>> device22 213.333333 208 -5.333333 -2.500000 2
>>> device23 320.000000 321 1.000000 0.312500 3
>>>
>>> res = minimize(crush, weights, method='nelder-mead',
>>>                options={'xtol': 1e-8, 'disp': True})
>>>
>>> device weights [ 1. 3. 3. 2. 3. 2. 2. 1. 3. 1. 1. 2.]
>>> device kl 0.00117274995028
>>> ...
>>> device kl 0.00016530695476
>>> Optimization terminated successfully.
>>> Current function value: 0.000165
>>> Iterations: 117
>>> Function evaluations: 470
>>>
>>> we still get a 5% difference on device 21:
>>>
>>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>>> host1 2560.000000 2559 -1.000000 -0.039062 23.805183
>>> device12 106.666667 103 -3.666667 -3.437500 1.016999
>>> device13 213.333333 214 0.666667 0.312500 1.949328
>>> device14 320.000000 325 5.000000 1.562500 3.008688
>>> device15 106.666667 106 -0.666667 -0.625000 1.012565
>>> device16 213.333333 214 0.666667 0.312500 1.976344
>>> device17 320.000000 320 0.000000 0.000000 2.845135
>>> device18 106.666667 102 -4.666667 -4.375000 1.039181
>>> device19 213.333333 214 0.666667 0.312500 1.820435
>>> device20 320.000000 324 4.000000 1.250000 3.062573
>>> device21 106.666667 101 -5.666667 -5.312500 1.071341
>>> device22 213.333333 212 -1.333333 -0.625000 2.039190
>>> device23 320.000000 324 4.000000 1.250000 3.016468
>>>
>>>
>>>> - I used KL divergence for the loss function. My first attempt was
>>>> using, as you did, the standard deviation (more commonly known as L2
>>>> loss) with gradient descent, but it didn't work very well.
>>>>
>>>> - Sum of differences sounds like a bad idea: +100 and -100 errors will
>>>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>>>> Maybe you were talking about the absolute value of the differences?
>>>>
>>>> - Well, now that CRUSH can use multiple weights, the problem that
>>>> remains, I think, is seeing whether the optimization is: a) reliable
>>>> and b) fast enough
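[Editor's note: Pedro's point about the loss function can be checked numerically. A raw sum of differences rewards being uniformly short of the target, while an absolute (L1) loss does not; the toy counts below are illustrative:

```python
def sum_diff(expected, actual):
    """Raw sum of signed errors -- a bad loss function."""
    return sum(a - e for e, a in zip(expected, actual))

def l1_loss(expected, actual):
    """Sum of absolute errors -- errors cannot cancel out."""
    return sum(abs(a - e) for e, a in zip(expected, actual))

expected = [100, 100]
# +100 and -100 errors cancel in the raw sum, but not in L1.
print(sum_diff(expected, [200, 0]))   # 0, looks perfect
print(l1_loss(expected, [200, 0]))    # 200, correctly flagged as bad
# Worse still: two -100 errors score "better" than a perfect fit.
print(sum_diff(expected, [0, 0]))     # -200, "better" than 0
```

This is exactly why KL divergence or an L1/L2 loss is preferred over the plain sum of differences.]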
>>>>
>>>> Cheers,
>>>> Pedro.
>>>>
>>>> [1] http://www.benfrederickson.com/numerical-optimization/
>>>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>>>
>>>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>>>
>>>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient based optimization. But I'm not sure and maybe this is completely wrong...
>>>>>
>>>>> Before optimization the situation is:
>>>>>
>>>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>>>> ~name~
>>>>> dc1 1024 1024 0 0.000000
>>>>> host0 256 294 38 14.843750
>>>>> device0 128 153 25 19.531250
>>>>> device1 128 141 13 10.156250
>>>>> host1 256 301 45 17.578125
>>>>> device2 128 157 29 22.656250
>>>>> device3 128 144 16 12.500000
>>>>> host2 512 429 -83 -16.210938
>>>>> device4 128 96 -32 -25.000000
>>>>> device5 128 117 -11 -8.593750
>>>>> device6 256 216 -40 -15.625000
>>>>>
>>>>> and after optimization we have the following:
>>>>>
>>>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>>>> ~name~
>>>>> dc1 1024 1024 0 0.000000
>>>>> host0 256 259 3 1.171875
>>>>> device0 128 129 1 0.781250
>>>>> device1 128 130 2 1.562500
>>>>> host1 256 258 2 0.781250
>>>>> device2 128 129 1 0.781250
>>>>> device3 128 129 1 0.781250
>>>>> host2 512 507 -5 -0.976562
>>>>> device4 128 126 -2 -1.562500
>>>>> device5 128 127 -1 -0.781250
>>>>> device6 256 254 -2 -0.781250
>>>>>
>>>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>>>
>>>>> Cheers
>>>>>
>>>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>>>> [4] https://github.com/ceph/ceph/pull/14486
>>>>>
>>>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>>>> There are lot of gradient-free methods. I will try first to run the
>>>>>> ones available using just scipy
>>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>>> Some of them don't require the gradient and some of them can estimate
>>>>>> it. The reason to go without the gradient is to run the CRUSH
>>>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>>>
>>>>>> - BEGIN CODE -
>>>>>> def build_target(desired_freqs):
>>>>>>     def target(weights):
>>>>>>         # run a simulation of CRUSH for a number of objects
>>>>>>         sim_freqs = run_crush(weights)
>>>>>>         # Kullback-Leibler divergence between desired
>>>>>>         # frequencies and current ones
>>>>>>         return loss(sim_freqs, desired_freqs)
>>>>>>     return target
>>>>>>
>>>>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>>>>> - END CODE -
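[Editor's note: a runnable version of the pseudo-code above. The weighted-sampling simulator standing in for run_crush, the KL loss, and the target frequencies are all illustrative assumptions, not the real libcrush:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)

def run_crush(weights, n_objects=20000):
    """Toy stand-in: place objects proportionally to (clamped) weights."""
    w = np.maximum(weights, 1e-9)
    picks = rng.choice(len(w), size=n_objects, p=w / w.sum())
    return np.bincount(picks, minlength=len(w)) / n_objects

def loss(sim_freqs, desired_freqs):
    """KL divergence between desired and simulated frequencies."""
    q = np.maximum(sim_freqs, 1e-9)
    return float(np.sum(desired_freqs * np.log(desired_freqs / q)))

def build_target(desired_freqs):
    def target(weights):
        return loss(run_crush(weights), desired_freqs)
    return target

desired = np.array([0.5, 0.3, 0.2])
res = minimize(build_target(desired), x0=np.array([1.0, 1.0, 1.0]),
               method='nelder-mead',
               options={'xatol': 1e-3, 'fatol': 1e-6})
print(res.x / res.x.sum())  # close to the desired frequencies
```

Note the target is noisy (each evaluation re-runs the simulation), which is the stochastic-optimization caveat Pedro raises above; with 20,000 samples the noise is small enough for Nelder-Mead to make progress anyway.]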
>>>>>>
>>>>>> The tricky thing here is that this procedure can be slow if the
>>>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>>>> simulated frequencies. This is especially true if the minimize method
>>>>>> attempts to approximate the gradient using finite differences, since it
>>>>>> will evaluate the target function a number of times proportional to
>>>>>> the number of weights. Apart from the ones in scipy I would also try
>>>>>> optimization methods that try to perform as few evaluations as
>>>>>> possible like for example HyperOpt
>>>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>>>> account that the target function can be noisy.
>>>>>>
>>>>>> This black box approximation is simple to implement and makes the
>>>>>> computer do all the work instead of us.
>>>>>> I think that this black box approximation is worth trying even if
>>>>>> it's not the final one because if this approximation works then we
>>>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>>>> algorithm will work for sure.
>>>>>>
>>>>>> I can try this black box approximation this weekend not on the real
>>>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>>>> it works it's just a matter of substituting one simulation with
>>>>>> another and see what happens.
>>>>>>
>>>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>>>> Hi Loic,
>>>>>>>>
>>>>>>>> From what I see everything seems OK.
>>>>>>>
>>>>>>> Cool. I'll keep going in this direction then !
>>>>>>>
>>>>>>>> The interesting thing would be to
>>>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>>>> algorithm.
>>>>>>>
>>>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>>>>
>>>>>>>> That's the work that remains to be done. The only way that
>>>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>>>> optimization method or making an estimation of the gradient.
>>>>>>>
>>>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>>>
>>>>>>>>> Thanks !
>>>>>>>>>
>>>>>>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>> disk 0: 1.39e-01 1.12e-01
>>>>>>>>> disk 1: 1.11e-01 1.10e-01
>>>>>>>>> disk 2: 8.33e-02 1.13e-01
>>>>>>>>> disk 3: 1.39e-01 1.11e-01
>>>>>>>>> disk 4: 1.11e-01 1.11e-01
>>>>>>>>> disk 5: 8.33e-02 1.11e-01
>>>>>>>>> disk 6: 1.39e-01 1.12e-01
>>>>>>>>> disk 7: 1.11e-01 1.12e-01
>>>>>>>>> disk 8: 8.33e-02 1.10e-01
>>>>>>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>>>> ...
>>>>>>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>> disk 0: 1.39e-01 1.42e-01
>>>>>>>>> disk 1: 1.11e-01 1.09e-01
>>>>>>>>> disk 2: 8.33e-02 8.37e-02
>>>>>>>>> disk 3: 1.39e-01 1.40e-01
>>>>>>>>> disk 4: 1.11e-01 1.13e-01
>>>>>>>>> disk 5: 8.33e-02 8.08e-02
>>>>>>>>> disk 6: 1.39e-01 1.38e-01
>>>>>>>>> disk 7: 1.11e-01 1.09e-01
>>>>>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>> disk 0: 2.44e-01 2.36e-01
>>>>>>>>> disk 1: 2.44e-01 2.38e-01
>>>>>>>>> disk 2: 2.44e-01 2.34e-01
>>>>>>>>> disk 3: 2.44e-01 2.38e-01
>>>>>>>>> disk 4: 2.44e-02 5.37e-02
>>>>>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>>>> ...
>>>>>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>> disk 0: 2.44e-01 2.46e-01
>>>>>>>>> disk 1: 2.44e-01 2.44e-01
>>>>>>>>> disk 2: 2.44e-01 2.41e-01
>>>>>>>>> disk 3: 2.44e-01 2.45e-01
>>>>>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>>>
>>>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>
>>>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>
>>>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>>>
>>>>>>>>>>>> I keep working on it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>>>> work on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>
>>>>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>>>> retry_descent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>>>> adapted)!
>>>>>>>>>>>>>
>>>>>>>>>>>>> sage
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
^ permalink raw reply [flat|nested] 70+ messages in thread
* Re: crush multipick anomaly
2017-04-27 16:47 ` Loic Dachary
@ 2017-04-27 22:14 ` Loic Dachary
0 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-04-27 22:14 UTC (permalink / raw)
To: Pedro López-Adeva; +Cc: Ceph Development
TL;DR: either I'm doing something wrong or scipy.optimize's L-BFGS-B does not converge to anything useful.
Trying L-BFGS-B wasn't that difficult. Only eps gave me trouble but I think I chose something sensible. However ... it does not converge to anything useful. The code itself is at http://libcrush.org/dachary/python-crush/blob/b19af6d0da0ac4f8c6d9fb1c8828775539df7feb/tests/test_analyze.py#L235 and a summary of the output is shown below.
I think I'm stuck now, unfortunately. Any idea on how to move forward ?
Cheers
bounds = [(0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None)]
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 10 M = 10
At X0 0 variables are exactly at the bounds
host weights [ 1. 1. 1. 1. 1. 5. 1. 1. 1. 1.]
host kl 0.395525661546
...
host weights [ 7.06935073 0.59036832 0.58504545 0.57290196 0.55298047 0.1
0.54095906 0.60123172 0.54841584 0.68277045]
host kl 0.0511888801117
At iterate 12 f= 5.12013D-02 |proj g|= 1.00098D-03
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
10 12 62 13 1 1 1.001D-03 5.120D-02
F = 5.12013167923604934E-002
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
Warning: more than 10 function and gradient
evaluations in the last line search. Termination
may possibly be caused by a bad search direction.
Cauchy time 0.000E+00 seconds.
Subspace minimization time 0.000E+00 seconds.
Line search time 0.000E+00 seconds.
Total User time 0.000E+00 seconds.
On 04/27/2017 06:47 PM, Loic Dachary wrote:
> Hi Pedro,
>
> After I suspected uniform weights could be a border case, I tried with varying weights and did not get good results. Nelder-Mead also tried negative values for the weights (and why not), which are invalid for CRUSH. And since there is no way to specify value bounds for Nelder-Mead, that makes it a bad candidate for the job.
>
> Next in line seems to be L-BFGS-B [1] which
>
> a) projects a gradient and is likely to run faster
> b) allows a min value to be defined for each value so we won't have negative values
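[Editor's note: a minimal sketch of point b). scipy's L-BFGS-B accepts per-variable `bounds`, so weights can be kept positive; the quadratic objective here is only a placeholder for the CRUSH loss, not the real one:

```python
import numpy as np
from scipy.optimize import minimize

def objective(weights):
    # Placeholder loss whose unconstrained minimum sits at [-1, 2]:
    # without bounds the first weight would go negative.
    return (weights[0] + 1.0) ** 2 + (weights[1] - 2.0) ** 2

# Same (0.1, None) lower bound used later in the thread.
bounds = [(0.1, None), (0.1, None)]
res = minimize(objective, x0=np.array([1.0, 1.0]),
               method='L-BFGS-B', bounds=bounds)
print(res.x)  # first weight is clamped at the 0.1 lower bound
```

The first component ends up pinned to its 0.1 lower bound (an "active bound" in the L-BFGS-B output shown below), while the second is free to reach its minimizer at 2.0.]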
>
> I'll go in this direction unless you tell me "Noooooo this is a baaaaad idea" ;-)
>
> Cheers
>
> [1] https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>
> On 04/27/2017 08:12 AM, Loic Dachary wrote:
>> With 63 hosts instead of 41 we get the same results: from kl 1.9169485575e-04 to kl 3.0384231953e-07 with a maximum difference going from ~8% to ~0.5%. What is interesting (at least to me ;-) is that the weights don't change that much: they all stay in the range (23, 25].
>>
>> Note that all this optimization is done by changing a single weight per host. It is worth trying again with two different weights (which is what you did in https://github.com/plafl/notebooks/blob/master/replication.ipynb). The weight for the first draw stays as it is (i.e. 24) and the weight for the second draw is allowed to change.
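[Editor's note: the closed-form second-draw adjustment proposed at the top of the thread, P(pick i | first pick != i) = w_i / (total - w_i), can be sketched as follows; the weight vector is illustrative and this is the analytical starting point, not the final optimized weights:

```python
def second_draw_weights(weights):
    """Adjust each weight by w / (total - w): the conditional probability
    of picking a device given it was not already chosen in round one."""
    total = sum(weights)
    return [w / (total - w) for w in weights]

# Four devices, one with 1/10 the weight of the others.
weights = [10, 10, 10, 1]
adjusted = second_draw_weights(weights)
shares = [a / sum(adjusted) for a in adjusted]
print(shares)  # the small device's second-draw share shrinks
```

Relative to its raw share of 1/31, the small device's normalized second-draw weight drops, which is exactly the correction needed to stop low-weighted devices from being over-picked in later rounds.]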
>>
>> Before optimization
>>
>> host0 2400 2345 -55 -2.291667 24
>> host1 2400 2434 34 1.416667 24
>> host2 2400 2387 -13 -0.541667 24
>> host3 2400 2351 -49 -2.041667 24
>> host4 2400 2423 23 0.958333 24
>> host5 2400 2456 56 2.333333 24
>> host6 2400 2450 50 2.083333 24
>> host7 2400 2307 -93 -3.875000 24
>> host8 2400 2434 34 1.416667 24
>> host9 2400 2358 -42 -1.750000 24
>> host10 2400 2452 52 2.166667 24
>> host11 2400 2398 -2 -0.083333 24
>> host12 2400 2359 -41 -1.708333 24
>> host13 2400 2403 3 0.125000 24
>> host14 2400 2484 84 3.500000 24
>> host15 2400 2348 -52 -2.166667 24
>> host16 2400 2489 89 3.708333 24
>> host17 2400 2412 12 0.500000 24
>> host18 2400 2416 16 0.666667 24
>> host19 2400 2453 53 2.208333 24
>> host20 2400 2475 75 3.125000 24
>> host21 2400 2413 13 0.541667 24
>> host22 2400 2450 50 2.083333 24
>> host23 2400 2348 -52 -2.166667 24
>> host24 2400 2355 -45 -1.875000 24
>> host25 2400 2348 -52 -2.166667 24
>> host26 2400 2373 -27 -1.125000 24
>> host27 2400 2470 70 2.916667 24
>> host28 2400 2449 49 2.041667 24
>> host29 2400 2420 20 0.833333 24
>> host30 2400 2406 6 0.250000 24
>> host31 2400 2376 -24 -1.000000 24
>> host32 2400 2371 -29 -1.208333 24
>> host33 2400 2395 -5 -0.208333 24
>> host34 2400 2351 -49 -2.041667 24
>> host35 2400 2453 53 2.208333 24
>> host36 2400 2421 21 0.875000 24
>> host37 2400 2393 -7 -0.291667 24
>> host38 2400 2394 -6 -0.250000 24
>> host39 2400 2322 -78 -3.250000 24
>> host40 2400 2409 9 0.375000 24
>> host41 2400 2486 86 3.583333 24
>> host42 2400 2466 66 2.750000 24
>> host43 2400 2409 9 0.375000 24
>> host44 2400 2276 -124 -5.166667 24
>> host45 2400 2379 -21 -0.875000 24
>> host46 2400 2394 -6 -0.250000 24
>> host47 2400 2401 1 0.041667 24
>> host48 2400 2446 46 1.916667 24
>> host49 2400 2349 -51 -2.125000 24
>> host50 2400 2413 13 0.541667 24
>> host51 2400 2333 -67 -2.791667 24
>> host52 2400 2387 -13 -0.541667 24
>> host53 2400 2407 7 0.291667 24
>> host54 2400 2377 -23 -0.958333 24
>> host55 2400 2441 41 1.708333 24
>> host56 2400 2420 20 0.833333 24
>> host57 2400 2388 -12 -0.500000 24
>> host58 2400 2460 60 2.500000 24
>> host59 2400 2394 -6 -0.250000 24
>> host60 2400 2316 -84 -3.500000 24
>> host61 2400 2373 -27 -1.125000 24
>> host62 2400 2362 -38 -1.583333 24
>> host63 2400 2372 -28 -1.166667 24
>>
>> After optimization
>>
>> host0 2400 2403 3 0.125000 24.575153
>> host1 2400 2401 1 0.041667 23.723316
>> host2 2400 2402 2 0.083333 24.168746
>> host3 2400 2399 -1 -0.041667 24.520240
>> host4 2400 2399 -1 -0.041667 23.911445
>> host5 2400 2400 0 0.000000 23.606956
>> host6 2400 2401 1 0.041667 23.714102
>> host7 2400 2400 0 0.000000 25.008463
>> host8 2400 2399 -1 -0.041667 23.557143
>> host9 2400 2399 -1 -0.041667 24.431548
>> host10 2400 2400 0 0.000000 23.494153
>> host11 2400 2401 1 0.041667 23.976621
>> host12 2400 2400 0 0.000000 24.512622
>> host13 2400 2397 -3 -0.125000 24.010814
>> host14 2400 2398 -2 -0.083333 23.229791
>> host15 2400 2402 2 0.083333 24.510854
>> host16 2400 2401 1 0.041667 23.188161
>> host17 2400 2397 -3 -0.125000 23.931915
>> host18 2400 2400 0 0.000000 23.886135
>> host19 2400 2398 -2 -0.083333 23.442129
>> host20 2400 2401 1 0.041667 23.393092
>> host21 2400 2398 -2 -0.083333 23.940452
>> host22 2400 2401 1 0.041667 23.643843
>> host23 2400 2403 3 0.125000 24.592113
>> host24 2400 2402 2 0.083333 24.561842
>> host25 2400 2401 1 0.041667 24.598754
>> host26 2400 2398 -2 -0.083333 24.350951
>> host27 2400 2399 -1 -0.041667 23.336478
>> host28 2400 2401 1 0.041667 23.549652
>> host29 2400 2401 1 0.041667 23.840408
>> host30 2400 2400 0 0.000000 23.932423
>> host31 2400 2397 -3 -0.125000 24.295621
>> host32 2400 2402 2 0.083333 24.298228
>> host33 2400 2403 3 0.125000 24.068700
>> host34 2400 2399 -1 -0.041667 24.395416
>> host35 2400 2398 -2 -0.083333 23.522074
>> host36 2400 2395 -5 -0.208333 23.746354
>> host37 2400 2402 2 0.083333 24.120875
>> host38 2400 2401 1 0.041667 24.034644
>> host39 2400 2400 0 0.000000 24.665110
>> host40 2400 2400 0 0.000000 23.856618
>> host41 2400 2400 0 0.000000 23.265386
>> host42 2400 2398 -2 -0.083333 23.334984
>> host43 2400 2400 0 0.000000 23.950316
>> host44 2400 2404 4 0.166667 25.276133
>> host45 2400 2399 -1 -0.041667 24.272922
>> host46 2400 2399 -1 -0.041667 24.013644
>> host47 2400 2402 2 0.083333 24.113955
>> host48 2400 2404 4 0.166667 23.582616
>> host49 2400 2400 0 0.000000 24.531067
>> host50 2400 2400 0 0.000000 23.784893
>> host51 2400 2401 1 0.041667 24.793213
>> host52 2400 2400 0 0.000000 24.170809
>> host53 2400 2400 0 0.000000 23.783899
>> host54 2400 2399 -1 -0.041667 24.365295
>> host55 2400 2398 -2 -0.083333 23.645767
>> host56 2400 2401 1 0.041667 23.858433
>> host57 2400 2399 -1 -0.041667 24.159351
>> host58 2400 2396 -4 -0.166667 23.430493
>> host59 2400 2402 2 0.083333 24.107154
>> host60 2400 2403 3 0.125000 24.784382
>> host61 2400 2397 -3 -0.125000 24.292784
>> host62 2400 2399 -1 -0.041667 24.404311
>> host63 2400 2400 0 0.000000 24.219422
>>
>>
>> On 04/27/2017 12:25 AM, Loic Dachary wrote:
>>> It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.
>>>
>>> We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
>>>
>>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>>> dc1 102400 102400 0 0.000000 1008
>>> host0 2438 2390 -48 -1.968827 24
>>> host1 2438 2370 -68 -2.789171 24
>>> host2 2438 2493 55 2.255947 24
>>> host3 2438 2396 -42 -1.722724 24
>>> host4 2438 2497 59 2.420016 24
>>> host5 2438 2520 82 3.363413 24
>>> host6 2438 2500 62 2.543068 24
>>> host7 2438 2380 -58 -2.378999 24
>>> host8 2438 2488 50 2.050861 24
>>> host9 2438 2435 -3 -0.123052 24
>>> host10 2438 2440 2 0.082034 24
>>> host11 2438 2472 34 1.394586 24
>>> host12 2438 2346 -92 -3.773585 24
>>> host13 2438 2411 -27 -1.107465 24
>>> host14 2438 2513 75 3.076292 24
>>> host15 2438 2421 -17 -0.697293 24
>>> host16 2438 2469 31 1.271534 24
>>> host17 2438 2419 -19 -0.779327 24
>>> host18 2438 2424 -14 -0.574241 24
>>> host19 2438 2451 13 0.533224 24
>>> host20 2438 2486 48 1.968827 24
>>> host21 2438 2439 1 0.041017 24
>>> host22 2438 2482 44 1.804758 24
>>> host23 2438 2415 -23 -0.943396 24
>>> host24 2438 2389 -49 -2.009844 24
>>> host25 2438 2265 -173 -7.095980 24
>>> host26 2438 2374 -64 -2.625103 24
>>> host27 2438 2529 91 3.732568 24
>>> host28 2438 2495 57 2.337982 24
>>> host29 2438 2433 -5 -0.205086 24
>>> host30 2438 2485 47 1.927810 24
>>> host31 2438 2377 -61 -2.502051 24
>>> host32 2438 2441 3 0.123052 24
>>> host33 2438 2421 -17 -0.697293 24
>>> host34 2438 2359 -79 -3.240361 24
>>> host35 2438 2509 71 2.912223 24
>>> host36 2438 2425 -13 -0.533224 24
>>> host37 2438 2419 -19 -0.779327 24
>>> host38 2438 2403 -35 -1.435603 24
>>> host39 2438 2458 20 0.820345 24
>>> host40 2438 2458 20 0.820345 24
>>> host41 2438 2503 65 2.666120 24
>>>
>>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>>> dc1 102400 102400 0 0.000000 1008
>>> host0 2438 2438 0 0.000000 24.559919
>>> host1 2438 2438 0 0.000000 24.641221
>>> host2 2438 2440 2 0.082034 23.486113
>>> host3 2438 2437 -1 -0.041017 24.525875
>>> host4 2438 2436 -2 -0.082034 23.644304
>>> host5 2438 2440 2 0.082034 23.245287
>>> host6 2438 2442 4 0.164069 23.617162
>>> host7 2438 2439 1 0.041017 24.746174
>>> host8 2438 2436 -2 -0.082034 23.584667
>>> host9 2438 2439 1 0.041017 24.140637
>>> host10 2438 2438 0 0.000000 24.060084
>>> host11 2438 2441 3 0.123052 23.730349
>>> host12 2438 2437 -1 -0.041017 24.948602
>>> host13 2438 2437 -1 -0.041017 24.280851
>>> host14 2438 2436 -2 -0.082034 23.402216
>>> host15 2438 2436 -2 -0.082034 24.272037
>>> host16 2438 2437 -1 -0.041017 23.747867
>>> host17 2438 2436 -2 -0.082034 24.266271
>>> host18 2438 2438 0 0.000000 24.158545
>>> host19 2438 2440 2 0.082034 23.934788
>>> host20 2438 2438 0 0.000000 23.630851
>>> host21 2438 2435 -3 -0.123052 24.001950
>>> host22 2438 2440 2 0.082034 23.623120
>>> host23 2438 2437 -1 -0.041017 24.343138
>>> host24 2438 2438 0 0.000000 24.595820
>>> host25 2438 2439 1 0.041017 25.547510
>>> host26 2438 2437 -1 -0.041017 24.753111
>>> host27 2438 2437 -1 -0.041017 23.288606
>>> host28 2438 2437 -1 -0.041017 23.425059
>>> host29 2438 2438 0 0.000000 24.115941
>>> host30 2438 2441 3 0.123052 23.560539
>>> host31 2438 2438 0 0.000000 24.459911
>>> host32 2438 2440 2 0.082034 24.096746
>>> host33 2438 2437 -1 -0.041017 24.241316
>>> host34 2438 2438 0 0.000000 24.715044
>>> host35 2438 2436 -2 -0.082034 23.424601
>>> host36 2438 2436 -2 -0.082034 24.123606
>>> host37 2438 2439 1 0.041017 24.368997
>>> host38 2438 2440 2 0.082034 24.331532
>>> host39 2438 2439 1 0.041017 23.803561
>>> host40 2438 2437 -1 -0.041017 23.861094
>>> host41 2438 2442 4 0.164069 23.468473
>>>
>>>
>>> On 04/26/2017 11:08 PM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>>>>> Hi Loic,
>>>>>
>>>>> Well, the results are better certainly! Some comments:
>>>>>
>>>>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
>>>>> but I'm not an expert in optimization either. I wonder how it
>>>>> will scale with more weights[1]. My attempt at using scipy's optimize
>>>>> didn't work because you are optimizing a stochastic function, and this
>>>>> can make scipy decide that no further steps are possible. The
>>>>> field that studies this kind of problem is stochastic optimization
>>>>> [2]
>>>>
>>>> You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
>>>>
>>>> Even in a case as simple as 12 devices starting with:
>>>>
>>>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>>>> host1 2560.000000 2580 20.000000 0.781250 24
>>>> device12 106.666667 101 -5.666667 -5.312500 1
>>>> device13 213.333333 221 7.666667 3.593750 2
>>>> device14 320.000000 317 -3.000000 -0.937500 3
>>>> device15 106.666667 101 -5.666667 -5.312500 1
>>>> device16 213.333333 217 3.666667 1.718750 2
>>>> device17 320.000000 342 22.000000 6.875000 3
>>>> device18 106.666667 102 -4.666667 -4.375000 1
>>>> device19 213.333333 243 29.666667 13.906250 2
>>>> device20 320.000000 313 -7.000000 -2.187500 3
>>>> device21 106.666667 94 -12.666667 -11.875000 1
>>>> device22 213.333333 208 -5.333333 -2.500000 2
>>>> device23 320.000000 321 1.000000 0.312500 3
>>>>
>>>> res = minimize(crush, weights, method='nelder-mead',
>>>>                options={'xtol': 1e-8, 'disp': True})
>>>>
>>>> device weights [ 1. 3. 3. 2. 3. 2. 2. 1. 3. 1. 1. 2.]
>>>> device kl 0.00117274995028
>>>> ...
>>>> device kl 0.00016530695476
>>>> Optimization terminated successfully.
>>>> Current function value: 0.000165
>>>> Iterations: 117
>>>> Function evaluations: 470
>>>>
>>>> we still get a 5% difference on device 21:
>>>>
>>>> ~expected~ ~actual~ ~delta~ ~delta%~ ~weight~
>>>> host1 2560.000000 2559 -1.000000 -0.039062 23.805183
>>>> device12 106.666667 103 -3.666667 -3.437500 1.016999
>>>> device13 213.333333 214 0.666667 0.312500 1.949328
>>>> device14 320.000000 325 5.000000 1.562500 3.008688
>>>> device15 106.666667 106 -0.666667 -0.625000 1.012565
>>>> device16 213.333333 214 0.666667 0.312500 1.976344
>>>> device17 320.000000 320 0.000000 0.000000 2.845135
>>>> device18 106.666667 102 -4.666667 -4.375000 1.039181
>>>> device19 213.333333 214 0.666667 0.312500 1.820435
>>>> device20 320.000000 324 4.000000 1.250000 3.062573
>>>> device21 106.666667 101 -5.666667 -5.312500 1.071341
>>>> device22 213.333333 212 -1.333333 -0.625000 2.039190
>>>> device23 320.000000 324 4.000000 1.250000 3.016468
>>>>
>>>>
>>>>> - I used KL divergence for the loss function. My first attempt was
>>>>> using, as you did, the standard deviation (more commonly known as L2
>>>>> loss) with gradient descent, but it didn't work very well.
>>>>>
>>>>> - Sum of differences sounds like a bad idea, +100 and -100 errors will
>>>>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>>>>> Maybe you were talking about the absolute value of the differences?
>>>>>
>>>>> - Well, now that CRUSH can use multiple weights, the problem that
>>>>> remains, I think, is seeing whether the optimization is: a) reliable
>>>>> and b) fast enough
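The sum-of-differences cancellation mentioned above can be seen concretely; this is a small illustrative sketch (the distributions are made up), showing a signed sum missing an error that KL divergence catches:

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) >= 0, and 0 only when the distributions match
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

desired = np.array([0.25, 0.25, 0.25, 0.25])
skewed  = np.array([0.35, 0.15, 0.25, 0.25])

signed_sum = float(np.sum(skewed - desired))   # +0.10 and -0.10 cancel
kl = kl_divergence(desired, skewed)
print(signed_sum, kl)
```

The signed sum is exactly zero despite the skew, while the KL divergence is strictly positive; absolute or squared differences would also avoid the cancellation.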
>>>>>
>>>>> Cheers,
>>>>> Pedro.
>>>>>
>>>>> [1] http://www.benfrederickson.com/numerical-optimization/
>>>>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>>>>
>>>>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>>>>
>>>>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient based optimization. But I'm not sure and maybe this is completely wrong...
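The setup described above can be sketched end to end; this is a hedged stand-in, not the python-crush code: a weighted multinomial draw replaces the CRUSH simulation, the initial guess is the target weights, and the loss is the standard deviation of the count differences (note that because the loss is stochastic, Nelder-Mead may stop at a noisy point rather than the true optimum — the concern raised earlier in the thread).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N_OBJECTS = 100_000
target = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])   # desired device weights
expected = target / target.sum() * N_OBJECTS        # expected object counts

def loss(weights):
    # stand-in for the CRUSH simulation: weighted multinomial placement
    w = np.abs(weights)
    actual = rng.multinomial(N_OBJECTS, w / w.sum())
    return float(np.std(actual - expected))

res = minimize(loss, x0=target, method='nelder-mead',
               options={'xatol': 1e-8, 'disp': False})
print(res.x, res.fun)
```

In the real experiment, `loss` would call the crush simulation instead of `rng.multinomial`; `res.x` is then the set of adjusted weights to feed back into the map.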
>>>>>>
>>>>>> Before optimization the situation is:
>>>>>>
>>>>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>>>>> ~name~
>>>>>> dc1 1024 1024 0 0.000000
>>>>>> host0 256 294 38 14.843750
>>>>>> device0 128 153 25 19.531250
>>>>>> device1 128 141 13 10.156250
>>>>>> host1 256 301 45 17.578125
>>>>>> device2 128 157 29 22.656250
>>>>>> device3 128 144 16 12.500000
>>>>>> host2 512 429 -83 -16.210938
>>>>>> device4 128 96 -32 -25.000000
>>>>>> device5 128 117 -11 -8.593750
>>>>>> device6 256 216 -40 -15.625000
>>>>>>
>>>>>> and after optimization we have the following:
>>>>>>
>>>>>> ~expected~ ~objects~ ~delta~ ~delta%~
>>>>>> ~name~
>>>>>> dc1 1024 1024 0 0.000000
>>>>>> host0 256 259 3 1.171875
>>>>>> device0 128 129 1 0.781250
>>>>>> device1 128 130 2 1.562500
>>>>>> host1 256 258 2 0.781250
>>>>>> device2 128 129 1 0.781250
>>>>>> device3 128 129 1 0.781250
>>>>>> host2 512 507 -5 -0.976562
>>>>>> device4 128 126 -2 -1.562500
>>>>>> device5 128 127 -1 -0.781250
>>>>>> device6 256 254 -2 -0.781250
>>>>>>
>>>>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>>>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>>>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>>>>> [4] https://github.com/ceph/ceph/pull/14486
>>>>>>
>>>>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>>>>> There are lot of gradient-free methods. I will try first to run the
>>>>>>> ones available using just scipy
>>>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>>>> Some of them don't require the gradient and some of them can estimate
>>>>>>> it. The reason to go without the gradient is to run the CRUSH
>>>>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>>>>
>>>>>>> - BEGIN CODE -
>>>>>>> def build_target(desired_freqs):
>>>>>>>     def target(weights):
>>>>>>>         # run a simulation of CRUSH for a number of objects
>>>>>>>         sim_freqs = run_crush(weights)
>>>>>>>         # Kullback-Leibler divergence between desired
>>>>>>>         # frequencies and current ones
>>>>>>>         return loss(sim_freqs, desired_freqs)
>>>>>>>     return target
>>>>>>>
>>>>>>> result = scipy.optimize.minimize(build_target(desired_freqs),
>>>>>>>                                  x0=initial_weights)
>>>>>>> weights = result.x
>>>>>>> - END CODE -
>>>>>>>
>>>>>>> The tricky thing here is that this procedure can be slow if the
>>>>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>>>>> simulated frequencies. This is true especially if the minimize method
>>>>>>> attempts to approximate the gradient using finite differences, since it
>>>>>>> will evaluate the target function a number of times proportional to
>>>>>>> the number of weights. Apart from the ones in scipy I would also try
>>>>>>> optimization methods that try to perform as few evaluations as
>>>>>>> possible, for example HyperOpt
>>>>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>>>>> account that the target function can be noisy.
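The evaluation-count concern can be made concrete. This is an illustrative sketch (the quadratic target and helper names are made up, not from the thread): a forward-difference gradient estimate costs one extra call to the target per weight, so each gradient step multiplies the simulation cost by the number of weights.

```python
calls = 0

def f(x):
    # stand-in target that counts how often it is evaluated
    global calls
    calls += 1
    return sum(v * v for v in x)

def num_grad(f, x, eps=1e-6):
    # forward differences: len(x) + 1 evaluations per gradient estimate
    fx = f(x)
    return [(f(x[:i] + [x[i] + eps] + x[i+1:]) - fx) / eps
            for i in range(len(x))]

g = num_grad(f, [1.0, 2.0, 3.0])
print(calls, g)   # 4 evaluations for 3 weights
```

With hundreds of devices and a simulation that must place thousands of objects per call, this per-gradient cost is exactly why gradient-free or evaluation-frugal methods are attractive here.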
>>>>>>>
>>>>>>> This black box approximation is simple to implement and makes the
>>>>>>> computer do all the work instead of us.
>>>>>>> I think that this black box approximation is worthy to try even if
>>>>>>> it's not the final one because if this approximation works then we
>>>>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>>>>> algorithm will work for sure.
>>>>>>>
>>>>>>> I can try this black box approximation this weekend not on the real
>>>>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>>>>> it works it's just a matter of substituting one simulation with
>>>>>>> another and see what happens.
>>>>>>>
>>>>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>>>>> Hi Loic,
>>>>>>>>>
>>>>>>>>> From what I see everything seems OK.
>>>>>>>>
>>>>>>>> Cool. I'll keep going in this direction then !
>>>>>>>>
>>>>>>>>> The interesting thing would be to
>>>>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>>>>> algorithm.
>>>>>>>>
>>>>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>>>>>
>>>>>>>>> That's the work that remains to be done. The only way that
>>>>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>>>>> optimization method or making an estimation of the gradient.
>>>>>>>>
>>>>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>>>>
>>>>>>>>>> Thanks !
>>>>>>>>>>
>>>>>>>>>> Simulation: R=2 devices capacity [10 8 6 10 8 6 10 8 6]
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>> disk 0: 1.39e-01 1.12e-01
>>>>>>>>>> disk 1: 1.11e-01 1.10e-01
>>>>>>>>>> disk 2: 8.33e-02 1.13e-01
>>>>>>>>>> disk 3: 1.39e-01 1.11e-01
>>>>>>>>>> disk 4: 1.11e-01 1.11e-01
>>>>>>>>>> disk 5: 8.33e-02 1.11e-01
>>>>>>>>>> disk 6: 1.39e-01 1.12e-01
>>>>>>>>>> disk 7: 1.11e-01 1.12e-01
>>>>>>>>>> disk 8: 8.33e-02 1.10e-01
>>>>>>>>>> it= 1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>>>>> it= 2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>>>>> ...
>>>>>>>>>> it= 212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>>>>> it= 213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>> disk 0: 1.39e-01 1.42e-01
>>>>>>>>>> disk 1: 1.11e-01 1.09e-01
>>>>>>>>>> disk 2: 8.33e-02 8.37e-02
>>>>>>>>>> disk 3: 1.39e-01 1.40e-01
>>>>>>>>>> disk 4: 1.11e-01 1.13e-01
>>>>>>>>>> disk 5: 8.33e-02 8.08e-02
>>>>>>>>>> disk 6: 1.39e-01 1.38e-01
>>>>>>>>>> disk 7: 1.11e-01 1.09e-01
>>>>>>>>>> disk 8: 8.33e-02 8.48e-02
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Simulation: R=2 devices capacity [10 10 10 10 1]
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>> disk 0: 2.44e-01 2.36e-01
>>>>>>>>>> disk 1: 2.44e-01 2.38e-01
>>>>>>>>>> disk 2: 2.44e-01 2.34e-01
>>>>>>>>>> disk 3: 2.44e-01 2.38e-01
>>>>>>>>>> disk 4: 2.44e-02 5.37e-02
>>>>>>>>>> it= 1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>>>>> it= 2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>>>>> ...
>>>>>>>>>> it= 37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>>>>> it= 38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>> disk 0: 2.44e-01 2.46e-01
>>>>>>>>>> disk 1: 2.44e-01 2.44e-01
>>>>>>>>>> disk 2: 2.44e-01 2.41e-01
>>>>>>>>>> disk 3: 2.44e-01 2.45e-01
>>>>>>>>>> disk 4: 2.44e-02 2.33e-02
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>>>>
>>>>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>
>>>>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>>
>>>>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I keep working on it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for taking a look at this! It's a frustrating problem and we
>>>>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>>>>> work on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>>>>> change and evolve independently. (In most cases any change in
>>>>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>>>>> before it is enabled or used.)
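One way to generate the second-round set of derivative weights is the conditional-probability adjustment from the start of this thread: P(pick i | first pick != i) is proportional to w_i / (total - w_i). A minimal sketch (device weights are illustrative):

```python
def second_round_weights(weights):
    # P(pick i | first pick != i) = w_i / (total - w_i), so use these
    # adjusted values as the weights for the second replica's placement
    total = sum(weights)
    return [w / (total - w) for w in weights]

w = [10.0, 10.0, 10.0, 10.0, 1.0]
adj = second_round_weights(w)
print(adj)
```

The small device's relative weight drops for the second pick, compensating for the fact that it was rarely removed from the candidate set by the first pick; further rounds/ranks would need their own analogous sets.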
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>>>>> retry_descent.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>>>>> adapted)!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sage
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>
>>
>
--
Loïc Dachary, Artisan Logiciel Libre
Thread overview: 70+ messages (thread ended 2017-04-27 22:14 UTC)
2017-01-26 3:05 crush multipick anomaly Sage Weil
2017-01-26 11:13 ` Loic Dachary
2017-01-26 11:51 ` kefu chai
2017-02-03 14:37 ` Loic Dachary
2017-02-03 14:47 ` Sage Weil
2017-02-03 15:08 ` Loic Dachary
2017-02-03 18:54 ` Loic Dachary
2017-02-06 3:08 ` Jaze Lee
2017-02-06 8:18 ` Loic Dachary
2017-02-06 14:11 ` Jaze Lee
2017-02-06 17:07 ` Loic Dachary
2017-02-03 15:26 ` Dan van der Ster
2017-02-03 17:37 ` Dan van der Ster
2017-02-06 8:31 ` Loic Dachary
2017-02-06 9:13 ` Dan van der Ster
2017-02-06 16:53 ` Dan van der Ster
2017-02-13 10:36 ` Loic Dachary
2017-02-13 14:21 ` Sage Weil
2017-02-13 18:50 ` Loic Dachary
2017-02-13 19:16 ` Sage Weil
2017-02-13 20:18 ` Loic Dachary
2017-02-13 21:01 ` Loic Dachary
2017-02-13 21:15 ` Sage Weil
2017-02-13 21:19 ` Gregory Farnum
2017-02-13 21:26 ` Sage Weil
2017-02-13 21:43 ` Loic Dachary
2017-02-16 22:04 ` Pedro López-Adeva
2017-02-22 7:52 ` Loic Dachary
2017-02-22 11:26 ` Pedro López-Adeva
2017-02-22 11:38 ` Loic Dachary
2017-02-22 11:46 ` Pedro López-Adeva
2017-02-25 0:38 ` Loic Dachary
2017-02-25 8:41 ` Pedro López-Adeva
2017-02-25 9:02 ` Loic Dachary
2017-03-02 9:43 ` Loic Dachary
2017-03-02 9:58 ` Pedro López-Adeva
2017-03-02 10:31 ` Loic Dachary
2017-03-07 23:06 ` Sage Weil
2017-03-09 8:47 ` Pedro López-Adeva
2017-03-18 9:21 ` Loic Dachary
2017-03-19 22:31 ` Loic Dachary
2017-03-20 10:49 ` Loic Dachary
2017-03-23 11:49 ` Pedro López-Adeva
2017-03-23 14:13 ` Loic Dachary
2017-03-23 15:32 ` Pedro López-Adeva
2017-03-23 16:18 ` Loic Dachary
2017-03-25 18:42 ` Sage Weil
[not found] ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
2017-03-27 2:33 ` Sage Weil
2017-03-27 6:45 ` Loic Dachary
[not found] ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
2017-03-27 9:27 ` Adam Kupczyk
2017-03-27 10:29 ` Loic Dachary
2017-03-27 10:37 ` Pedro López-Adeva
2017-03-27 13:39 ` Sage Weil
2017-03-28 6:52 ` Adam Kupczyk
2017-03-28 9:49 ` Spandan Kumar Sahu
2017-03-28 13:35 ` Sage Weil
2017-03-27 13:24 ` Sage Weil
2017-04-11 15:22 ` Loic Dachary
2017-04-22 16:51 ` Loic Dachary
2017-04-25 15:04 ` Pedro López-Adeva
2017-04-25 17:46 ` Loic Dachary
2017-04-26 21:08 ` Loic Dachary
2017-04-26 22:25 ` Loic Dachary
2017-04-27 6:12 ` Loic Dachary
2017-04-27 16:47 ` Loic Dachary
2017-04-27 22:14 ` Loic Dachary
2017-02-13 14:53 ` Gregory Farnum
2017-02-20 8:47 ` Loic Dachary
2017-02-20 17:32 ` Gregory Farnum
2017-02-20 19:31 ` Loic Dachary