* crush multipick anomaly
@ 2017-01-26  3:05 Sage Weil
  2017-01-26 11:13 ` Loic Dachary
  2017-02-13 10:36 ` Loic Dachary
  0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-01-26  3:05 UTC (permalink / raw)
  To: ceph-devel

This is a longstanding bug,

	http://tracker.ceph.com/issues/15653

that causes low-weighted devices to get more data than they should. Loic's 
recent activity resurrected discussion on the original PR

	https://github.com/ceph/ceph/pull/10218

but since it's closed and almost nobody will see it I'm moving the 
discussion here.

The main news is that I have a simple adjustment for the weights that 
works (almost perfectly) for the 2nd round of placements.  The solution is 
pretty simple, although as with most probabilities it tends to make my 
brain hurt.

The idea is that, on the second round, the original weight for the small 
OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
P(pick small | first pick not small).  Since P(a|b) (the probability of a 
given b) is P(a && b) / P(b),

 P(pick small | first pick not small)
 = P(pick small && first pick not small) / P(first pick not small)

The last term is easy to calculate,

 P(first pick not small) = (total_weight - small_weight) / total_weight

and the && term is the distribution we're trying to produce.  For example, 
if small has 1/10 the weight, then we should see 1/10th of the PGs have 
their second replica be the small OSD.  So

 P(pick small && first pick not small) = small_weight / total_weight

Putting those together,

 P(pick small | first pick not small)
 = P(pick small && first pick not small) / P(first pick not small)
 = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
 = small_weight / (total_weight - small_weight)

That is, on the second round, we should adjust the weights by the above so 
that we get the right distribution of second choices.  It turns out it 
works to adjust *all* weights like this to get the conditional probability 
that they weren't already chosen.
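
Here is a quick standalone sketch of the same idea (toy Python, not the 
straw2 code; the bucket matches the test below and the trial count is 
arbitrary).  It draws the first replica by raw weight and the second by 
the adjusted weight w / (total - w) over the remaining devices:

import random
from collections import Counter

weights = [99, 99, 99, 99, 4]     # same toy bucket as the crushtool test below
total = sum(weights)
trials = 200000

picks = Counter()
for _ in range(trials):
    # first replica: plain weighted choice
    first = random.choices(range(len(weights)), weights=weights)[0]
    picks[first] += 1
    # second replica: re-weight every device by w / (total - w), then
    # choose among the devices that were not already picked
    adjusted = [w / (total - w) for w in weights]
    rest = [i for i in range(len(weights)) if i != first]
    second = random.choices(rest, weights=[adjusted[i] for i in rest])[0]
    picks[second] += 1

for i, w in enumerate(weights):
    print("device %d: weight %3d  observed %.4f  target %.4f"
          % (i, w, picks[i] / (2.0 * trials), w / float(total)))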

I have a branch that hacks this into straw2 and it appears to work 
properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
current code, you get

$ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
rule 0 (data), x = 0..40000000, numrep = 2..2
rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
  device 0:             19765965        [9899364,9866601]
  device 1:             19768033        [9899444,9868589]
  device 2:             19769938        [9901770,9868168]
  device 3:             19766918        [9898851,9868067]
  device 6:             929148  [400572,528576]

which is very close for the first replica (primary), but way off for the 
second.  With my hacky change,

rule 0 (data), x = 0..40000000, numrep = 2..2
rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
  device 0:             19797315        [9899364,9897951]
  device 1:             19799199        [9899444,9899755]
  device 2:             19801016        [9901770,9899246]
  device 3:             19797906        [9898851,9899055]
  device 6:             804566  [400572,403994]

which is quite close, but still skewing slightly high (by a bit less than 
1%).
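
To put a number on "slightly high": with a total weight of 400 the small 
device should get 4/400 = 1% of each replica slot, i.e. about 400,000 of 
the 40,000,001 second picks.  A back-of-the-envelope check against the 
two runs above (plain Python, counts copied from the output):

total_weight = 99 * 4 + 4               # 400
pgs = 40000001

expected = pgs * 4.0 / total_weight     # ideal second-replica count on device 6
print(expected)                         # 400000.01
print(528576 / expected)                # ~1.32: ~32% too many with the current code
print(403994 / expected)                # ~1.01: ~1% too many with the hacky change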

Next steps:

1- generalize this for >2 replicas
2- figure out why it skews high
3- make this work for multi-level hierarchical descent

sage



* Re: crush multipick anomaly
  2017-01-26  3:05 crush multipick anomaly Sage Weil
@ 2017-01-26 11:13 ` Loic Dachary
  2017-01-26 11:51   ` kefu chai
  2017-02-03 14:37   ` Loic Dachary
  2017-02-13 10:36 ` Loic Dachary
  1 sibling, 2 replies; 70+ messages in thread
From: Loic Dachary @ 2017-01-26 11:13 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi Sage,

Still trying to understand what you did :-) I have one question below.

On 01/26/2017 04:05 AM, Sage Weil wrote:
> This is a longstanding bug,
> 
> 	http://tracker.ceph.com/issues/15653
> 
> that causes low-weighted devices to get more data than they should. Loic's 
> recent activity resurrected discussion on the original PR
> 
> 	https://github.com/ceph/ceph/pull/10218
> 
> but since it's closed and almost nobody will see it I'm moving the 
> discussion here.
> 
> The main news is that I have a simple adjustment for the weights that 
> works (almost perfectly) for the 2nd round of placements.  The solution is 
> pretty simple, although as with most probabilities it tends to make my 
> brain hurt.
> 
> The idea is that, on the second round, the original weight for the small 
> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
> given b) is P(a && b) / P(b),

For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition

> 
>  P(pick small | first pick not small)
>  = P(pick small && first pick not small) / P(first pick not small)
> 
> The last term is easy to calculate,
> 
>  P(first pick not small) = (total_weight - small_weight) / total_weight
> 
> and the && term is the distribution we're trying to produce.  

https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?

> For example, 
> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
> their second replica be the small OSD.  So
> 
>  P(pick small && first pick not small) = small_weight / total_weight
> 
> Putting those together,
> 
>  P(pick small | first pick not small)
>  = P(pick small && first pick not small) / P(first pick not small)
>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>  = small_weight / (total_weight - small_weight)
> 
> That is, on the second round, we should adjust the weights by the above so 
> that we get the right distribution of second choices.  It turns out it 
> works to adjust *all* weights like this to get the conditional probability 
> that they weren't already chosen.
> 
> I have a branch that hacks this into straw2 and it appears to work 

This is https://github.com/liewegas/ceph/commit/wip-crush-multipick

> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
> current code, you get
> 
> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> rule 0 (data), x = 0..40000000, numrep = 2..2
> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>   device 0:             19765965        [9899364,9866601]
>   device 1:             19768033        [9899444,9868589]
>   device 2:             19769938        [9901770,9868168]
>   device 3:             19766918        [9898851,9868067]
>   device 6:             929148  [400572,528576]
> 
> which is very close for the first replica (primary), but way off for the 
> second.  With my hacky change,
> 
> rule 0 (data), x = 0..40000000, numrep = 2..2
> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>   device 0:             19797315        [9899364,9897951]
>   device 1:             19799199        [9899444,9899755]
>   device 2:             19801016        [9901770,9899246]
>   device 3:             19797906        [9898851,9899055]
>   device 6:             804566  [400572,403994]
> 
> which is quite close, but still skewing slightly high (by a bit less than 
> 1%).
> 
> Next steps:
> 
> 1- generalize this for >2 replicas
> 2- figure out why it skews high
> 3- make this work for multi-level hierarchical descent
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush multipick anomaly
  2017-01-26 11:13 ` Loic Dachary
@ 2017-01-26 11:51   ` kefu chai
  2017-02-03 14:37   ` Loic Dachary
  1 sibling, 0 replies; 70+ messages in thread
From: kefu chai @ 2017-01-26 11:51 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, ceph-devel

On Thu, Jan 26, 2017 at 7:13 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi Sage,
>
> Still trying to understand what you did :-) I have one question below.
>
> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> This is a longstanding bug,
>>
>>       http://tracker.ceph.com/issues/15653
>>
>> that causes low-weighted devices to get more data than they should. Loic's
>> recent activity resurrected discussion on the original PR
>>
>>       https://github.com/ceph/ceph/pull/10218
>>
>> but since it's closed and almost nobody will see it I'm moving the
>> discussion here.
>>
>> The main news is that I have a simple adjustment for the weights that
>> works (almost perfectly) for the 2nd round of placements.  The solution is
>> pretty simple, although as with most probabilities it tends to make my
>> brain hurt.
>>
>> The idea is that, on the second round, the original weight for the small
>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>> given b) is P(a && b) / P(b),
>
> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>
>>
>>  P(pick small | first pick not small)
>>  = P(pick small && first pick not small) / P(first pick not small)
>>
>> The last term is easy to calculate,
>>
>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>
>> and the && term is the distribution we're trying to produce.
>
> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?

A joint event of A and B means the two events occur together. Here,
A and B are two events, so the notation P(A && B), or P(A∩B), stands
for the probability that event A and event B both happen. In our case,
"a" denotes the event "pick small" and "b" denotes "first pick not
small", so P(a∩b) is the probability that the first pick is not small
**and** the second pick is small. Maybe you can also reference
https://en.wikipedia.org/wiki/Joint_probability_distribution
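
A small numeric illustration, using a made-up 3/3/3/1 bucket (not from
the thread) and treating the retry after a duplicate pick as a weighted
choice among the remaining devices:

# "small" has 1/10 of the total weight (3 + 3 + 3 + 1 = 10)
weights = {'a': 3, 'b': 3, 'c': 3, 'small': 1}
total = sum(weights.values())

# what happens today: first pick by weight, second pick by weight among
# the remaining devices -> P(second pick = small) comes out too high
p_second_small = sum((w / total) * (weights['small'] / (total - w))
                     for name, w in weights.items() if name != 'small')
print(p_second_small)                       # 9/70 ~= 0.129, not the desired 0.10

# what we want: the joint P(first not small && second small) equal to
# small_weight / total, which forces the conditional probability to be
p_first_not_small = (total - weights['small']) / total
p_joint_target = weights['small'] / total
print(p_joint_target / p_first_not_small)   # 1/9 = small_weight / (total - small_weight)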

>
>> For example,
>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> their second replica be the small OSD.  So
>>
>>  P(pick small && first pick not small) = small_weight / total_weight
>>
>> Putting those together,
>>
>>  P(pick small | first pick not small)
>>  = P(pick small && first pick not small) / P(first pick not small)
>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>  = small_weight / (total_weight - small_weight)
>>
>> That is, on the second round, we should adjust the weights by the above so
>> that we get the right distribution of second choices.  It turns out it
>> works to adjust *all* weights like this to get the conditional probability
>> that they weren't already chosen.
>>
>> I have a branch that hacks this into straw2 and it appears to work
>
> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>
>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>> current code, you get
>>
>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>   device 0:             19765965        [9899364,9866601]
>>   device 1:             19768033        [9899444,9868589]
>>   device 2:             19769938        [9901770,9868168]
>>   device 3:             19766918        [9898851,9868067]
>>   device 6:             929148  [400572,528576]
>>
>> which is very close for the first replica (primary), but way off for the
>> second.  With my hacky change,
>>
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>   device 0:             19797315        [9899364,9897951]
>>   device 1:             19799199        [9899444,9899755]
>>   device 2:             19801016        [9901770,9899246]
>>   device 3:             19797906        [9898851,9899055]
>>   device 6:             804566  [400572,403994]
>>
>> which is quite close, but still skewing slightly high (by a bit less than
>> 1%).
>>
>> Next steps:
>>
>> 1- generalize this for >2 replicas
>> 2- figure out why it skews high
>> 3- make this work for multi-level hierarchical descent
>>
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Regards
Kefu Chai


* Re: crush multipick anomaly
  2017-01-26 11:13 ` Loic Dachary
  2017-01-26 11:51   ` kefu chai
@ 2017-02-03 14:37   ` Loic Dachary
  2017-02-03 14:47     ` Sage Weil
  1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-03 14:37 UTC (permalink / raw)
  To: Sage Weil, ceph-devel



On 01/26/2017 12:13 PM, Loic Dachary wrote:
> Hi Sage,
> 
> Still trying to understand what you did :-) I have one question below.
> 
> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> This is a longstanding bug,
>>
>> 	http://tracker.ceph.com/issues/15653
>>
>> that causes low-weighted devices to get more data than they should. Loic's 
>> recent activity resurrected discussion on the original PR
>>
>> 	https://github.com/ceph/ceph/pull/10218
>>
>> but since it's closed and almost nobody will see it I'm moving the 
>> discussion here.
>>
>> The main news is that I have a simple adjustment for the weights that 
>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>> pretty simple, although as with most probabilities it tends to make my 
>> brain hurt.
>>
>> The idea is that, on the second round, the original weight for the small 
>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>> given b) is P(a && b) / P(b),
> 
> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
> 
>>
>>  P(pick small | first pick not small)
>>  = P(pick small && first pick not small) / P(first pick not small)
>>
>> The last term is easy to calculate,
>>
>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>
>> and the && term is the distribution we're trying to produce.  
> 
> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
> 
>> For example, 
>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>> their second replica be the small OSD.  So
>>
>>  P(pick small && first pick not small) = small_weight / total_weight
>>
>> Putting those together,
>>
>>  P(pick small | first pick not small)
>>  = P(pick small && first pick not small) / P(first pick not small)
>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>  = small_weight / (total_weight - small_weight)
>>
>> That is, on the second round, we should adjust the weights by the above so 
>> that we get the right distribution of second choices.  It turns out it 
>> works to adjust *all* weights like this to get the conditional probability 
>> that they weren't already chosen.
>>
>> I have a branch that hacks this into straw2 and it appears to work 
> 
> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick

In

https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316

double neww = oldw / (bucketw - oldw) * bucketw;

I don't get why we need  "* bucketw" at the end ?

> 
>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>> current code, you get
>>
>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>   device 0:             19765965        [9899364,9866601]
>>   device 1:             19768033        [9899444,9868589]
>>   device 2:             19769938        [9901770,9868168]
>>   device 3:             19766918        [9898851,9868067]
>>   device 6:             929148  [400572,528576]
>>
>> which is very close for the first replica (primary), but way off for the 
>> second.  With my hacky change,
>>
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>   device 0:             19797315        [9899364,9897951]
>>   device 1:             19799199        [9899444,9899755]
>>   device 2:             19801016        [9901770,9899246]
>>   device 3:             19797906        [9898851,9899055]
>>   device 6:             804566  [400572,403994]
>>
>> which is quite close, but still skewing slightly high (by a bit less than 
>> 1%).
>>
>> Next steps:
>>
>> 1- generalize this for >2 replicas
>> 2- figure out why it skews high
>> 3- make this work for multi-level hierarchical descent
>>
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush multipick anomaly
  2017-02-03 14:37   ` Loic Dachary
@ 2017-02-03 14:47     ` Sage Weil
  2017-02-03 15:08       ` Loic Dachary
  2017-02-03 15:26       ` Dan van der Ster
  0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-03 14:47 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel


On Fri, 3 Feb 2017, Loic Dachary wrote:
> On 01/26/2017 12:13 PM, Loic Dachary wrote:
> > Hi Sage,
> > 
> > Still trying to understand what you did :-) I have one question below.
> > 
> > On 01/26/2017 04:05 AM, Sage Weil wrote:
> >> This is a longstanding bug,
> >>
> >> 	http://tracker.ceph.com/issues/15653
> >>
> >> that causes low-weighted devices to get more data than they should. Loic's 
> >> recent activity resurrected discussion on the original PR
> >>
> >> 	https://github.com/ceph/ceph/pull/10218
> >>
> >> but since it's closed and almost nobody will see it I'm moving the 
> >> discussion here.
> >>
> >> The main news is that I have a simple adjustment for the weights that 
> >> works (almost perfectly) for the 2nd round of placements.  The solution is 
> >> pretty simple, although as with most probabilities it tends to make my 
> >> brain hurt.
> >>
> >> The idea is that, on the second round, the original weight for the small 
> >> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
> >> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
> >> given b) is P(a && b) / P(b),
> > 
> > For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
> > 
> >>
> >>  P(pick small | first pick not small)
> >>  = P(pick small && first pick not small) / P(first pick not small)
> >>
> >> The last term is easy to calculate,
> >>
> >>  P(first pick not small) = (total_weight - small_weight) / total_weight
> >>
> >> and the && term is the distribution we're trying to produce.  
> > 
> > https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
> > 
> >> For example, 
> >> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
> >> their second replica be the small OSD.  So
> >>
> >>  P(pick small && first pick not small) = small_weight / total_weight
> >>
> >> Putting those together,
> >>
> >>  P(pick small | first pick not small)
> >>  = P(pick small && first pick not small) / P(first pick not small)
> >>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >>  = small_weight / (total_weight - small_weight)
> >>
> >> That is, on the second round, we should adjust the weights by the above so 
> >> that we get the right distribution of second choices.  It turns out it 
> >> works to adjust *all* weights like this to get the conditional probability 
> >> that they weren't already chosen.
> >>
> >> I have a branch that hacks this into straw2 and it appears to work 
> > 
> > This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
> 
> In
> 
> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
> 
> double neww = oldw / (bucketw - oldw) * bucketw;
> 
> I don't get why we need  "* bucketw" at the end ?

It's just to keep the values within a reasonable range so that we don't 
lose precision by dropping down into small integers.
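
A trivial check, outside of the crush code, that the trailing "* bucketw" 
only rescales the adjusted weights and does not change the resulting 
distribution once they are normalized (only the magnitude changes, which 
is what matters for not losing precision in the integer math):

weights = [99, 99, 99, 99, 4]
bucketw = float(sum(weights))

adjusted = [w / (bucketw - w) for w in weights]
scaled = [w / (bucketw - w) * bucketw for w in weights]   # with the extra "* bucketw"

# the common factor cancels out after normalization, so the selection
# probabilities are identical
print([a / sum(adjusted) for a in adjusted])
print([s / sum(scaled) for s in scaled])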

I futzed around with this some more last week trying to get the third 
replica to work and ended up doubting that this piece is correct.  The 
ratio between the big and small OSDs in my [99 99 99 99 4] example varies 
slightly from what I would expect from first principles and what I get out 
of this derivation by about 1%, which would explain the bias I was seeing.

I'm hoping we can find someone with a strong stats/probability background 
and loads of free time who can tackle this...

sage


> 
> > 
> >> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
> >> current code, you get
> >>
> >> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> >> rule 0 (data), x = 0..40000000, numrep = 2..2
> >> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >>   device 0:             19765965        [9899364,9866601]
> >>   device 1:             19768033        [9899444,9868589]
> >>   device 2:             19769938        [9901770,9868168]
> >>   device 3:             19766918        [9898851,9868067]
> >>   device 6:             929148  [400572,528576]
> >>
> >> which is very close for the first replica (primary), but way off for the 
> >> second.  With my hacky change,
> >>
> >> rule 0 (data), x = 0..40000000, numrep = 2..2
> >> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >>   device 0:             19797315        [9899364,9897951]
> >>   device 1:             19799199        [9899444,9899755]
> >>   device 2:             19801016        [9901770,9899246]
> >>   device 3:             19797906        [9898851,9899055]
> >>   device 6:             804566  [400572,403994]
> >>
> >> which is quite close, but still skewing slightly high (by a bit less than 
> >> 1%).
> >>
> >> Next steps:
> >>
> >> 1- generalize this for >2 replicas
> >> 2- figure out why it skews high
> >> 3- make this work for multi-level hierarchical descent
> >>
> >> sage
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: crush multipick anomaly
  2017-02-03 14:47     ` Sage Weil
@ 2017-02-03 15:08       ` Loic Dachary
  2017-02-03 18:54         ` Loic Dachary
  2017-02-03 15:26       ` Dan van der Ster
  1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-03 15:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 02/03/2017 03:47 PM, Sage Weil wrote:
> On Fri, 3 Feb 2017, Loic Dachary wrote:
>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> Still trying to understand what you did :-) I have one question below.
>>>
>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>> This is a longstanding bug,
>>>>
>>>> 	http://tracker.ceph.com/issues/15653
>>>>
>>>> that causes low-weighted devices to get more data than they should. Loic's 
>>>> recent activity resurrected discussion on the original PR
>>>>
>>>> 	https://github.com/ceph/ceph/pull/10218
>>>>
>>>> but since it's closed and almost nobody will see it I'm moving the 
>>>> discussion here.
>>>>
>>>> The main news is that I have a simple adjustment for the weights that 
>>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>>>> pretty simple, although as with most probabilities it tends to make my 
>>>> brain hurt.
>>>>
>>>> The idea is that, on the second round, the original weight for the small 
>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>>>> given b) is P(a && b) / P(b),
>>>
>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>
>>>>
>>>>  P(pick small | first pick not small)
>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>
>>>> The last term is easy to calculate,
>>>>
>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>
>>>> and the && term is the distribution we're trying to produce.  
>>>
>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>
>>>> For example, 
>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>>>> their second replica be the small OSD.  So
>>>>
>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>
>>>> Putting those together,
>>>>
>>>>  P(pick small | first pick not small)
>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>  = small_weight / (total_weight - small_weight)
>>>>
>>>> That is, on the second round, we should adjust the weights by the above so 
>>>> that we get the right distribution of second choices.  It turns out it 
>>>> works to adjust *all* weights like this to get the conditional probability 
>>>> that they weren't already chosen.
>>>>
>>>> I have a branch that hacks this into straw2 and it appears to work 
>>>
>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>
>> In
>>
>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>
>> double neww = oldw / (bucketw - oldw) * bucketw;
>>
>> I don't get why we need  "* bucketw" at the end ?
> 
> It's just to keep the values within a reasonable range so that we don't 
> lose precision by dropping down into small integers.
> 
> I futzed around with this some more last week trying to get the third 
> replica to work and ended up doubting that this piece is correct.  The 
> ratio between the big and small OSDs in my [99 99 99 99 4] example varies 
> slightly from what I would expect from first principles and what I get out 
> of this derivation by about 1%, which would explain the bias I was seeing.
> 
> I'm hoping we can find someone with a strong stats/probability background 
> and loads of free time who can tackle this...
> 

It would help to formulate the problem into a self-contained puzzle to present to a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
 
> sage
> 
> 
>>
>>>
>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>>>> current code, you get
>>>>
>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>   device 0:             19765965        [9899364,9866601]
>>>>   device 1:             19768033        [9899444,9868589]
>>>>   device 2:             19769938        [9901770,9868168]
>>>>   device 3:             19766918        [9898851,9868067]
>>>>   device 6:             929148  [400572,528576]
>>>>
>>>> which is very close for the first replica (primary), but way off for the 
>>>> second.  With my hacky change,
>>>>
>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>   device 0:             19797315        [9899364,9897951]
>>>>   device 1:             19799199        [9899444,9899755]
>>>>   device 2:             19801016        [9901770,9899246]
>>>>   device 3:             19797906        [9898851,9899055]
>>>>   device 6:             804566  [400572,403994]
>>>>
>>>> which is quite close, but still skewing slightly high (by a bit less than 
>>>> 1%).
>>>>
>>>> Next steps:
>>>>
>>>> 1- generalize this for >2 replicas
>>>> 2- figure out why it skews high
>>>> 3- make this work for multi-level hierarchical descent
>>>>
>>>> sage
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush multipick anomaly
  2017-02-03 14:47     ` Sage Weil
  2017-02-03 15:08       ` Loic Dachary
@ 2017-02-03 15:26       ` Dan van der Ster
  2017-02-03 17:37         ` Dan van der Ster
  1 sibling, 1 reply; 70+ messages in thread
From: Dan van der Ster @ 2017-02-03 15:26 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 3 Feb 2017, Loic Dachary wrote:
>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>> > Hi Sage,
>> >
>> > Still trying to understand what you did :-) I have one question below.
>> >
>> > On 01/26/2017 04:05 AM, Sage Weil wrote:
>> >> This is a longstanding bug,
>> >>
>> >>    http://tracker.ceph.com/issues/15653
>> >>
>> >> that causes low-weighted devices to get more data than they should. Loic's
>> >> recent activity resurrected discussion on the original PR
>> >>
>> >>    https://github.com/ceph/ceph/pull/10218
>> >>
>> >> but since it's closed and almost nobody will see it I'm moving the
>> >> discussion here.
>> >>
>> >> The main news is that I have a simple adjustment for the weights that
>> >> works (almost perfectly) for the 2nd round of placements.  The solution is
>> >> pretty simple, although as with most probabilities it tends to make my
>> >> brain hurt.
>> >>
>> >> The idea is that, on the second round, the original weight for the small
>> >> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>> >> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>> >> given b) is P(a && b) / P(b),
>> >
>> > For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>> >
>> >>
>> >>  P(pick small | first pick not small)
>> >>  = P(pick small && first pick not small) / P(first pick not small)
>> >>
>> >> The last term is easy to calculate,
>> >>
>> >>  P(first pick not small) = (total_weight - small_weight) / total_weight
>> >>
>> >> and the && term is the distribution we're trying to produce.
>> >
>> > https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>> >
>> >> For example,
>> >> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> >> their second replica be the small OSD.  So
>> >>
>> >>  P(pick small && first pick not small) = small_weight / total_weight
>> >>
>> >> Putting those together,
>> >>
>> >>  P(pick small | first pick not small)
>> >>  = P(pick small && first pick not small) / P(first pick not small)
>> >>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>> >>  = small_weight / (total_weight - small_weight)
>> >>
>> >> That is, on the second round, we should adjust the weights by the above so
>> >> that we get the right distribution of second choices.  It turns out it
>> >> works to adjust *all* weights like this to get the conditional probability
>> >> that they weren't already chosen.
>> >>
>> >> I have a branch that hacks this into straw2 and it appears to work
>> >
>> > This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>
>> In
>>
>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>
>> double neww = oldw / (bucketw - oldw) * bucketw;
>>
>> I don't get why we need  "* bucketw" at the end ?
>
> It's just to keep the values within a reasonable range so that we don't
> lose precision by dropping down into small integers.
>
> I futzed around with this some more last week trying to get the third
> replica to work and ended up doubting that this piece is correct.  The
> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
> slightly from what I would expect from first principles and what I get out
> of this derivation by about 1%, which would explain the bias I was seeing.
>
> I'm hoping we can find someone with a strong stats/probability background
> and loads of free time who can tackle this...
>

I'm *not* that person, but I gave it a go last weekend and realized a
few things:

1. We should add the additional constraint that for all PGs assigned
to an OSD, 1/N of them must be primary replicas, 1/N must be
secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
a 3 replica pool, the "small" OSD should still have the property that
1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.

2. I believe this is a case of the balls-into-bins problem -- we have
colored balls and weighted bins. I didn't find a definition of the
problem where the goal is to allow users to specify weights which must
be respected after N rounds.

3. I wrote some quick python to simulate different reweighting
algorithms. The solution is definitely not obvious - I often thought
I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
changing the OSD weights to e.g. 3, 3, 1, 1 completely broke things.
I can clean up and share that python if it can help.

My gut feeling is that because CRUSH trees and rulesets can be
arbitrarily complex, the most pragmatic & reliable way to solve this
problem is to balance the PGs with a reweight-by-pg loop at crush
compilation time. This is what admins should do now -- we should just
automate it.
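
For what it's worth, here is a toy sketch of what such an automated loop
could look like (nothing crush-specific: the placement model, the damping
exponent and the iteration counts are all arbitrary, just to illustrate
the idea):

import random
from collections import Counter

def simulate(weights, num_rep, pgs, seed=42):
    # toy placement: per PG, pick num_rep distinct devices, each pick
    # weighted by the (possibly adjusted) weights of the remaining devices
    rnd = random.Random(seed)
    counts = Counter()
    for _ in range(pgs):
        remaining = list(range(len(weights)))
        w = list(weights)
        for _ in range(num_rep):
            i = rnd.choices(range(len(remaining)), weights=w)[0]
            counts[remaining.pop(i)] += 1
            w.pop(i)
    return counts

def reweight_by_pg(target, num_rep=3, pgs=20000, rounds=10):
    # iteratively nudge the internal weights until the observed PG counts
    # match the target weights
    weights = [float(t) for t in target]
    total = float(sum(target))
    for _ in range(rounds):
        counts = simulate(weights, num_rep, pgs)
        for i, t in enumerate(target):
            expected = pgs * num_rep * t / total
            weights[i] *= (expected / max(counts[i], 1)) ** 0.5   # damped correction
    return weights

print(reweight_by_pg([3, 3, 3, 1]))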

Cheers, Dan

P.S. -- maybe these guys can help: http://math.stackexchange.com/


* Re: crush multipick anomaly
  2017-02-03 15:26       ` Dan van der Ster
@ 2017-02-03 17:37         ` Dan van der Ster
  2017-02-06  8:31           ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Dan van der Ster @ 2017-02-03 17:37 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

Anyway, here's my simple simulation. It might be helpful for testing
ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992

Below is the output using the P(pick small | first pick not small)
observation, with OSDs having weights 3, 3, 3, and 1 respectively. It
seems to *almost* work, but only when we have just one small OSD.

See the end of the script for other various ideas.

-- Dan

> python mpa.py
OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}

Expected PGs per OSD:       {0: 90000, 1: 90000, 2: 90000, 3: 30000}

Simulating with existing CRUSH

Observed:                   {0: 85944, 1: 85810, 2: 85984, 3: 42262}
Observed for Nth replica:   [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
                             {0: 29037, 1: 29073, 2: 29041, 3: 12849},
                             {0: 26971, 1: 26692, 2: 26882, 3: 19455}]

Now trying your new algorithm

Observed:                   {0: 89423, 1: 89443, 2: 89476, 3: 31658}
Observed for Nth replica:   [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
                             {0: 29936, 1: 29964, 2: 29796, 3: 10304},
                             {0: 29384, 1: 29347, 2: 29875, 3: 11394}]
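
For reference, a compact stand-in for the kind of simulation above (not
the actual gist; it applies the w / (total - w) adjustment on every pick
after the first and tallies per-rank counts, with arbitrary weights and
PG count):

import random
from collections import Counter

osds = {0: 3, 1: 3, 2: 3, 3: 1}
num_rep = 3
pgs = 100000
total = sum(osds.values())
rnd = random.Random(1)

per_rank = [Counter() for _ in range(num_rep)]
for _ in range(pgs):
    chosen = []
    for rank in range(num_rep):
        cands = [o for o in osds if o not in chosen]
        if rank == 0:
            w = [osds[o] for o in cands]                      # plain weights
        else:
            w = [osds[o] / (total - osds[o]) for o in cands]  # conditional adjustment
        pick = rnd.choices(cands, weights=w)[0]
        chosen.append(pick)
        per_rank[rank][pick] += 1

for rank, counts in enumerate(per_rank):
    print("rank %d:" % rank, dict(sorted(counts.items())))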


On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>> > Hi Sage,
>>> >
>>> > Still trying to understand what you did :-) I have one question below.
>>> >
>>> > On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> >> This is a longstanding bug,
>>> >>
>>> >>    http://tracker.ceph.com/issues/15653
>>> >>
>>> >> that causes low-weighted devices to get more data than they should. Loic's
>>> >> recent activity resurrected discussion on the original PR
>>> >>
>>> >>    https://github.com/ceph/ceph/pull/10218
>>> >>
>>> >> but since it's closed and almost nobody will see it I'm moving the
>>> >> discussion here.
>>> >>
>>> >> The main news is that I have a simple adjustment for the weights that
>>> >> works (almost perfectly) for the 2nd round of placements.  The solution is
>>> >> pretty simple, although as with most probabilities it tends to make my
>>> >> brain hurt.
>>> >>
>>> >> The idea is that, on the second round, the original weight for the small
>>> >> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>> >> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>> >> given b) is P(a && b) / P(b),
>>> >
>>> > For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>> >
>>> >>
>>> >>  P(pick small | first pick not small)
>>> >>  = P(pick small && first pick not small) / P(first pick not small)
>>> >>
>>> >> The last term is easy to calculate,
>>> >>
>>> >>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>> >>
>>> >> and the && term is the distribution we're trying to produce.
>>> >
>>> > https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>> >
>>> >> For example,
>>> >> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>> >> their second replica be the small OSD.  So
>>> >>
>>> >>  P(pick small && first pick not small) = small_weight / total_weight
>>> >>
>>> >> Putting those together,
>>> >>
>>> >>  P(pick small | first pick not small)
>>> >>  = P(pick small && first pick not small) / P(first pick not small)
>>> >>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>> >>  = small_weight / (total_weight - small_weight)
>>> >>
>>> >> That is, on the second round, we should adjust the weights by the above so
>>> >> that we get the right distribution of second choices.  It turns out it
>>> >> works to adjust *all* weights like this to get the conditional probability
>>> >> that they weren't already chosen.
>>> >>
>>> >> I have a branch that hacks this into straw2 and it appears to work
>>> >
>>> > This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>
>>> In
>>>
>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>
>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>
>>> I don't get why we need  "* bucketw" at the end ?
>>
>> It's just to keep the values within a reasonable range so that we don't
>> lose precision by dropping down into small integers.
>>
>> I futzed around with this some more last week trying to get the third
>> replica to work and ended up doubting that this piece is correct.  The
>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>> slightly from what I would expect from first principles and what I get out
>> of this derivation by about 1%, which would explain the bias I was seeing.
>>
>> I'm hoping we can find someone with a strong stats/probability background
>> and loads of free time who can tackle this...
>>
>
> I'm *not* that person, but I gave it a go last weekend and realized a
> few things:
>
> 1. We should add the additional constraint that for all PGs assigned
> to an OSD, 1/N of them must be primary replicas, 1/N must be
> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
> a 3 replica pool, the "small" OSD should still have the property that
> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>
> 2. I believe this is a case of the balls-into-bins problem -- we have
> colored balls and weighted bins. I didn't find a definition of the
> problem where the goal is to allow users to specify weights which must
> be respected after N rounds.
>
> 3. I wrote some quick python to simulate different reweighting
> algorithms. The solution is definitely not obvious - I often thought
> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
> changing the OSD weights to e.g. 3, 3, 1, 1 completely broke things.
> I can clean up and share that python if it can help.
>
> My gut feeling is that because CRUSH trees and rulesets can be
> arbitrarily complex, the most pragmatic & reliable way to solve this
> problem is to balance the PGs with a reweight-by-pg loop at crush
> compilation time. This is what admins should do now -- we should just
> automate it.
>
> Cheers, Dan
>
> P.S. -- maybe these guys can help: http://math.stackexchange.com/


* Re: crush multipick anomaly
  2017-02-03 15:08       ` Loic Dachary
@ 2017-02-03 18:54         ` Loic Dachary
  2017-02-06  3:08           ` Jaze Lee
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-03 18:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 02/03/2017 04:08 PM, Loic Dachary wrote:
> 
> 
> On 02/03/2017 03:47 PM, Sage Weil wrote:
>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> Still trying to understand what you did :-) I have one question below.
>>>>
>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>> This is a longstanding bug,
>>>>>
>>>>> 	http://tracker.ceph.com/issues/15653
>>>>>
>>>>> that causes low-weighted devices to get more data than they should. Loic's 
>>>>> recent activity resurrected discussion on the original PR
>>>>>
>>>>> 	https://github.com/ceph/ceph/pull/10218
>>>>>
>>>>> but since it's closed and almost nobody will see it I'm moving the 
>>>>> discussion here.
>>>>>
>>>>> The main news is that I have a simple adjustment for the weights that 
>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>>>>> pretty simple, although as with most probabilities it tends to make my 
>>>>> brain hurt.
>>>>>
>>>>> The idea is that, on the second round, the original weight for the small 
>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>>>>> given b) is P(a && b) / P(b),
>>>>
>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>
>>>>>
>>>>>  P(pick small | first pick not small)
>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>
>>>>> The last term is easy to calculate,
>>>>>
>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>
>>>>> and the && term is the distribution we're trying to produce.  
>>>>
>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>
>>>>> For example, 
>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>>>>> their second replica be the small OSD.  So
>>>>>
>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>
>>>>> Putting those together,
>>>>>
>>>>>  P(pick small | first pick not small)
>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>  = small_weight / (total_weight - small_weight)
>>>>>
>>>>> That is, on the second round, we should adjust the weights by the above so 
>>>>> that we get the right distribution of second choices.  It turns out it 
>>>>> works to adjust *all* weights like this to get the conditional probability 
>>>>> that they weren't already chosen.
>>>>>
>>>>> I have a branch that hacks this into straw2 and it appears to work 
>>>>
>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>
>>> In
>>>
>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>
>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>
>>> I don't get why we need  "* bucketw" at the end ?
>>
>> It's just to keep the values within a reasonable range so that we don't 
>> lose precision by dropping down into small integers.
>>
>> I futzed around with this some more last week trying to get the third 
>> replica to work and ended up doubting that this piece is correct.  The 
>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies 
>> slightly from what I would expect from first principles and what I get out 
>> of this derivation by about 1%, which would explain the bias I was seeing.
>>
>> I'm hoping we can find someone with a strong stats/probability background 
>> and loads of free time who can tackle this...
>>
> 
> It would help to formulate the problem into a self-contained puzzle to present to a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)

Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.

We have D bins and each bin BIN can hold BIN(B) balls. All balls have the
same size. There are exactly X balls of each color. Each ball must
be placed in a bin that does not already contain a ball of the same
color.

What distribution guarantees that, for all X, the bins are filled in
the same proportion?

Details
=======

* One placement: there is only one ball of each color (X = 1) and we place each of them
  in a bin with a probability of:

    P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])

  so that bins are equally filled regardless of their capacity.

* Two placements: for each ball there is exactly one other ball of the
  same color.  A ball is placed as in experiment 1 and the chosen bin
  is set aside. The other ball of the same color is placed as in
  experiment 1 with the remaining bins. The probability for a ball
  to be placed in a given BIN is:

    P(BIN) + P(all bins but BIN | BIN)
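
The two experiments, written out as a toy Python simulation (the
capacities and the number of placements match the first example below):

import random
from collections import Counter

def simulate(capacity, X, colors, seed=0):
    # place `colors` colors of X balls each: every ball of a color goes to a
    # different bin, each pick weighted by the capacity of the remaining bins
    rnd = random.Random(seed)
    filled = Counter()
    for _ in range(colors):
        remaining = list(capacity)
        for _ in range(X):
            pick = rnd.choices(remaining, weights=[capacity[b] for b in remaining])[0]
            filled[pick] += 1
            remaining.remove(pick)
    # fraction of each bin's capacity that is used
    return {b: round(filled[b] / float(capacity[b]), 4) for b in capacity}

capacity = {'a': 10000000, 'b': 10000000, 'c': 10000000,
            'd': 10000000, 'e': 1000000}
print(simulate(capacity, X=1, colors=1000000))   # fill ratios are all about equal
print(simulate(capacity, X=2, colors=1000000))   # bin e fills noticeably faster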

Examples
========

For instance we have 5 bins, a, b, c, d, e and they can hold:

a = 10 million balls
b = 10 million balls
c = 10 million balls
d = 10 million balls
e =  1 million balls

In the first experiment we place each ball in

a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
same for b, c, d
e with a probability of 1 / 41

after 1,000,000 placements, the bins have

a = 243456
b = 243624
c = 244486
d = 243881
e = 24553

they are

a = 2.43 % full
b = 2.43 % full
c = 2.44 % full
d = 2.43 % full
e = 2.45 % full

In the second experiment


>> sage
>>
>>
>>>
>>>>
>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>>>>> current code, you get
>>>>>
>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>   device 6:             929148  [400572,528576]
>>>>>
>>>>> which is very close for the first replica (primary), but way off for the 
>>>>> second.  With my hacky change,
>>>>>
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>   device 6:             804566  [400572,403994]
>>>>>
>>>>> which is quite close, but still skewing slightly high (by a bit less than 
>>>>> 1%).
>>>>>
>>>>> Next steps:
>>>>>
>>>>> 1- generalize this for >2 replicas
>>>>> 2- figure out why it skews high
>>>>> 3- make this work for multi-level hierarchical descent
>>>>>
>>>>> sage
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush multipick anomaly
  2017-02-03 18:54         ` Loic Dachary
@ 2017-02-06  3:08           ` Jaze Lee
  2017-02-06  8:18             ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Jaze Lee @ 2017-02-06  3:08 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, ceph-devel

It is more complicated than I expected...
I viewed http://tracker.ceph.com/issues/15653, and understand that if the
replica number is bigger than the number of hosts we choose from, we may
hit the problem.

That is, if we have
host: a b c d
host: e f g h
host: i j k l

and we only choose one OSD from each host for three replicas, the
distribution is as expected? Right?


The problem described in http://tracker.ceph.com/issues/15653 may happen
when
1)
  host: a b c d e f g

and we choose all three replicas from this host. But this rarely happens
in production. Right?


Maybe I do not understand the problem correctly?











2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>
>
> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>
>>
>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>>   http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>>   https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>> For the record, this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce.
>>>>>
>>>>> https://en.wikipedia.org/wiki/Conditional_probability describes A && B (using a non-ASCII symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere?
>>>>>
>>>>>> For example,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD.  So
>>>>>>
>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> That is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>> works to adjust *all* weights like this to get the conditional probability
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>
>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>
>>>> In
>>>>
>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>
>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>
>>>> I don't get why we need  "* bucketw" at the end ?
>>>
>>> It's just to keep the values within a reasonable range so that we don't
>>> lose precision by dropping down into small integers.
>>>
>>> I futzed around with this some more last week trying to get the third
>>> replica to work and ended up doubting that this piece is correct.  The
>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>> slightly from what I would expect from first principles and what I get out
>>> of this derivation by about 1%, which would explain the bias I was seeing.
>>>
>>> I'm hoping we can find someone with a strong stats/probability background
>>> and loads of free time who can tackle this...
>>>
>>
>> It would help to formulate the problem into a self-contained puzzle to present to a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>
> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>
> We have D bins and each bin BIN can hold BIN(B) balls. All balls have the
> same size. There are exactly X balls of each color. Each ball must
> be placed in a bin that does not already contain a ball of the same
> color.
>
> What distribution guarantees that, for all X, the bins are filled in
> the same proportion?
>
> Details
> =======
>
> * One placement: all balls are the same color and we place each of them
>   in a bin with a probability of:
>
>     P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>
>   so that bins are equally filled regardless of their capacity.
>
> * Two placements: for each ball there is exactly one other ball of the
>   same color.  A ball is placed as in experience 1 and the chosen bin
>   is set aside. The other ball of the same color is placed as in
>   experience 1 with the remaining bins. The probability for a ball
>   to be placed in a given BIN is:
>
>     P(BIN) + P(all bins but BIN | BIN)
>
> Examples
> ========
>
> For instance we have 5 bins, a, b, c, d, e and they can hold:
>
> a = 10 million balls
> b = 10 million balls
> c = 10 million balls
> d = 10 million balls
> e =  1 million balls
>
> In the first experience with place each ball in
>
> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
> same for b, c, d
> e with a probability of 1 / 41
>
> after 100,000 placements, the bins have
>
> a = 243456
> b = 243624
> c = 244486
> d = 243881
> e = 24553
>
> they are
>
> a = 2.43 % full
> b = 2.43 % full
> c = 2.44 % full
> d = 2.43 % full
> e = 0.24 % full
>
> In the second experience
>
>
>>> sage
>>>
>>>
>>>>
>>>>>
>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>> current code, you get
>>>>>>
>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>   device 6:             929148  [400572,528576]
>>>>>>
>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>> second.  With my hacky change,
>>>>>>
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>   device 6:             804566  [400572,403994]
>>>>>>
>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>> 1%).
>>>>>>
>>>>>> Next steps:
>>>>>>
>>>>>> 1- generalize this for >2 replicas
>>>>>> 2- figure out why it skews high
>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre



-- 
谦谦君子

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-06  3:08           ` Jaze Lee
@ 2017-02-06  8:18             ` Loic Dachary
  2017-02-06 14:11               ` Jaze Lee
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-06  8:18 UTC (permalink / raw)
  To: Jaze Lee; +Cc: ceph-devel

Hi,

On 02/06/2017 04:08 AM, Jaze Lee wrote:
> It is more complicated than i have expected.....
> I viewed http://tracker.ceph.com/issues/15653, and know that if the
> replica number is
> bigger than the host we choose, we may meet the problem.
> 
> That is
> if we have
> host: a b c d
> host: e f  g h
> host: i  j  k  l
> 
> we only choose one from each host for replica three, and the distribution
> is as we expected?    Right ?
> 
> 
> The problem described in http://tracker.ceph.com/issues/15653, may happen
> when
> 1)
>   host: a b c d e f g
> 
> and we choose all three replica from this host. But this is few happen
> in production. Right?
> 
> 
> May be i do not understand the problem correctly ?

The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script

https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
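
For reference, the heart of such a simulation is small enough to sketch inline. This is only my own rough approximation of what Dan's script does (with the 3, 3, 3, 1 toy weights used elsewhere in the thread), not the gist itself: draw distinct OSDs per PG, weighted and without replacement, and count how often each OSD shows up in each replica position.

import random

weights = {0: 3, 1: 3, 2: 3, 3: 1}       # toy weights, as in Dan's example
num_rep = 3
pgs = 100000
counts = [{osd: 0 for osd in weights} for _ in range(num_rep)]

for _ in range(pgs):
    remaining = dict(weights)
    for r in range(num_rep):             # one pick per replica "round"
        osds = list(remaining)
        w = [remaining[o] for o in osds]
        pick = random.choices(osds, weights=w)[0]
        counts[r][pick] += 1
        del remaining[pick]              # a device cannot hold two replicas of the same PG

print(counts)

The small OSD (id 3) gets close to 10% of the first picks but a noticeably larger share of the second and third picks, which is the anomaly we are discussing.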

Cheers


> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>
>>
>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>
>>>
>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>
>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>> This is a longstanding bug,
>>>>>>>
>>>>>>>   http://tracker.ceph.com/issues/15653
>>>>>>>
>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>
>>>>>>>   https://github.com/ceph/ceph/pull/10218
>>>>>>>
>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>> discussion here.
>>>>>>>
>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>> brain hurt.
>>>>>>>
>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>> From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>
>>>>>>>
>>>>>>>  P(pick small | first pick not small)
>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>
>>>>>>> The last term is easy to calculate,
>>>>>>>
>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>
>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>
>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>>
>>>>>>> For exmaple,
>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>> their second replica be the small OSD.  So
>>>>>>>
>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>
>>>>>>> Putting those together,
>>>>>>>
>>>>>>>  P(pick small | first pick not small)
>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>
>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>> that they weren't already chosen.
>>>>>>>
>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>
>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>
>>>>> In
>>>>>
>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>
>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>
>>>>> I don't get why we need  "* bucketw" at the end ?
>>>>
>>>> It's just to keep the values within a reasonable range so that we don't
>>>> lose precision by dropping down into small integers.
>>>>
>>>> I futzed around with this some more last week trying to get the third
>>>> replica to work and ended up doubting that this piece is correct.  The
>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>> slightly from what I would expect from first principles and what I get out
>>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>>
>>>> I'm hoping we can find someone with a strong stats/probability background
>>>> and loads of free time who can tackle this...
>>>>
>>>
>>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>
>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>>
>> We have D bins and each bin can hold D(B) balls. All balls have the
>> same size. There is exactly X balls of the same color. Each ball must
>> be placed in a bin that does not already contain a ball of the same
>> color.
>>
>> What distribution guarantees that, for all X, the bins are filled in
>> the same proportion ?
>>
>> Details
>> =======
>>
>> * One placement: all balls are the same color and we place each of them
>>   in a bin with a probability of:
>>
>>     P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>
>>   so that bins are equally filled regardless of their capacity.
>>
>> * Two placements: for each ball there is exactly one other ball of the
>>   same color.  A ball is placed as in experience 1 and the chosen bin
>>   is set aside. The other ball of the same color is placed as in
>>   experience 1 with the remaining bins. The probability for a ball
>>   to be placed in a given BIN is:
>>
>>     P(BIN) + P(all bins but BIN | BIN)
>>
>> Examples
>> ========
>>
>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>
>> a = 10 million balls
>> b = 10 million balls
>> c = 10 million balls
>> d = 10 million balls
>> e =  1 million balls
>>
>> In the first experience with place each ball in
>>
>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>> same for b, c, d
>> e with a probability of 1 / 41
>>
>> after 100,000 placements, the bins have
>>
>> a = 243456
>> b = 243624
>> c = 244486
>> d = 243881
>> e = 24553
>>
>> they are
>>
>> a = 2.43 % full
>> b = 2.43 % full
>> c = 2.44 % full
>> d = 2.43 % full
>> e = 0.24 % full
>>
>> In the second experience
>>
>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>> current code, you get
>>>>>>>
>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>
>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>> second.  With my hacky change,
>>>>>>>
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>
>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>> 1%).
>>>>>>>
>>>>>>> Next steps:
>>>>>>>
>>>>>>> 1- generalize this for >2 replicas
>>>>>>> 2- figure out why it skews high
>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-03 17:37         ` Dan van der Ster
@ 2017-02-06  8:31           ` Loic Dachary
  2017-02-06  9:13             ` Dan van der Ster
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-06  8:31 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel, Szymon Datko, Tomasz Kuzemko

Hi Dan,

Your script turns out to be a nice self contained problem statement :-) Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the way Szymon described how to calculate P(E|A) using a probability tree (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
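
To make sure I follow the tree, here is how I would write the calculation down for the second pick, with the same 3, 3, 3, 1 weights as your script (my own sketch, not a transcription of Szymon's drawing): branch on the first pick y, then draw the second pick x from the remaining weight T - w_y.

weights = {0: 3.0, 1: 3.0, 2: 3.0, 3: 1.0}
T = sum(weights.values())

for x, wx in weights.items():
    # P(second pick = x) = sum over first picks y != x of (w_y / T) * (w_x / (T - w_y))
    p = sum((wy / T) * (wx / (T - wy)) for y, wy in weights.items() if y != x)
    print("osd.%d  target %.3f  second pick %.3f" % (x, wx / T, p))

osd.3 comes out around 0.129 instead of the 0.100 target, which matches the second replica counts in your output below.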

Cheers

On 02/03/2017 06:37 PM, Dan van der Ster wrote:
> Anyway, here's my simple simulation. It might be helpful for testing
> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
> 
> Below is the output using the P(pick small | first pick not small)
> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
> seems to *almost* work, but only when we have just one small OSD.
> 
> See the end of the script for other various ideas.
> 
> -- Dan
> 
>> python mpa.py
> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
> 
> Expected PGs per OSD:       {0: 90000, 1: 90000, 2: 90000, 3: 30000}
> 
> Simulating with existing CRUSH
> 
> Observed:                   {0: 85944, 1: 85810, 2: 85984, 3: 42262}
> Observed for Nth replica:   [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
> 26882, 3: 19455}]
> 
> Now trying your new algorithm
> 
> Observed:                   {0: 89423, 1: 89443, 2: 89476, 3: 31658}
> Observed for Nth replica:   [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
> 29875, 3: 11394}]
> 
> 
> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>>    http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>>    https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>> From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce.
>>>>>
>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>
>>>>>> For exmaple,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD.  So
>>>>>>
>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>
>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>
>>>> In
>>>>
>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>
>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>
>>>> I don't get why we need  "* bucketw" at the end ?
>>>
>>> It's just to keep the values within a reasonable range so that we don't
>>> lose precision by dropping down into small integers.
>>>
>>> I futzed around with this some more last week trying to get the third
>>> replica to work and ended up doubting that this piece is correct.  The
>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>> slightly from what I would expect from first principles and what I get out
>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>
>>> I'm hoping we can find someone with a strong stats/probability background
>>> and loads of free time who can tackle this...
>>>
>>
>> I'm *not* that person, but I gave it a go last weekend and realized a
>> few things:
>>
>> 1. We should add the additional constraint that for all PGs assigned
>> to an OSD, 1/N of them must be primary replicas, 1/N must be
>> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
>> a 3 replica pool, the "small" OSD should still have the property that
>> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>>
>> 2. I believe this is a case of the balls-into-bins problem -- we have
>> colored balls and weighted bins. I didn't find a definition of the
>> problem where the goal is to allow users to specify weights which must
>> be respected after N rounds.
>>
>> 3. I wrote some quick python to simulate different reweighting
>> algorithms. The solution is definitely not obvious - I often thought
>> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
>> changing the OSDs weights to e.g. 3, 3, 1, 1 completely broke things.
>> I can clean up and share that python if it's can help.
>>
>> My gut feeling is that because CRUSH trees and rulesets can be
>> arbitrarily complex, the most pragmatic & reliable way to solve this
>> problem is to balance the PGs with a reweight-by-pg loop at crush
>> compilation time. This is what admins should do now -- we should just
>> automate it.
>>
>> Cheers, Dan
>>
>> P.S. -- maybe these guys can help: http://math.stackexchange.com/
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-06  8:31           ` Loic Dachary
@ 2017-02-06  9:13             ` Dan van der Ster
  2017-02-06 16:53               ` Dan van der Ster
  0 siblings, 1 reply; 70+ messages in thread
From: Dan van der Ster @ 2017-02-06  9:13 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel, Szymon Datko, Tomasz Kuzemko

Hi Loic,

Here's my current understanding of the problem. (Below I work with the
example having four OSDs with weights 3, 3, 3, 1, respectively).

I'm elaborating on the observation that for every replication "round",
the PG ratios for each and every OSD must be equal to the "target" or
goal weight of that OSD. So, for an OSD that should get 10% of PGs,
that OSD gets 10% in round 1, 10% in round 2, etc... But we need to
multiply each of these ratios by the probability that this OSD is
still available in Round r.

Hence I believe we have this loop invariant:

   P(OSD.x still available in Round r) * (Weight of OSD.x in Round r)
/ (Total sum of all weights in Round r) == (Original "target" Weight
of OSD.x) / (Total sum of all target weights)

I simplify all these terms:
  P(OSD.x still available for Round r) = P_x_r
  Weight of OSD.x in Round r = W_x_r
  Total sum of all weights in Round r = T_r
  Original "target" Weight of OSD.x = W_x
  Total sum of all target weights = T

So rewriting the equation, we have:

  P_x_r * W_x_r / T_r == W_x / T

We then calculate the needed weight of OSD.x in Round r. W_x_r is what
we're trying to solve for!!

  W_x_r = W_x / T  *  T_r / P_x_r

The first term W_x / T is a constant and easy to compute. (For my
example small OSD, W_x / T = 0.1)

P_x_r is also -- I believe -- simple to compute. P_x_r gets smaller
for each round and is a function of what happened in the previous
round:

  Round 1: P_x_1 = 1.0
  Round 2: P_x_2 = P_x_1 * (1 - W_x_1 / T_1)
  Round 3: P_x_3 = P_x_2 * (1 - W_x_2 / T_2)
  ...

But T_r is a challenge -- T_r is the sum of W_x_r for all x in round
r. Hence, the problem is that we don't know T_r until *after* we
compute all W_x_r's for that round. I tried various ways to estimate
T_r but didn't make any progress.
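
For what it's worth, here is the round-2 case written out numerically for the 3, 3, 3, 1 weights -- just a sketch of the formulas above, taking the round-1 weights to be the target weights. Since T_2 only sets the overall scale, what the formula really pins down is the ratio W_x_2 / T_2 = (W_x / T) / P_x_2, which (if I have this right) simplifies to W_x / (T - W_x), i.e. the same per-round adjustment as the wip-crush-multipick hack.

W = [3.0, 3.0, 3.0, 1.0]
T = sum(W)

P2 = [1.0 - w / T for w in W]                  # P_x_2 = P_x_1 * (1 - W_x_1 / T_1)
ratio = [(w / T) / p for w, p in zip(W, P2)]   # W_x_2 / T_2 == W_x / (T - W_x)

print("P_x_2:      ", [round(p, 3) for p in P2])     # 0.7, 0.7, 0.7, 0.9
print("W_x_2 / T_2:", [round(r, 4) for r in ratio])  # 0.4286, 0.4286, 0.4286, 0.1111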

Do you think this formulation is correct? Any clever ideas where to go next?

Cheers, Dan




On Mon, Feb 6, 2017 at 9:31 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi Dan,
>
> Your script turns out to be a nice self contained problem statement :-) Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the way Szymon described how to calculate P(E|A) using a probability tree (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
>
> Cheers
>
> On 02/03/2017 06:37 PM, Dan van der Ster wrote:
>> Anyway, here's my simple simulation. It might be helpful for testing
>> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>
>> Below is the output using the P(pick small | first pick not small)
>> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
>> seems to *almost* work, but only when we have just one small OSD.
>>
>> See the end of the script for other various ideas.
>>
>> -- Dan
>>
>>> python mpa.py
>> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
>>
>> Expected PGs per OSD:       {0: 90000, 1: 90000, 2: 90000, 3: 30000}
>>
>> Simulating with existing CRUSH
>>
>> Observed:                   {0: 85944, 1: 85810, 2: 85984, 3: 42262}
>> Observed for Nth replica:   [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
>> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
>> 26882, 3: 19455}]
>>
>> Now trying your new algorithm
>>
>> Observed:                   {0: 89423, 1: 89443, 2: 89476, 3: 31658}
>> Observed for Nth replica:   [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
>> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
>> 29875, 3: 11394}]
>>
>>
>> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
>>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>
>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>> This is a longstanding bug,
>>>>>>>
>>>>>>>    http://tracker.ceph.com/issues/15653
>>>>>>>
>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>
>>>>>>>    https://github.com/ceph/ceph/pull/10218
>>>>>>>
>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>> discussion here.
>>>>>>>
>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>> brain hurt.
>>>>>>>
>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>> From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>
>>>>>>>
>>>>>>>  P(pick small | first pick not small)
>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>
>>>>>>> The last term is easy to calculate,
>>>>>>>
>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>
>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>
>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>>
>>>>>>> For exmaple,
>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>> their second replica be the small OSD.  So
>>>>>>>
>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>
>>>>>>> Putting those together,
>>>>>>>
>>>>>>>  P(pick small | first pick not small)
>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>
>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>> that they weren't already chosen.
>>>>>>>
>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>
>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>
>>>>> In
>>>>>
>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>
>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>
>>>>> I don't get why we need  "* bucketw" at the end ?
>>>>
>>>> It's just to keep the values within a reasonable range so that we don't
>>>> lose precision by dropping down into small integers.
>>>>
>>>> I futzed around with this some more last week trying to get the third
>>>> replica to work and ended up doubting that this piece is correct.  The
>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>> slightly from what I would expect from first principles and what I get out
>>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>>
>>>> I'm hoping we can find someone with a strong stats/probability background
>>>> and loads of free time who can tackle this...
>>>>
>>>
>>> I'm *not* that person, but I gave it a go last weekend and realized a
>>> few things:
>>>
>>> 1. We should add the additional constraint that for all PGs assigned
>>> to an OSD, 1/N of them must be primary replicas, 1/N must be
>>> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
>>> a 3 replica pool, the "small" OSD should still have the property that
>>> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>>>
>>> 2. I believe this is a case of the balls-into-bins problem -- we have
>>> colored balls and weighted bins. I didn't find a definition of the
>>> problem where the goal is to allow users to specify weights which must
>>> be respected after N rounds.
>>>
>>> 3. I wrote some quick python to simulate different reweighting
>>> algorithms. The solution is definitely not obvious - I often thought
>>> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
>>> changing the OSDs weights to e.g. 3, 3, 1, 1 completely broke things.
>>> I can clean up and share that python if it's can help.
>>>
>>> My gut feeling is that because CRUSH trees and rulesets can be
>>> arbitrarily complex, the most pragmatic & reliable way to solve this
>>> problem is to balance the PGs with a reweight-by-pg loop at crush
>>> compilation time. This is what admins should do now -- we should just
>>> automate it.
>>>
>>> Cheers, Dan
>>>
>>> P.S. -- maybe these guys can help: http://math.stackexchange.com/
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-06  8:18             ` Loic Dachary
@ 2017-02-06 14:11               ` Jaze Lee
  2017-02-06 17:07                 ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Jaze Lee @ 2017-02-06 14:11 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

2017-02-06 16:18 GMT+08:00 Loic Dachary <loic@dachary.org>:
> Hi,
>
> On 02/06/2017 04:08 AM, Jaze Lee wrote:
>> It is more complicated than i have expected.....
>> I viewed http://tracker.ceph.com/issues/15653, and know that if the
>> replica number is
>> bigger than the host we choose, we may meet the problem.
>>
>> That is
>> if we have
>> host: a b c d
>> host: e f  g h
>> host: i  j  k  l
>>
>> we only choose one from each host for replica three, and the distribution
>> is as we expected?    Right ?
>>
>>
>> The problem described in http://tracker.ceph.com/issues/15653, may happen
>> when
>> 1)
>>   host: a b c d e f g
>>
>> and we choose all three replica from this host. But this is few happen
>> in production. Right?
>>
>>
>> May be i do not understand the problem correctly ?
>
> The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script

Yes, what I mean is: why would we choose three replicas from one host? In production
the number of hosts is always larger than the replica count.

root
   rack-0
      host A
      host B
   rack-1
      host C
      host D
   rack-2
       host E
       host F

When placing pg 1.1, CRUSH will always choose one OSD from rack-0, one
from rack-1 and one from rack-2; every pg gets exactly one OSD from each rack.

The problem only happens when we want to choose more than one OSD from
the same bucket for a pg, right?



>
> https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>
> Cheers
>
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>>
>>>
>>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>>   http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>>   https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>>>
>>>>>>>> For exmaple,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>
>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>
>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>
>>>>>> In
>>>>>>
>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>
>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>
>>>>>> I don't get why we need  "* bucketw" at the end ?
>>>>>
>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>> lose precision by dropping down into small integers.
>>>>>
>>>>> I futzed around with this some more last week trying to get the third
>>>>> replica to work and ended up doubting that this piece is correct.  The
>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>> slightly from what I would expect from first principles and what I get out
>>>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>>>
>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>> and loads of free time who can tackle this...
>>>>>
>>>>
>>>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>>
>>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>>>
>>> We have D bins and each bin can hold D(B) balls. All balls have the
>>> same size. There is exactly X balls of the same color. Each ball must
>>> be placed in a bin that does not already contain a ball of the same
>>> color.
>>>
>>> What distribution guarantees that, for all X, the bins are filled in
>>> the same proportion ?
>>>
>>> Details
>>> =======
>>>
>>> * One placement: all balls are the same color and we place each of them
>>>   in a bin with a probability of:
>>>
>>>     P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>>
>>>   so that bins are equally filled regardless of their capacity.
>>>
>>> * Two placements: for each ball there is exactly one other ball of the
>>>   same color.  A ball is placed as in experience 1 and the chosen bin
>>>   is set aside. The other ball of the same color is placed as in
>>>   experience 1 with the remaining bins. The probability for a ball
>>>   to be placed in a given BIN is:
>>>
>>>     P(BIN) + P(all bins but BIN | BIN)
>>>
>>> Examples
>>> ========
>>>
>>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>>
>>> a = 10 million balls
>>> b = 10 million balls
>>> c = 10 million balls
>>> d = 10 million balls
>>> e =  1 million balls
>>>
>>> In the first experience with place each ball in
>>>
>>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>>> same for b, c, d
>>> e with a probability of 1 / 41
>>>
>>> after 100,000 placements, the bins have
>>>
>>> a = 243456
>>> b = 243624
>>> c = 244486
>>> d = 243881
>>> e = 24553
>>>
>>> they are
>>>
>>> a = 2.43 % full
>>> b = 2.43 % full
>>> c = 2.44 % full
>>> d = 2.43 % full
>>> e = 0.24 % full
>>>
>>> In the second experience
>>>
>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>>> current code, you get
>>>>>>>>
>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>
>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>> second.  With my hacky change,
>>>>>>>>
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>
>>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>>> 1%).
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>
>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>> 2- figure out why it skews high
>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre



-- 
谦谦君子

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-06  9:13             ` Dan van der Ster
@ 2017-02-06 16:53               ` Dan van der Ster
  0 siblings, 0 replies; 70+ messages in thread
From: Dan van der Ster @ 2017-02-06 16:53 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel, Szymon Datko, Tomasz Kuzemko

On Mon, Feb 6, 2017 at 10:13 AM, Dan van der Ster <dan@vanderster.com> wrote:
> Hi Loic,
>
> Here's my current understanding of the problem. (Below I work with the
> example having four OSDs with weights 3, 3, 3, 1, respectively).
>
> I'm elaborating on the observation that for every replication "round",
> the PG ratios for each and every OSD must be equal to the "target" or
> goal weight of that OSD. So, for an OSD that should get 10% of PGs,
> that OSD gets 10% in round 1, 10% in round 2, etc... But we need to
> multiply each of these ratios by the probability that this OSD is
> still available in Round r.
>
> Hence I believe we have this loop invariant:
>
>    P(OSD.x still available in Round r) * (Weight of OSD.x in Round r)
> / (Total sum of all weights in Round r) == (Original "target" Weight
> of OSD.x) / (Total sum of all target weights)
>
> I simplify all these terms:
>   P(OSD.x still available for Round r) = P_x_r
>   Weight of OSD.x in Round r = W_x_r
>   Total sum of all weights in Round r = T_r
>   Original "target" Weight of OSD.x = W_x
>   Total sum of all target weights = T
>
> So rewriting the equation, we have:
>
>   P_x_r * W_x_r / T_r == W_x / T
>
> We then calculate the needed weight of OSD.x in Round r. W_x_r is what
> we're trying to solve for!!
>
>   W_x_r = W_x / T  *  T_r / P_x_r
>
> The first term W_x / T is a constant and easy to compute. (For my
> example small OSD, W_x / T = 0.1)
>
> P_x_r is also -- I believe -- simple to compute. P_x_r gets smaller
> for each round and is a function of what happened in the previous
> round:
>
>   Round 1: P_x_1 = 1.0
>   Round 2: P_x_2 = P_x_1 * (1 - W_x_1 / T_1)
>   Round 3: P_x_3 = P_x_2 * (1 - W_x_2 / T_2)
>   ...
>
> But T_r is a challenge -- T_r is the sum of W_x_r for all x in round
> r. Hence, the problem is that we don't know T_r until *after* we
> compute all W_x_r's for that round. I tried various ways to estimate
> T_r but didn't make any progress.
>
> Do you think this formulation is correct? Any clever ideas where to go next?
>

Something is wrong, because the system of equations that this gives is
unsolvable.

In round 2 for the 3,3,3,1 OSD set, assuming OSD.0 (one of the weight-3
OSDs) was chosen in the first round, we have:

P_1_2 = (1-3/10) = 0.7
P_2_2 = (1-3/10) = 0.7
P_3_2 = (1-1/10) = 0.9

And we know:

W_1 / T = 3/10 = 0.3
W_2 / T = 3/10 = 0.3
W_3 / T = 1/10 = 0.1

So we can describe the whole round:

W_1_2 = W_1 / T * T_2 / P_1_2 = 0.3 * T_2 / 0.7 = 0.4286 T_2
W_2_2 = W_2 / T * T_2 / P_2_2 = 0.3 * T_2 / 0.7 = 0.4286 T_2
W_3_2 = W_3 / T * T_2 / P_3_2 = 0.1 * T_2 / 0.9 = 0.1111 T_2
W_1_2 + W_2_2 + W_3_2 = T_2

Putting this all into a solver gives 0.9683 * T_2 = T_2, which is nonsense.
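
In case anyone wants to reproduce the arithmetic, here is the whole check in a few lines of python (same assumption as above: OSD.0 is already taken, so only OSDs 1, 2 and 3 are in play for round 2):

T = 10.0
remaining = {1: 3.0, 2: 3.0, 3: 1.0}    # target weights of the OSDs left after OSD.0

# W_x_2 = coeff[x] * T_2, with coeff[x] = (W_x / T) / P_x_2
coeff = {x: (w / T) / (1.0 - w / T) for x, w in remaining.items()}
print(coeff)                  # ~0.4286, 0.4286, 0.1111
print(sum(coeff.values()))    # ~0.9683, so W_1_2 + W_2_2 + W_3_2 = 0.9683 * T_2 != T_2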

-- Dan

> Cheers, Dan
>
>
>
>
> On Mon, Feb 6, 2017 at 9:31 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Dan,
>>
>> Your script turns out to be a nice self contained problem statement :-) Tomasz & Szymon discussed it today @ FOSDEM and I was enlightened by the way Szymon described how to calculate P(E|A) using a probability tree (see the picture at http://dachary.org/loic/crush-probability-schema.jpg).
>>
>> Cheers
>>
>> On 02/03/2017 06:37 PM, Dan van der Ster wrote:
>>> Anyway, here's my simple simulation. It might be helpful for testing
>>> ideas quickly: https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>>
>>> Below is the output using the P(pick small | first pick not small)
>>> observation, using OSDs having weights 3, 3, 3, & 1 respectively. It
>>> seems to *almost* work, but only when we have just one small OSD.
>>>
>>> See the end of the script for other various ideas.
>>>
>>> -- Dan
>>>
>>>> python mpa.py
>>> OSDs (id: weight): {0: 3, 1: 3, 2: 3, 3: 1}
>>>
>>> Expected PGs per OSD:       {0: 90000, 1: 90000, 2: 90000, 3: 30000}
>>>
>>> Simulating with existing CRUSH
>>>
>>> Observed:                   {0: 85944, 1: 85810, 2: 85984, 3: 42262}
>>> Observed for Nth replica:   [{0: 29936, 1: 30045, 2: 30061, 3: 9958},
>>> {0: 29037, 1: 29073, 2: 29041, 3: 12849}, {0: 26971, 1: 26692, 2:
>>> 26882, 3: 19455}]
>>>
>>> Now trying your new algorithm
>>>
>>> Observed:                   {0: 89423, 1: 89443, 2: 89476, 3: 31658}
>>> Observed for Nth replica:   [{0: 30103, 1: 30132, 2: 29805, 3: 9960},
>>> {0: 29936, 1: 29964, 2: 29796, 3: 10304}, {0: 29384, 1: 29347, 2:
>>> 29875, 3: 11394}]
>>>
>>>
>>> On Fri, Feb 3, 2017 at 4:26 PM, Dan van der Ster <dan@vanderster.com> wrote:
>>>> On Fri, Feb 3, 2017 at 3:47 PM, Sage Weil <sweil@redhat.com> wrote:
>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>>    http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>>    https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>>>
>>>>>>>> For exmaple,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>
>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>
>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>
>>>>>> In
>>>>>>
>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>
>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>
>>>>>> I don't get why we need  "* bucketw" at the end ?
>>>>>
>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>> lose precision by dropping down into small integers.
>>>>>
>>>>> I futzed around with this some more last week trying to get the third
>>>>> replica to work and ended up doubting that this piece is correct.  The
>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>> slightly from what I would expect from first principles and what I get out
>>>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>>>
>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>> and loads of free time who can tackle this...
>>>>>
>>>>
>>>> I'm *not* that person, but I gave it a go last weekend and realized a
>>>> few things:
>>>>
>>>> 1. We should add the additional constraint that for all PGs assigned
>>>> to an OSD, 1/N of them must be primary replicas, 1/N must be
>>>> secondary, 1/N must be tertiary, etc. for N replicas/stripes. E.g. for
>>>> a 3 replica pool, the "small" OSD should still have the property that
>>>> 1/3rd are primaries, 1/3rd are secondary, 1/3rd are tertiary.
>>>>
>>>> 2. I believe this is a case of the balls-into-bins problem -- we have
>>>> colored balls and weighted bins. I didn't find a definition of the
>>>> problem where the goal is to allow users to specify weights which must
>>>> be respected after N rounds.
>>>>
>>>> 3. I wrote some quick python to simulate different reweighting
>>>> algorithms. The solution is definitely not obvious - I often thought
>>>> I'd solved it (e.g. for simple OSD weight sets like 3, 3, 3, 1) - but
>>>> changing the OSDs weights to e.g. 3, 3, 1, 1 completely broke things.
>>>> I can clean up and share that python if it's can help.
>>>>
>>>> My gut feeling is that because CRUSH trees and rulesets can be
>>>> arbitrarily complex, the most pragmatic & reliable way to solve this
>>>> problem is to balance the PGs with a reweight-by-pg loop at crush
>>>> compilation time. This is what admins should do now -- we should just
>>>> automate it.
>>>>
>>>> Cheers, Dan
>>>>
>>>> P.S. -- maybe these guys can help: http://math.stackexchange.com/
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-06 14:11               ` Jaze Lee
@ 2017-02-06 17:07                 ` Loic Dachary
  0 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-06 17:07 UTC (permalink / raw)
  To: Jaze Lee; +Cc: ceph-devel



On 02/06/2017 03:11 PM, Jaze Lee wrote:
> 2017-02-06 16:18 GMT+08:00 Loic Dachary <loic@dachary.org>:
>> Hi,
>>
>> On 02/06/2017 04:08 AM, Jaze Lee wrote:
>>> It is more complicated than i have expected.....
>>> I viewed http://tracker.ceph.com/issues/15653, and know that if the
>>> replica number is
>>> bigger than the host we choose, we may meet the problem.
>>>
>>> That is
>>> if we have
>>> host: a b c d
>>> host: e f  g h
>>> host: i  j  k  l
>>>
>>> we only choose one from each host for replica three, and the distribution
>>> is as we expected?    Right ?
>>>
>>>
>>> The problem described in http://tracker.ceph.com/issues/15653, may happen
>>> when
>>> 1)
>>>   host: a b c d e f g
>>>
>>> and we choose all three replica from this host. But this is few happen
>>> in production. Right?
>>>
>>>
>>> May be i do not understand the problem correctly ?
>>
>> The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script
> 
> Yes, what I mean is: why would we choose three replicas from one host?

Because CRUSH should also work correctly in that specific case. And also because the problem shows up in every situation, not just in that specific one.

Cheers


> In production the number of hosts is always larger than the replica count.
> 
> root
>    rack-0
>       host A
>       host B
>    rack-1
>       host C
>       host D
>    rack-2
>        host E
>        host F
> 
> When placing pg 1.1, CRUSH will always choose one OSD from rack-0, one
> from rack-1 and one from rack-2; every pg gets exactly one OSD from each rack.
> 
> The problem only happens when we want to choose more than one OSD from
> the same bucket for a pg, right?
> 
> 
> 
>>
>> https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>>
>> Cheers
>>
>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>>>
>>>>
>>>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>>
>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>> This is a longstanding bug,
>>>>>>>>>
>>>>>>>>>   http://tracker.ceph.com/issues/15653
>>>>>>>>>
>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>
>>>>>>>>>   https://github.com/ceph/ceph/pull/10218
>>>>>>>>>
>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>> discussion here.
>>>>>>>>>
>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>> brain hurt.
>>>>>>>>>
>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>
>>>>>>>> From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>>
>>>>>>>>>
>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>
>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>
>>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>
>>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>>
>>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>>>>
>>>>>>>>> For exmaple,
>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>>
>>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>
>>>>>>>>> Putting those together,
>>>>>>>>>
>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>>
>>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>>> that they weren't already chosen.
>>>>>>>>>
>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>
>>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>>
>>>>>>> In
>>>>>>>
>>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>>
>>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>>
>>>>>>> I don't get why we need  "* bucketw" at the end ?
>>>>>>
>>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>>> lose precision by dropping down into small integers.
>>>>>>
>>>>>> I futzed around with this some more last week trying to get the third
>>>>>> replica to work and ended up doubting that this piece is correct.  The
>>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>>> slightly from what I would expect from first principles and what I get out
>>>>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>>>>
>>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>>> and loads of free time who can tackle this...
>>>>>>
>>>>>
>>>>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>>>
>>>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>>>>
>>>> We have D bins and each bin can hold D(B) balls. All balls have the
>>>> same size. There is exactly X balls of the same color. Each ball must
>>>> be placed in a bin that does not already contain a ball of the same
>>>> color.
>>>>
>>>> What distribution guarantees that, for all X, the bins are filled in
>>>> the same proportion ?
>>>>
>>>> Details
>>>> =======
>>>>
>>>> * One placement: all balls are the same color and we place each of them
>>>>   in a bin with a probability of:
>>>>
>>>>     P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>>>
>>>>   so that bins are equally filled regardless of their capacity.
>>>>
>>>> * Two placements: for each ball there is exactly one other ball of the
>>>>   same color.  A ball is placed as in experience 1 and the chosen bin
>>>>   is set aside. The other ball of the same color is placed as in
>>>>   experience 1 with the remaining bins. The probability for a ball
>>>>   to be placed in a given BIN is:
>>>>
>>>>     P(BIN) + P(all bins but BIN | BIN)
>>>>
>>>> Examples
>>>> ========
>>>>
>>>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>>>
>>>> a = 10 million balls
>>>> b = 10 million balls
>>>> c = 10 million balls
>>>> d = 10 million balls
>>>> e =  1 million balls
>>>>
>>>> In the first experience with place each ball in
>>>>
>>>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>>>> same for b, c, d
>>>> e with a probability of 1 / 41
>>>>
>>>> after 100,000 placements, the bins have
>>>>
>>>> a = 243456
>>>> b = 243624
>>>> c = 244486
>>>> d = 243881
>>>> e = 24553
>>>>
>>>> they are
>>>>
>>>> a = 2.43 % full
>>>> b = 2.43 % full
>>>> c = 2.44 % full
>>>> d = 2.43 % full
>>>> e = 0.24 % full
>>>>
>>>> In the second experience
>>>>
>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>>>> current code, you get
>>>>>>>>>
>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>>
>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>> second.  With my hacky change,
>>>>>>>>>
>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>>
>>>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>>>> 1%).
>>>>>>>>>
>>>>>>>>> Next steps:
>>>>>>>>>
>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>> 2- figure out why it skews high
>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-01-26  3:05 crush multipick anomaly Sage Weil
  2017-01-26 11:13 ` Loic Dachary
@ 2017-02-13 10:36 ` Loic Dachary
  2017-02-13 14:21   ` Sage Weil
  2017-02-13 14:53   ` Gregory Farnum
  1 sibling, 2 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 10:36 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi,

Dan van der Ster reached out to colleagues and friends, and Pedro López-Adeva Fernández-Layos came up with a well-written analysis of the problem and a tentative solution, which he describes at: https://github.com/plafl/notebooks/blob/master/replication.ipynb

Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of every disk in the cluster, and objects are likely to move everywhere. Am I mistaken?
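
As a minimal illustration of that concern (toy numbers of mine, not Pedro's notebook): if the probability of picking a disk is simply its weight divided by the total weight, then changing any single weight shifts the probability of every disk, for example:

#include <stdio.h>

/* Toy illustration, not Pedro's model: with p_i = w_i / total, adding a
 * disk changes p_i for every existing disk, not just the new one. */
int main(void) {
  double w[] = {10, 10, 10, 10, 1};          /* weights before */
  int n = sizeof(w) / sizeof(w[0]);
  double total = 0, added = 10;              /* weight of the new disk */
  for (int i = 0; i < n; i++)
    total += w[i];
  for (int i = 0; i < n; i++)
    printf("disk %d: before p=%.4f after p=%.4f\n",
           i, w[i] / total, w[i] / (total + added));
  printf("disk %d: before p=0.0000 after p=%.4f\n", n, added / (total + added));
  return 0;
}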

Cheers

On 01/26/2017 04:05 AM, Sage Weil wrote:
> This is a longstanding bug,
> 
> 	http://tracker.ceph.com/issues/15653
> 
> that causes low-weighted devices to get more data than they should. Loic's 
> recent activity resurrected discussion on the original PR
> 
> 	https://github.com/ceph/ceph/pull/10218
> 
> but since it's closed and almost nobody will see it I'm moving the 
> discussion here.
> 
> The main news is that I have a simple adjustment for the weights that 
> works (almost perfectly) for the 2nd round of placements.  The solution is 
> pretty simple, although as with most probabilities it tends to make my 
> brain hurt.
> 
> The idea is that, on the second round, the original weight for the small 
> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
> given b) is P(a && b) / P(b),
> 
>  P(pick small | first pick not small)
>  = P(pick small && first pick not small) / P(first pick not small)
> 
> The last term is easy to calculate,
> 
>  P(first pick not small) = (total_weight - small_weight) / total_weight
> 
> and the && term is the distribution we're trying to produce.  For exmaple, 
> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
> their second replica be the small OSD.  So
> 
>  P(pick small && first pick not small) = small_weight / total_weight
> 
> Putting those together,
> 
>  P(pick small | first pick not small)
>  = P(pick small && first pick not small) / P(first pick not small)
>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>  = small_weight / (total_weight - small_weight)
> 
> This is, on the second round, we should adjust the weights by the above so 
> that we get the right distribution of second choices.  It turns out it 
> works to adjust *all* weights like this to get hte conditional probability 
> that they weren't already chosen.
> 
> I have a branch that hacks this into straw2 and it appears to work 
> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
> current code, you get
> 
> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> rule 0 (data), x = 0..40000000, numrep = 2..2
> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>   device 0:             19765965        [9899364,9866601]
>   device 1:             19768033        [9899444,9868589]
>   device 2:             19769938        [9901770,9868168]
>   device 3:             19766918        [9898851,9868067]
>   device 6:             929148  [400572,528576]
> 
> which is very close for the first replica (primary), but way off for the 
> second.  With my hacky change,
> 
> rule 0 (data), x = 0..40000000, numrep = 2..2
> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>   device 0:             19797315        [9899364,9897951]
>   device 1:             19799199        [9899444,9899755]
>   device 2:             19801016        [9901770,9899246]
>   device 3:             19797906        [9898851,9899055]
>   device 6:             804566  [400572,403994]
> 
> which is quite close, but still skewing slightly high (by a big less than 
> 1%).
> 
> Next steps:
> 
> 1- generalize this for >2 replicas
> 2- figure out why it skews high
> 3- make this work for multi-level hierarchical descent
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 10:36 ` Loic Dachary
@ 2017-02-13 14:21   ` Sage Weil
  2017-02-13 18:50     ` Loic Dachary
  2017-02-16 22:04     ` Pedro López-Adeva
  2017-02-13 14:53   ` Gregory Farnum
  1 sibling, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-13 14:21 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5293 bytes --]

On Mon, 13 Feb 2017, Loic Dachary wrote:
> Hi,
> 
> Dan van der Ster reached out to colleagues and friends and Pedro 
> López-Adeva Fernández-Layos came up with a well written analysis of the 
> problem and a tentative solution which he described at : 
> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> 
> Unless I'm reading the document incorrectly (very possible ;) it also 
> means that the probability of each disk needs to take in account the 
> weight of all disks. Which means that whenever a disk is added / removed 
> or its weight is changed, this has an impact on the probability of all 
> disks in the cluster and objects are likely to move everywhere. Am I 
> mistaken ?

Maybe (I haven't looked closely at the above yet).  But for comparison, in 
the normal straw2 case, adding or removing a disk also changes the 
probabilities for everything else (e.g., removing one out of 10 identical 
disks changes the probability from 1/10 to 1/9).  The key property that 
straw2 *is* able to handle is that as long as the relative probabilities 
between two unmodified disks do not change, straw2 will avoid 
moving any objects between them (i.e., all data movement is to or from 
the disk that is reweighted).

sage


> 
> Cheers
> 
> On 01/26/2017 04:05 AM, Sage Weil wrote:
> > This is a longstanding bug,
> > 
> > 	http://tracker.ceph.com/issues/15653
> > 
> > that causes low-weighted devices to get more data than they should. Loic's 
> > recent activity resurrected discussion on the original PR
> > 
> > 	https://github.com/ceph/ceph/pull/10218
> > 
> > but since it's closed and almost nobody will see it I'm moving the 
> > discussion here.
> > 
> > The main news is that I have a simple adjustment for the weights that 
> > works (almost perfectly) for the 2nd round of placements.  The solution is 
> > pretty simple, although as with most probabilities it tends to make my 
> > brain hurt.
> > 
> > The idea is that, on the second round, the original weight for the small 
> > OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
> > P(pick small | first pick not small).  Since P(a|b) (the probability of a 
> > given b) is P(a && b) / P(b),
> > 
> >  P(pick small | first pick not small)
> >  = P(pick small && first pick not small) / P(first pick not small)
> > 
> > The last term is easy to calculate,
> > 
> >  P(first pick not small) = (total_weight - small_weight) / total_weight
> > 
> > and the && term is the distribution we're trying to produce.  For exmaple, 
> > if small has 1/10 the weight, then we should see 1/10th of the PGs have 
> > their second replica be the small OSD.  So
> > 
> >  P(pick small && first pick not small) = small_weight / total_weight
> > 
> > Putting those together,
> > 
> >  P(pick small | first pick not small)
> >  = P(pick small && first pick not small) / P(first pick not small)
> >  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >  = small_weight / (total_weight - small_weight)
> > 
> > This is, on the second round, we should adjust the weights by the above so 
> > that we get the right distribution of second choices.  It turns out it 
> > works to adjust *all* weights like this to get hte conditional probability 
> > that they weren't already chosen.
> > 
> > I have a branch that hacks this into straw2 and it appears to work 
> > properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
> > current code, you get
> > 
> > $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> > rule 0 (data), x = 0..40000000, numrep = 2..2
> > rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >   device 0:             19765965        [9899364,9866601]
> >   device 1:             19768033        [9899444,9868589]
> >   device 2:             19769938        [9901770,9868168]
> >   device 3:             19766918        [9898851,9868067]
> >   device 6:             929148  [400572,528576]
> > 
> > which is very close for the first replica (primary), but way off for the 
> > second.  With my hacky change,
> > 
> > rule 0 (data), x = 0..40000000, numrep = 2..2
> > rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >   device 0:             19797315        [9899364,9897951]
> >   device 1:             19799199        [9899444,9899755]
> >   device 2:             19801016        [9901770,9899246]
> >   device 3:             19797906        [9898851,9899055]
> >   device 6:             804566  [400572,403994]
> > 
> > which is quite close, but still skewing slightly high (by a big less than 
> > 1%).
> > 
> > Next steps:
> > 
> > 1- generalize this for >2 replicas
> > 2- figure out why it skews high
> > 3- make this work for multi-level hierarchical descent
> > 
> > sage
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 10:36 ` Loic Dachary
  2017-02-13 14:21   ` Sage Weil
@ 2017-02-13 14:53   ` Gregory Farnum
  2017-02-20  8:47     ` Loic Dachary
  1 sibling, 1 reply; 70+ messages in thread
From: Gregory Farnum @ 2017-02-13 14:53 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, ceph-devel

On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
> Hi,
>
> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>
> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take in account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?

Keep in mind that in the math presented, "all disks" for our purposes
really means "all items within a CRUSH bucket" (at least, best I can
tell). So if you reweight a disk, you have to recalculate weights
within its bucket and within each parent bucket, but each bucket has a
bounded size N so the calculation should remain feasible. I didn't
step through the more complicated math at the end but it made
intuitive sense as far as I went.
-Greg

>
> Cheers
>
> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> This is a longstanding bug,
>>
>>       http://tracker.ceph.com/issues/15653
>>
>> that causes low-weighted devices to get more data than they should. Loic's
>> recent activity resurrected discussion on the original PR
>>
>>       https://github.com/ceph/ceph/pull/10218
>>
>> but since it's closed and almost nobody will see it I'm moving the
>> discussion here.
>>
>> The main news is that I have a simple adjustment for the weights that
>> works (almost perfectly) for the 2nd round of placements.  The solution is
>> pretty simple, although as with most probabilities it tends to make my
>> brain hurt.
>>
>> The idea is that, on the second round, the original weight for the small
>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>> given b) is P(a && b) / P(b),
>>
>>  P(pick small | first pick not small)
>>  = P(pick small && first pick not small) / P(first pick not small)
>>
>> The last term is easy to calculate,
>>
>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>
>> and the && term is the distribution we're trying to produce.  For exmaple,
>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> their second replica be the small OSD.  So
>>
>>  P(pick small && first pick not small) = small_weight / total_weight
>>
>> Putting those together,
>>
>>  P(pick small | first pick not small)
>>  = P(pick small && first pick not small) / P(first pick not small)
>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>  = small_weight / (total_weight - small_weight)
>>
>> This is, on the second round, we should adjust the weights by the above so
>> that we get the right distribution of second choices.  It turns out it
>> works to adjust *all* weights like this to get hte conditional probability
>> that they weren't already chosen.
>>
>> I have a branch that hacks this into straw2 and it appears to work
>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>> current code, you get
>>
>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>   device 0:             19765965        [9899364,9866601]
>>   device 1:             19768033        [9899444,9868589]
>>   device 2:             19769938        [9901770,9868168]
>>   device 3:             19766918        [9898851,9868067]
>>   device 6:             929148  [400572,528576]
>>
>> which is very close for the first replica (primary), but way off for the
>> second.  With my hacky change,
>>
>> rule 0 (data), x = 0..40000000, numrep = 2..2
>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>   device 0:             19797315        [9899364,9897951]
>>   device 1:             19799199        [9899444,9899755]
>>   device 2:             19801016        [9901770,9899246]
>>   device 3:             19797906        [9898851,9899055]
>>   device 6:             804566  [400572,403994]
>>
>> which is quite close, but still skewing slightly high (by a big less than
>> 1%).
>>
>> Next steps:
>>
>> 1- generalize this for >2 replicas
>> 2- figure out why it skews high
>> 3- make this work for multi-level hierarchical descent
>>
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 14:21   ` Sage Weil
@ 2017-02-13 18:50     ` Loic Dachary
  2017-02-13 19:16       ` Sage Weil
  2017-02-16 22:04     ` Pedro López-Adeva
  1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 18:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 7029 bytes --]

Hi Sage,

I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:

        00     01     02     03     04     05     06     07     08     09     10 
00:      0     14     17     14     19     23     13     22     21     20   1800 
01:     12      0     11     13     19     19     15     10     16     17   1841 
02:     17     27      0     17     15     15     13     19     18     11   1813 
03:     14     17     15      0     23     11     20     15     23     17   1792 
04:     14     18     16     25      0     27     13      8     15     16   1771 
05:     19     16     22     25     13      0      9     19     21     21   1813 
06:     18     15     21     17     10     18      0     10     18     11   1873 
07:     13     17     22     13     16     17     14      0     25     12   1719 
08:     23     20     16     17     19     18     11     12      0     18   1830 
09:     14     20     15     17     12     16     17     11     13      0   1828 
10:      0      0      0      0      0      0      0      0      0      0      0 

before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 


Each line shows how many objects moved from a given disk to the others after disk 10 was added. Most objects go to the new disk and around 1% go to each of the other disks. The before and after lines show how many objects are mapped to each disk. All disks have the same weight, and this is using two replicas and straw2. Does that look right?

Cheers

On 02/13/2017 03:21 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends and Pedro 
>> López-Adeva Fernández-Layos came up with a well written analysis of the 
>> problem and a tentative solution which he described at : 
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also 
>> means that the probability of each disk needs to take in account the 
>> weight of all disks. Which means that whenever a disk is added / removed 
>> or its weight is changed, this has an impact on the probability of all 
>> disks in the cluster and objects are likely to move everywhere. Am I 
>> mistaken ?
> 
> Maybe (I haven't looked closely at the above yet).  But for comparison, in 
> the normal straw2 case, adding or removing a disk also changes the 
> probabilities for everything else (e.g., removing one out of 10 identical 
> disks changes the probability from 1/10 to 1/9).  The key property that 
> straw2 *is* able to handle is that as long as the relative probabilities 
> between two unmodified disks does not change, then straw2 will avoid 
> moving any objects between them (i.e., all data movement is to or from 
> the disk that is reweighted).
> 
> sage
> 
> 
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> This is a longstanding bug,
>>>
>>> 	http://tracker.ceph.com/issues/15653
>>>
>>> that causes low-weighted devices to get more data than they should. Loic's 
>>> recent activity resurrected discussion on the original PR
>>>
>>> 	https://github.com/ceph/ceph/pull/10218
>>>
>>> but since it's closed and almost nobody will see it I'm moving the 
>>> discussion here.
>>>
>>> The main news is that I have a simple adjustment for the weights that 
>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>>> pretty simple, although as with most probabilities it tends to make my 
>>> brain hurt.
>>>
>>> The idea is that, on the second round, the original weight for the small 
>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>>> given b) is P(a && b) / P(b),
>>>
>>>  P(pick small | first pick not small)
>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>
>>> The last term is easy to calculate,
>>>
>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>
>>> and the && term is the distribution we're trying to produce.  For exmaple, 
>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>>> their second replica be the small OSD.  So
>>>
>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>
>>> Putting those together,
>>>
>>>  P(pick small | first pick not small)
>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>  = small_weight / (total_weight - small_weight)
>>>
>>> This is, on the second round, we should adjust the weights by the above so 
>>> that we get the right distribution of second choices.  It turns out it 
>>> works to adjust *all* weights like this to get hte conditional probability 
>>> that they weren't already chosen.
>>>
>>> I have a branch that hacks this into straw2 and it appears to work 
>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>>> current code, you get
>>>
>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>   device 0:             19765965        [9899364,9866601]
>>>   device 1:             19768033        [9899444,9868589]
>>>   device 2:             19769938        [9901770,9868168]
>>>   device 3:             19766918        [9898851,9868067]
>>>   device 6:             929148  [400572,528576]
>>>
>>> which is very close for the first replica (primary), but way off for the 
>>> second.  With my hacky change,
>>>
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>   device 0:             19797315        [9899364,9897951]
>>>   device 1:             19799199        [9899444,9899755]
>>>   device 2:             19801016        [9901770,9899246]
>>>   device 3:             19797906        [9898851,9899055]
>>>   device 6:             804566  [400572,403994]
>>>
>>> which is quite close, but still skewing slightly high (by a big less than 
>>> 1%).
>>>
>>> Next steps:
>>>
>>> 1- generalize this for >2 replicas
>>> 2- figure out why it skews high
>>> 3- make this work for multi-level hierarchical descent
>>>
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre

[-- Attachment #2: compare.c --]
[-- Type: text/x-csrc, Size: 4795 bytes --]

#include "mapper.h"
#include "builder.h"
#include "crush.h"
#include "hash.h"
#include "stdio.h"

#define NUMBER_OF_OBJECTS 100000

void map_with_crush(int replication_count, int hosts_count, int object_map[][NUMBER_OF_OBJECTS]) {
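  /* Build a CRUSH map with hosts_count straw2 host buckets (one disk of
   * equal weight 0x10000 each) under a straw2 root, install a
   * chooseleaf-by-host rule, then map NUMBER_OF_OBJECTS objects and record
   * in object_map[replica][object] which disk holds each replica. */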
  struct crush_map *m = crush_create();
  m->choose_local_tries = 0;
  m->choose_local_fallback_tries = 0;
  m->choose_total_tries = 50;
  m->chooseleaf_descend_once = 1;
  m->chooseleaf_vary_r = 1;
  m->chooseleaf_stable = 1;
  m->allowed_bucket_algs =
    (1 << CRUSH_BUCKET_UNIFORM) |
    (1 << CRUSH_BUCKET_LIST) |
    (1 << CRUSH_BUCKET_STRAW2);
  int root_type = 1;
  int host_type = 2;  
  int bucketno = 0;

  int hosts[hosts_count];
  int weights[hosts_count];
  int disk = 0;
  for(int host = 0; host < hosts_count; host++) {
    struct crush_bucket *b;

    b = crush_make_bucket(m, CRUSH_BUCKET_STRAW2, CRUSH_HASH_DEFAULT, host_type,
                          0, NULL, NULL);
    assert(b != NULL);
    assert(crush_bucket_add_item(m, b, disk, 0x10000) == 0);
    assert(crush_add_bucket(m, 0, b, &bucketno) == 0);
    hosts[host] = bucketno;
    weights[host] = 0x10000;
    disk++;
  }

  struct crush_bucket *root;
  int bucket_root;

  root = crush_make_bucket(m, CRUSH_BUCKET_STRAW2, CRUSH_HASH_DEFAULT, root_type,
                           hosts_count, hosts, weights);
  assert(root != NULL);
  assert(crush_add_bucket(m, 0, root, &bucket_root) == 0);
  assert(crush_reweight_bucket(m, root) == 0);
  
  struct crush_rule *r;
  int minsize = 1;
  int maxsize = 5;
  int number_of_steps = 3;
  r = crush_make_rule(number_of_steps, 0, 0, minsize, maxsize);
  assert(r != NULL);
  crush_rule_set_step(r, 0, CRUSH_RULE_TAKE, bucket_root, 0);
  crush_rule_set_step(r, 1, CRUSH_RULE_CHOOSELEAF_FIRSTN, replication_count, host_type);
  crush_rule_set_step(r, 2, CRUSH_RULE_EMIT, 0, 0);
  int ruleno = crush_add_rule(m, r, -1);
  assert(ruleno >= 0);

  crush_finalize(m);

  {
    int result[replication_count];
    __u32 weights[hosts_count];
    for(int i = 0; i < hosts_count; i++)
      weights[i] = 0x10000;
    int cwin_size = crush_work_size(m, replication_count);
    char cwin[cwin_size];
    crush_init_workspace(m, cwin);
    for(int x = 0; x < NUMBER_OF_OBJECTS; x++) {
      memset(result, '\0', sizeof(int) * replication_count);
      assert(crush_do_rule(m, ruleno, x, result, replication_count, weights, hosts_count, cwin) == replication_count);
      for(int i = 0; i < replication_count; i++) {
        object_map[i][x] = result[i];
      }
    }
  }
  crush_destroy(m);
}

int same_set(int object, int replication_count, int before[][NUMBER_OF_OBJECTS], int after[][NUMBER_OF_OBJECTS]) {
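  /* Return 1 if the set of disks holding this object's replicas is the same
   * before and after (ignoring replica order). Only referenced by the
   * commented-out call in with_crush() below. */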
  for(int r = 0; r < replication_count; r++) {
    int found = 0;
    for(int s = 0; s < replication_count; s++)
      if(before[r][object] == after[s][object]) {
        found = 1;
        break;
      }
    if(!found)
      return 0;
  }
  return 1;
}

void with_crush(int replication_count, int hosts_count) {
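  /* Map every object twice -- once with hosts_count hosts, once with one
   * extra host -- then print a matrix counting how many replicas moved from
   * disk 'from' to disk 'to' (each replica position compared independently),
   * plus per-disk object totals before and after. */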
  int before[replication_count][NUMBER_OF_OBJECTS];
  map_with_crush(replication_count, hosts_count, &before[0]);
  int after[replication_count][NUMBER_OF_OBJECTS];
  map_with_crush(replication_count, hosts_count+1, &after[0]);
  int movement[hosts_count + 1][hosts_count + 1];
  memset(movement, '\0', sizeof(movement));
  int count_before[hosts_count + 1];
  memset(count_before, '\0', sizeof(count_before));
  int count_after[hosts_count + 1];
  memset(count_after, '\0', sizeof(count_after));
  for(int object = 0; object < NUMBER_OF_OBJECTS; object++) {
    //    if(same_set(object, replication_count, &before[0], &after[0]))
    //      continue;
    for(int replica = 0; replica < replication_count; replica++) {
      count_before[before[replica][object]]++;
      count_after[after[replica][object]]++;
      if (before[replica][object] == after[replica][object])
        continue;
      movement[before[replica][object]][after[replica][object]]++;
    }
  }
  printf("    ");
  for(int host = 0; host < hosts_count + 1; host++)
    printf("    %02d ", host);
  printf("\n");
  for(int from = 0; from < hosts_count + 1; from++) {
    printf("%02d: ", from);
    for(int to = 0; to < hosts_count + 1; to++)
      printf("%6d ", movement[from][to]);
    printf("\n");
  }

  printf("before: ");
  for(int host = 0; host < hosts_count + 1; host++)
    printf("%6d ", count_before[host]);
  printf("\n");  
  printf("after:  ");
  for(int host = 0; host < hosts_count + 1; host++)
    printf("%6d ", count_after[host]);
  printf("\n");
}

int main(int argc, char* argv[]) {
  int replication_count = atoi(argv[1]);
  int hosts_count = atoi(argv[2]);
  with_crush(replication_count, hosts_count);
}

/*
 * Local Variables:
 * compile-command: "gcc -g -o compare compare.c $(pkg-config --cflags --libs libcrush) && ./compare 2 10"
 * End:
 */

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 18:50     ` Loic Dachary
@ 2017-02-13 19:16       ` Sage Weil
  2017-02-13 20:18         ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-02-13 19:16 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 8071 bytes --]

On Mon, 13 Feb 2017, Loic Dachary wrote:
> Hi Sage,
> 
> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
> 
>         00     01     02     03     04     05     06     07     08     09     10 
> 00:      0     14     17     14     19     23     13     22     21     20   1800 
> 01:     12      0     11     13     19     19     15     10     16     17   1841 
> 02:     17     27      0     17     15     15     13     19     18     11   1813 
> 03:     14     17     15      0     23     11     20     15     23     17   1792 
> 04:     14     18     16     25      0     27     13      8     15     16   1771 
> 05:     19     16     22     25     13      0      9     19     21     21   1813 
> 06:     18     15     21     17     10     18      0     10     18     11   1873 
> 07:     13     17     22     13     16     17     14      0     25     12   1719 
> 08:     23     20     16     17     19     18     11     12      0     18   1830 
> 09:     14     20     15     17     12     16     17     11     13      0   1828 
> 10:      0      0      0      0      0      0      0      0      0      0      0 
> 
> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 
> 
> 
> Each line shows how many objects moved from a given disk to the others 
> after disk 10 was added. Most objects go to the new disk and around 1% 
> go to each other disks. The before and after lines show how many objects 
> are mapped to each disk. They all have the same weight and it's using 
> replica 2 and straw2. Does that look right ?

Hmm, that doesn't look right.  This is what the CRUSH.straw2_reweight unit 
test is there to validate: that data only moves to or from the device whose 
weight changed.

It also follows from the straw2 algorithm itself: each possible choice 
gets a 'straw' length derived only from that item's weight (and other 
fixed factors, like the item id and the bucket id), and we select the max 
across all items.  Two devices whose weights didn't change will have the 
same straw lengths, and the max between them will not change.  It's only 
possible that the changed item's straw length changed and wasn't max and 
now is (got longer) or was max and now isn't (got shorter).
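
A minimal sketch of that argument (toy hash and linear scaling of my own, not the actual crush_hash32_3 / straw2 ln-based code, so it makes no claim about the distribution being weight-proportional; it only illustrates the independence property):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Toy straw2-style draw: each item's straw depends only on (x, item id,
 * item weight), and the longest straw wins.  Stand-in hash, not CRUSH's. */
static uint32_t toy_hash(uint32_t x, uint32_t id) {
  uint32_t h = x * 2654435761u ^ (id + 1) * 40503u;
  h ^= h >> 15;
  return h ^ (h >> 7);
}

static int toy_choose(uint32_t x, const uint32_t *weights, int n) {
  int best = 0;
  uint64_t best_straw = 0;
  for (int i = 0; i < n; i++) {
    /* The straw of item i does not look at any other item's weight. */
    uint64_t straw = (uint64_t)toy_hash(x, i) * weights[i];
    if (straw > best_straw) {
      best = i;
      best_straw = straw;
    }
  }
  return best;
}

int main(void) {
  uint32_t before[] = {100, 100, 100, 100, 100};
  uint32_t after[]  = {100, 100, 150, 100, 100};   /* only item 2 reweighted */
  for (uint32_t x = 0; x < 1000000; x++) {
    int a = toy_choose(x, before, 5);
    int b = toy_choose(x, after, 5);
    /* Items other than 2 keep exactly the same straws in both runs, so if
     * the winner changed, item 2 must be the old or the new winner. */
    if (a != b)
      assert(a == 2 || b == 2);
  }
  printf("all movement is to or from the reweighted item\n");
  return 0;
}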

sage


> 
> Cheers
> 
> On 02/13/2017 03:21 PM, Sage Weil wrote:
> > On Mon, 13 Feb 2017, Loic Dachary wrote:
> >> Hi,
> >>
> >> Dan van der Ster reached out to colleagues and friends and Pedro 
> >> López-Adeva Fernández-Layos came up with a well written analysis of the 
> >> problem and a tentative solution which he described at : 
> >> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>
> >> Unless I'm reading the document incorrectly (very possible ;) it also 
> >> means that the probability of each disk needs to take in account the 
> >> weight of all disks. Which means that whenever a disk is added / removed 
> >> or its weight is changed, this has an impact on the probability of all 
> >> disks in the cluster and objects are likely to move everywhere. Am I 
> >> mistaken ?
> > 
> > Maybe (I haven't looked closely at the above yet).  But for comparison, in 
> > the normal straw2 case, adding or removing a disk also changes the 
> > probabilities for everything else (e.g., removing one out of 10 identical 
> > disks changes the probability from 1/10 to 1/9).  The key property that 
> > straw2 *is* able to handle is that as long as the relative probabilities 
> > between two unmodified disks does not change, then straw2 will avoid 
> > moving any objects between them (i.e., all data movement is to or from 
> > the disk that is reweighted).
> > 
> > sage
> > 
> > 
> >>
> >> Cheers
> >>
> >> On 01/26/2017 04:05 AM, Sage Weil wrote:
> >>> This is a longstanding bug,
> >>>
> >>> 	http://tracker.ceph.com/issues/15653
> >>>
> >>> that causes low-weighted devices to get more data than they should. Loic's 
> >>> recent activity resurrected discussion on the original PR
> >>>
> >>> 	https://github.com/ceph/ceph/pull/10218
> >>>
> >>> but since it's closed and almost nobody will see it I'm moving the 
> >>> discussion here.
> >>>
> >>> The main news is that I have a simple adjustment for the weights that 
> >>> works (almost perfectly) for the 2nd round of placements.  The solution is 
> >>> pretty simple, although as with most probabilities it tends to make my 
> >>> brain hurt.
> >>>
> >>> The idea is that, on the second round, the original weight for the small 
> >>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
> >>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
> >>> given b) is P(a && b) / P(b),
> >>>
> >>>  P(pick small | first pick not small)
> >>>  = P(pick small && first pick not small) / P(first pick not small)
> >>>
> >>> The last term is easy to calculate,
> >>>
> >>>  P(first pick not small) = (total_weight - small_weight) / total_weight
> >>>
> >>> and the && term is the distribution we're trying to produce.  For exmaple, 
> >>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
> >>> their second replica be the small OSD.  So
> >>>
> >>>  P(pick small && first pick not small) = small_weight / total_weight
> >>>
> >>> Putting those together,
> >>>
> >>>  P(pick small | first pick not small)
> >>>  = P(pick small && first pick not small) / P(first pick not small)
> >>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >>>  = small_weight / (total_weight - small_weight)
> >>>
> >>> This is, on the second round, we should adjust the weights by the above so 
> >>> that we get the right distribution of second choices.  It turns out it 
> >>> works to adjust *all* weights like this to get hte conditional probability 
> >>> that they weren't already chosen.
> >>>
> >>> I have a branch that hacks this into straw2 and it appears to work 
> >>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
> >>> current code, you get
> >>>
> >>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> >>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >>>   device 0:             19765965        [9899364,9866601]
> >>>   device 1:             19768033        [9899444,9868589]
> >>>   device 2:             19769938        [9901770,9868168]
> >>>   device 3:             19766918        [9898851,9868067]
> >>>   device 6:             929148  [400572,528576]
> >>>
> >>> which is very close for the first replica (primary), but way off for the 
> >>> second.  With my hacky change,
> >>>
> >>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >>>   device 0:             19797315        [9899364,9897951]
> >>>   device 1:             19799199        [9899444,9899755]
> >>>   device 2:             19801016        [9901770,9899246]
> >>>   device 3:             19797906        [9898851,9899055]
> >>>   device 6:             804566  [400572,403994]
> >>>
> >>> which is quite close, but still skewing slightly high (by a big less than 
> >>> 1%).
> >>>
> >>> Next steps:
> >>>
> >>> 1- generalize this for >2 replicas
> >>> 2- figure out why it skews high
> >>> 3- make this work for multi-level hierarchical descent
> >>>
> >>> sage
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> >> -- 
> >> Loïc Dachary, Artisan Logiciel Libre
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 19:16       ` Sage Weil
@ 2017-02-13 20:18         ` Loic Dachary
  2017-02-13 21:01           ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 20:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 02/13/2017 08:16 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi Sage,
>>
>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>>
>>         00     01     02     03     04     05     06     07     08     09     10 
>> 00:      0     14     17     14     19     23     13     22     21     20   1800 
>> 01:     12      0     11     13     19     19     15     10     16     17   1841 
>> 02:     17     27      0     17     15     15     13     19     18     11   1813 
>> 03:     14     17     15      0     23     11     20     15     23     17   1792 
>> 04:     14     18     16     25      0     27     13      8     15     16   1771 
>> 05:     19     16     22     25     13      0      9     19     21     21   1813 
>> 06:     18     15     21     17     10     18      0     10     18     11   1873 
>> 07:     13     17     22     13     16     17     14      0     25     12   1719 
>> 08:     23     20     16     17     19     18     11     12      0     18   1830 
>> 09:     14     20     15     17     12     16     17     11     13      0   1828 
>> 10:      0      0      0      0      0      0      0      0      0      0      0 
>>
>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 
>>
>>
>> Each line shows how many objects moved from a given disk to the others 
>> after disk 10 was added. Most objects go to the new disk and around 1% 
>> go to each other disks. The before and after lines show how many objects 
>> are mapped to each disk. They all have the same weight and it's using 
>> replica 2 and straw2. Does that look right ?
> 
> Hmm, that doesn't look right.  This is what the CRUSH.straw2_reweight unit 
> test is there to validate: that data on moves to or from the device whose 
> weight changed.

In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops over all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get, though...

> It also follows from the straw2 algorithm itself: each possible choice 
> gets a 'straw' length derived only from that item's weight (and other 
> fixed factors, like the item id and the bucket id), and we select the max 
> across all items.  Two devices whose weights didn't change will have the 
> same straw lengths, and the max between them will not change.  It's only 
> possible that the changed item's straw length changed and wasn't max and 
> now is (got longer) or was max and now isn't (got shorter).

That's a crystal clear explanation, cool :-)

Cheers

> sage
> 
> 
>>
>> Cheers
>>
>> On 02/13/2017 03:21 PM, Sage Weil wrote:
>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends and Pedro 
>>>> López-Adeva Fernández-Layos came up with a well written analysis of the 
>>>> problem and a tentative solution which he described at : 
>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also 
>>>> means that the probability of each disk needs to take in account the 
>>>> weight of all disks. Which means that whenever a disk is added / removed 
>>>> or its weight is changed, this has an impact on the probability of all 
>>>> disks in the cluster and objects are likely to move everywhere. Am I 
>>>> mistaken ?
>>>
>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in 
>>> the normal straw2 case, adding or removing a disk also changes the 
>>> probabilities for everything else (e.g., removing one out of 10 identical 
>>> disks changes the probability from 1/10 to 1/9).  The key property that 
>>> straw2 *is* able to handle is that as long as the relative probabilities 
>>> between two unmodified disks does not change, then straw2 will avoid 
>>> moving any objects between them (i.e., all data movement is to or from 
>>> the disk that is reweighted).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Cheers
>>>>
>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>> This is a longstanding bug,
>>>>>
>>>>> 	http://tracker.ceph.com/issues/15653
>>>>>
>>>>> that causes low-weighted devices to get more data than they should. Loic's 
>>>>> recent activity resurrected discussion on the original PR
>>>>>
>>>>> 	https://github.com/ceph/ceph/pull/10218
>>>>>
>>>>> but since it's closed and almost nobody will see it I'm moving the 
>>>>> discussion here.
>>>>>
>>>>> The main news is that I have a simple adjustment for the weights that 
>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>>>>> pretty simple, although as with most probabilities it tends to make my 
>>>>> brain hurt.
>>>>>
>>>>> The idea is that, on the second round, the original weight for the small 
>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>>  P(pick small | first pick not small)
>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>
>>>>> The last term is easy to calculate,
>>>>>
>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>
>>>>> and the && term is the distribution we're trying to produce.  For exmaple, 
>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>>>>> their second replica be the small OSD.  So
>>>>>
>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>
>>>>> Putting those together,
>>>>>
>>>>>  P(pick small | first pick not small)
>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>  = small_weight / (total_weight - small_weight)
>>>>>
>>>>> This is, on the second round, we should adjust the weights by the above so 
>>>>> that we get the right distribution of second choices.  It turns out it 
>>>>> works to adjust *all* weights like this to get hte conditional probability 
>>>>> that they weren't already chosen.
>>>>>
>>>>> I have a branch that hacks this into straw2 and it appears to work 
>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>>>>> current code, you get
>>>>>
>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>   device 6:             929148  [400572,528576]
>>>>>
>>>>> which is very close for the first replica (primary), but way off for the 
>>>>> second.  With my hacky change,
>>>>>
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>   device 6:             804566  [400572,403994]
>>>>>
>>>>> which is quite close, but still skewing slightly high (by a big less than 
>>>>> 1%).
>>>>>
>>>>> Next steps:
>>>>>
>>>>> 1- generalize this for >2 replicas
>>>>> 2- figure out why it skews high
>>>>> 3- make this work for multi-level hierarchical descent
>>>>>
>>>>> sage
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> -- 
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 20:18         ` Loic Dachary
@ 2017-02-13 21:01           ` Loic Dachary
  2017-02-13 21:15             ` Sage Weil
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 21:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I get the expected behavior for replica 1 (which is what CRUSH.straw2_reweight tests). The movement between buckets observed below is for replica 2.

        00     01     02     03     04     05     06     07     08     09     10
00:      0      0      0      0      0      0      0      0      0      0    927
01:      0      0      0      0      0      0      0      0      0      0    904
02:      0      0      0      0      0      0      0      0      0      0    928
03:      0      0      0      0      0      0      0      0      0      0    886
04:      0      0      0      0      0      0      0      0      0      0    927
05:      0      0      0      0      0      0      0      0      0      0    927
06:      0      0      0      0      0      0      0      0      0      0    930
07:      0      0      0      0      0      0      0      0      0      0    842
08:      0      0      0      0      0      0      0      0      0      0    943
09:      0      0      0      0      0      0      0      0      0      0    904
10:      0      0      0      0      0      0      0      0      0      0      0
before:  10149  10066   9893   9955  10030  10025   9895  10013  10008   9966      0
after:    9222   9162   8965   9069   9103   9098   8965   9171   9065   9062   9118


On 02/13/2017 09:18 PM, Loic Dachary wrote:
> 
> 
> On 02/13/2017 08:16 PM, Sage Weil wrote:
>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>> Hi Sage,
>>>
>>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>>>
>>>         00     01     02     03     04     05     06     07     08     09     10 
>>> 00:      0     14     17     14     19     23     13     22     21     20   1800 
>>> 01:     12      0     11     13     19     19     15     10     16     17   1841 
>>> 02:     17     27      0     17     15     15     13     19     18     11   1813 
>>> 03:     14     17     15      0     23     11     20     15     23     17   1792 
>>> 04:     14     18     16     25      0     27     13      8     15     16   1771 
>>> 05:     19     16     22     25     13      0      9     19     21     21   1813 
>>> 06:     18     15     21     17     10     18      0     10     18     11   1873 
>>> 07:     13     17     22     13     16     17     14      0     25     12   1719 
>>> 08:     23     20     16     17     19     18     11     12      0     18   1830 
>>> 09:     14     20     15     17     12     16     17     11     13      0   1828 
>>> 10:      0      0      0      0      0      0      0      0      0      0      0 
>>>
>>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
>>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 
>>>
>>>
>>> Each line shows how many objects moved from a given disk to the others 
>>> after disk 10 was added. Most objects go to the new disk and around 1% 
>>> go to each other disks. The before and after lines show how many objects 
>>> are mapped to each disk. They all have the same weight and it's using 
>>> replica 2 and straw2. Does that look right ?
>>
>> Hmm, that doesn't look right.  This is what the CRUSH.straw2_reweight unit 
>> test is there to validate: that data on moves to or from the device whose 
>> weight changed.
> 
> In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops on all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get though...
> 
>> It also follows from the straw2 algorithm itself: each possible choice 
>> gets a 'straw' length derived only from that item's weight (and other 
>> fixed factors, like the item id and the bucket id), and we select the max 
>> across all items.  Two devices whose weights didn't change will have the 
>> same straw lengths, and the max between them will not change.  It's only 
>> possible that the changed item's straw length changed and wasn't max and 
>> now is (got longer) or was max and now isn't (got shorter).
> 
> That's a crystal clear explanation, cool :-)
> 
> Cheers
> 
>> sage
>>
>>
>>>
>>> Cheers
>>>
>>> On 02/13/2017 03:21 PM, Sage Weil wrote:
>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>> Hi,
>>>>>
>>>>> Dan van der Ster reached out to colleagues and friends and Pedro 
>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the 
>>>>> problem and a tentative solution which he described at : 
>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>
>>>>> Unless I'm reading the document incorrectly (very possible ;) it also 
>>>>> means that the probability of each disk needs to take into account the 
>>>>> weight of all disks. Which means that whenever a disk is added / removed 
>>>>> or its weight is changed, this has an impact on the probability of all 
>>>>> disks in the cluster and objects are likely to move everywhere. Am I 
>>>>> mistaken ?
>>>>
>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in 
>>>> the normal straw2 case, adding or removing a disk also changes the 
>>>> probabilities for everything else (e.g., removing one out of 10 identical 
>>>> disks changes the probability from 1/10 to 1/9).  The key property that 
>>>> straw2 *is* able to handle is that as long as the relative probabilities 
>>>> between two unmodified disks does not change, then straw2 will avoid 
>>>> moving any objects between them (i.e., all data movement is to or from 
>>>> the disk that is reweighted).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>> 	http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's 
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>> 	https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the 
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that 
>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>>>>>> pretty simple, although as with most probabilities it tends to make my 
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small 
>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce.  For exmaple, 
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>>>>>> their second replica be the small OSD.  So
>>>>>>
>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> This is, on the second round, we should adjust the weights by the above so 
>>>>>> that we get the right distribution of second choices.  It turns out it 
>>>>>> works to adjust *all* weights like this to get hte conditional probability 
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work 
>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>>>>>> current code, you get
>>>>>>
>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>   device 6:             929148  [400572,528576]
>>>>>>
>>>>>> which is very close for the first replica (primary), but way off for the 
>>>>>> second.  With my hacky change,
>>>>>>
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>   device 6:             804566  [400572,403994]
>>>>>>
>>>>>> which is quite close, but still skewing slightly high (by a big less than 
>>>>>> 1%).
>>>>>>
>>>>>> Next steps:
>>>>>>
>>>>>> 1- generalize this for >2 replicas
>>>>>> 2- figure out why it skews high
>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>
>>>>>> sage
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>> -- 
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>
>>> -- 
>>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 21:01           ` Loic Dachary
@ 2017-02-13 21:15             ` Sage Weil
  2017-02-13 21:19               ` Gregory Farnum
  2017-02-13 21:43               ` Loic Dachary
  0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-13 21:15 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

On Mon, 13 Feb 2017, Loic Dachary wrote:
> I get the expected behavior for replica 1 (which is what 
> CRUSH.straw2_reweight does). The movement between buckets observed 
> below is for replica 2.

Oh, right, now I remember.  The movement for the second replica is 
unavoidable (as far as I can see).  For the second replica, sometimes we 
end up picking a dup (the same thing we got for the first 
replica) and trying again; any change in the behavior of the first choice 
may mean that we have more or less "second tries."  Although any given try 
will behave as we like (only moving to or from the reweighted item), 
adding new tries will pick uniformly.  In your example below, I think all 
of the second replicas that moved to osds 0-9 were objects that originally 
picked a dup for the second try and, once 10 was added, did not--because 
the first replica was now on the new osd 10.
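
Here's a quick toy model of that effect, if it helps to see it concretely 
(plain python, a simplified straw-like draw plus a retry-on-dup loop; not the 
actual crush code, and the hashing is made up):

import hashlib, math

def straw(x, r, item, weight):
    # deterministic pseudo-random draw per (input, try, item); the maximum of
    # log(u)/weight wins, so for a fixed try an unchanged item can only lose
    # the max to the added item, never to another unchanged item
    h = int.from_bytes(hashlib.sha256(f"{x}:{r}:{item}".encode()).digest()[:8], "big")
    u = (h + 1) / 2.0 ** 64                  # uniform in (0, 1]
    return math.log(u) / weight

def pick(x, r, items, weights):
    return max(items, key=lambda i: straw(x, r, i, weights[i]))

def place(x, items, weights, nrep=2):
    out = [pick(x, 0, items, weights)]
    r = 1
    while len(out) < nrep:                   # retry with a new r if we drew a dup
        c = pick(x, r, items, weights)
        r += 1
        if c not in out:
            out.append(c)
    return out

old, new = list(range(10)), list(range(11))
w = {i: 1.0 for i in new}
moved = 0
for x in range(10000):
    before = place(x, old, w)
    after = place(x, new, w)
    if after[1] != 10 and after[1] != before[1]:
        moved += 1
print("second replicas that moved between osds 0-9:", moved)

First replicas only ever move to the new item, but a small number of second 
replicas move between the unchanged items, for exactly the reason above.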

sage

>         00     01     02     03     04     05     06     07     08     09     10
> 00:      0      0      0      0      0      0      0      0      0      0    927
> 01:      0      0      0      0      0      0      0      0      0      0    904
> 02:      0      0      0      0      0      0      0      0      0      0    928
> 03:      0      0      0      0      0      0      0      0      0      0    886
> 04:      0      0      0      0      0      0      0      0      0      0    927
> 05:      0      0      0      0      0      0      0      0      0      0    927
> 06:      0      0      0      0      0      0      0      0      0      0    930
> 07:      0      0      0      0      0      0      0      0      0      0    842
> 08:      0      0      0      0      0      0      0      0      0      0    943
> 09:      0      0      0      0      0      0      0      0      0      0    904
> 10:      0      0      0      0      0      0      0      0      0      0      0
> before:  10149  10066   9893   9955  10030  10025   9895  10013  10008   9966      0
> after:    9222   9162   8965   9069   9103   9098   8965   9171   9065   9062   9118
> 
> 
> On 02/13/2017 09:18 PM, Loic Dachary wrote:
> > 
> > 
> > On 02/13/2017 08:16 PM, Sage Weil wrote:
> >> On Mon, 13 Feb 2017, Loic Dachary wrote:
> >>> Hi Sage,
> >>>
> >>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
> >>>
> >>>         00     01     02     03     04     05     06     07     08     09     10 
> >>> 00:      0     14     17     14     19     23     13     22     21     20   1800 
> >>> 01:     12      0     11     13     19     19     15     10     16     17   1841 
> >>> 02:     17     27      0     17     15     15     13     19     18     11   1813 
> >>> 03:     14     17     15      0     23     11     20     15     23     17   1792 
> >>> 04:     14     18     16     25      0     27     13      8     15     16   1771 
> >>> 05:     19     16     22     25     13      0      9     19     21     21   1813 
> >>> 06:     18     15     21     17     10     18      0     10     18     11   1873 
> >>> 07:     13     17     22     13     16     17     14      0     25     12   1719 
> >>> 08:     23     20     16     17     19     18     11     12      0     18   1830 
> >>> 09:     14     20     15     17     12     16     17     11     13      0   1828 
> >>> 10:      0      0      0      0      0      0      0      0      0      0      0 
> >>>
> >>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
> >>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 
> >>>
> >>>
> >>> Each line shows how many objects moved from a given disk to the others 
> >>> after disk 10 was added. Most objects go to the new disk and around 1% 
> >>> go to each of the other disks. The before and after lines show how many objects 
> >>> are mapped to each disk. They all have the same weight and it's using 
> >>> replica 2 and straw2. Does that look right ?
> >>
> >> Hmm, that doesn't look right.  This is what the CRUSH.straw2_reweight unit 
> >> test is there to validate: that data only moves to or from the device whose 
> >> weight changed.
> > 
> > In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops on all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get though...
> > 
> >> It also follows from the straw2 algorithm itself: each possible choice 
> >> gets a 'straw' length derived only from that item's weight (and other 
> >> fixed factors, like the item id and the bucket id), and we select the max 
> >> across all items.  Two devices whose weights didn't change will have the 
> >> same straw lengths, and the max between them will not change.  It's only 
> >> possible that the changed item's straw length changed and wasn't max and 
> >> now is (got longer) or was max and now isn't (got shorter).
> > 
> > That's a crystal clear explanation, cool :-)
> > 
> > Cheers
> > 
> >> sage
> >>
> >>
> >>>
> >>> Cheers
> >>>
> >>> On 02/13/2017 03:21 PM, Sage Weil wrote:
> >>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Dan van der Ster reached out to colleagues and friends and Pedro 
> >>>>> López-Adeva Fernández-Layos came up with a well written analysis of the 
> >>>>> problem and a tentative solution which he described at : 
> >>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>>>>
> >>>>> Unless I'm reading the document incorrectly (very possible ;) it also 
> >>>>> means that the probability of each disk needs to take into account the 
> >>>>> weight of all disks. Which means that whenever a disk is added / removed 
> >>>>> or its weight is changed, this has an impact on the probability of all 
> >>>>> disks in the cluster and objects are likely to move everywhere. Am I 
> >>>>> mistaken ?
> >>>>
> >>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in 
> >>>> the normal straw2 case, adding or removing a disk also changes the 
> >>>> probabilities for everything else (e.g., removing one out of 10 identical 
> >>>> disks changes the probability from 1/10 to 1/9).  The key property that 
> >>>> straw2 *is* able to handle is that as long as the relative probabilities 
> >>>> between two unmodified disks does not change, then straw2 will avoid 
> >>>> moving any objects between them (i.e., all data movement is to or from 
> >>>> the disk that is reweighted).
> >>>>
> >>>> sage
> >>>>
> >>>>
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
> >>>>>> This is a longstanding bug,
> >>>>>>
> >>>>>> 	http://tracker.ceph.com/issues/15653
> >>>>>>
> >>>>>> that causes low-weighted devices to get more data than they should. Loic's 
> >>>>>> recent activity resurrected discussion on the original PR
> >>>>>>
> >>>>>> 	https://github.com/ceph/ceph/pull/10218
> >>>>>>
> >>>>>> but since it's closed and almost nobody will see it I'm moving the 
> >>>>>> discussion here.
> >>>>>>
> >>>>>> The main news is that I have a simple adjustment for the weights that 
> >>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
> >>>>>> pretty simple, although as with most probabilities it tends to make my 
> >>>>>> brain hurt.
> >>>>>>
> >>>>>> The idea is that, on the second round, the original weight for the small 
> >>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
> >>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
> >>>>>> given b) is P(a && b) / P(b),
> >>>>>>
> >>>>>>  P(pick small | first pick not small)
> >>>>>>  = P(pick small && first pick not small) / P(first pick not small)
> >>>>>>
> >>>>>> The last term is easy to calculate,
> >>>>>>
> >>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
> >>>>>>
> >>>>>> and the && term is the distribution we're trying to produce.  For exmaple, 
> >>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
> >>>>>> their second replica be the small OSD.  So
> >>>>>>
> >>>>>>  P(pick small && first pick not small) = small_weight / total_weight
> >>>>>>
> >>>>>> Putting those together,
> >>>>>>
> >>>>>>  P(pick small | first pick not small)
> >>>>>>  = P(pick small && first pick not small) / P(first pick not small)
> >>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
> >>>>>>  = small_weight / (total_weight - small_weight)
> >>>>>>
> >>>>>> This is, on the second round, we should adjust the weights by the above so 
> >>>>>> that we get the right distribution of second choices.  It turns out it 
> >>>>>> works to adjust *all* weights like this to get hte conditional probability 
> >>>>>> that they weren't already chosen.
> >>>>>>
> >>>>>> I have a branch that hacks this into straw2 and it appears to work 
> >>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
> >>>>>> current code, you get
> >>>>>>
> >>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
> >>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >>>>>>   device 0:             19765965        [9899364,9866601]
> >>>>>>   device 1:             19768033        [9899444,9868589]
> >>>>>>   device 2:             19769938        [9901770,9868168]
> >>>>>>   device 3:             19766918        [9898851,9868067]
> >>>>>>   device 6:             929148  [400572,528576]
> >>>>>>
> >>>>>> which is very close for the first replica (primary), but way off for the 
> >>>>>> second.  With my hacky change,
> >>>>>>
> >>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
> >>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
> >>>>>>   device 0:             19797315        [9899364,9897951]
> >>>>>>   device 1:             19799199        [9899444,9899755]
> >>>>>>   device 2:             19801016        [9901770,9899246]
> >>>>>>   device 3:             19797906        [9898851,9899055]
> >>>>>>   device 6:             804566  [400572,403994]
> >>>>>>
> >>>>>> which is quite close, but still skewing slightly high (by a big less than 
> >>>>>> 1%).
> >>>>>>
> >>>>>> Next steps:
> >>>>>>
> >>>>>> 1- generalize this for >2 replicas
> >>>>>> 2- figure out why it skews high
> >>>>>> 3- make this work for multi-level hierarchical descent
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>>> the body of a message to majordomo@vger.kernel.org
> >>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>
> >>>>>
> >>>>> -- 
> >>>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>> the body of a message to majordomo@vger.kernel.org
> >>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>
> >>> -- 
> >>> Loïc Dachary, Artisan Logiciel Libre
> > 
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 21:15             ` Sage Weil
@ 2017-02-13 21:19               ` Gregory Farnum
  2017-02-13 21:26                 ` Sage Weil
  2017-02-13 21:43               ` Loic Dachary
  1 sibling, 1 reply; 70+ messages in thread
From: Gregory Farnum @ 2017-02-13 21:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

On Mon, Feb 13, 2017 at 1:15 PM, Sage Weil <sweil@redhat.com> wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> I get the expected behavior for replica 1 (which is what
>> CRUSH.straw2_reweight does). The movement between buckets observed
>> below is for replica 2.
>
> Oh, right, now I remember.  The movement for the second replica is
> unavoidable (as far as I can see).  For the second replica, sometimes we
> end up picking a dup (the same thing we got for the first
> replica) and trying again; any change in the behavior of the first choice
> may mean that we have more or less "second tries."  Although any given try
> will behave as we like (only moving to or from the reweighted item),
> adding new tries will pick uniformly.  In your example below, I think all
> of the second replicas that moved to osds 0-9 were objects that originally
> picked a dup for the second try and, once 10 was added, did not--because
> the first replica was now on the new osd 10.

Just to be clear, that's within a bucket, right?
Because obviously changing bucket weights in the CRUSH hierarchy will
move new data to them, not all of which ends up on the new disk.
-Greg

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 21:19               ` Gregory Farnum
@ 2017-02-13 21:26                 ` Sage Weil
  0 siblings, 0 replies; 70+ messages in thread
From: Sage Weil @ 2017-02-13 21:26 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Loic Dachary, ceph-devel

On Mon, 13 Feb 2017, Gregory Farnum wrote:
> On Mon, Feb 13, 2017 at 1:15 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Mon, 13 Feb 2017, Loic Dachary wrote:
> >> I get the expected behavior for replica 1 (which is what
> >> CRUSH.straw2_reweight does). The movement between buckets observed
> >> below is for replica 2.
> >
> > Oh, right, now I remember.  The movement for the second replica is
> > unavoidable (as far as I can see).  For the second replica, sometimes we
> > end up picking a dup (the same thing we got for the first
> > replica) and trying again; any change in the behavior of the first choice
> > may mean that we have more or less "second tries."  Although any given try
> > will behave as we like (only moving to or from the reweighted item),
> > adding new tries will pick uniformly.  In your example below, I think all
> > of the second replicas that moved to osds 0-9 were objects that originally
> > picked a dup for the second try and, once 10 was added, did not--because
> > the first replica was now on the new osd 10.
> 
> Just to be clear, that's within a bucket, right?

Right, within a (straw2) bucket.

> Because obviously changing bucket weights in the CRUSH hierarchy will
> move new data to them, not all of which ends up on the new disk.

Yep!

sage

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 21:15             ` Sage Weil
  2017-02-13 21:19               ` Gregory Farnum
@ 2017-02-13 21:43               ` Loic Dachary
  1 sibling, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-13 21:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



On 02/13/2017 10:15 PM, Sage Weil wrote:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> I get the expected behavior for replica 1 (which is what 
>> CRUSH.straw2_reweight does). The movement between buckets observed 
>> below is for replica 2.
> 
> Oh, right, now I remember.  The movement for the second replica is 
> unavoidable (as far as I can see).  For the second replica, sometimes we 
> end up picking a dup (the same thing we got for the first 
> replica) and trying again; any change in the behavior of the first choice 
> may mean that we have more or less "second tries."  Although any given try 
> will behave as we like (only moving to or from the reweighted item), 
> adding new tries will pick uniformly.  In your example below, I think all 
> of the second replicas that moved to osds 0-9 were objects that originally 
> picked a dup for the second try and, once 10 was added, did not--because 
> the first replica was now on the new osd 10.

So this is another manifestation of the multipick anomaly ?

> sage
> 
>>         00     01     02     03     04     05     06     07     08     09     10
>> 00:      0      0      0      0      0      0      0      0      0      0    927
>> 01:      0      0      0      0      0      0      0      0      0      0    904
>> 02:      0      0      0      0      0      0      0      0      0      0    928
>> 03:      0      0      0      0      0      0      0      0      0      0    886
>> 04:      0      0      0      0      0      0      0      0      0      0    927
>> 05:      0      0      0      0      0      0      0      0      0      0    927
>> 06:      0      0      0      0      0      0      0      0      0      0    930
>> 07:      0      0      0      0      0      0      0      0      0      0    842
>> 08:      0      0      0      0      0      0      0      0      0      0    943
>> 09:      0      0      0      0      0      0      0      0      0      0    904
>> 10:      0      0      0      0      0      0      0      0      0      0      0
>> before:  10149  10066   9893   9955  10030  10025   9895  10013  10008   9966      0
>> after:    9222   9162   8965   9069   9103   9098   8965   9171   9065   9062   9118
>>
>>
>> On 02/13/2017 09:18 PM, Loic Dachary wrote:
>>>
>>>
>>> On 02/13/2017 08:16 PM, Sage Weil wrote:
>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> I wrote a little program to show where objects are moving when a new disk is added (disk 10 below) and it looks like this:
>>>>>
>>>>>         00     01     02     03     04     05     06     07     08     09     10 
>>>>> 00:      0     14     17     14     19     23     13     22     21     20   1800 
>>>>> 01:     12      0     11     13     19     19     15     10     16     17   1841 
>>>>> 02:     17     27      0     17     15     15     13     19     18     11   1813 
>>>>> 03:     14     17     15      0     23     11     20     15     23     17   1792 
>>>>> 04:     14     18     16     25      0     27     13      8     15     16   1771 
>>>>> 05:     19     16     22     25     13      0      9     19     21     21   1813 
>>>>> 06:     18     15     21     17     10     18      0     10     18     11   1873 
>>>>> 07:     13     17     22     13     16     17     14      0     25     12   1719 
>>>>> 08:     23     20     16     17     19     18     11     12      0     18   1830 
>>>>> 09:     14     20     15     17     12     16     17     11     13      0   1828 
>>>>> 10:      0      0      0      0      0      0      0      0      0      0      0 
>>>>>
>>>>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
>>>>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 
>>>>>
>>>>>
>>>>> Each line shows how many objects moved from a given disk to the others 
>>>>> after disk 10 was added. Most objects go to the new disk and around 1% 
>>>>> go to each of the other disks. The before and after lines show how many objects 
>>>>> are mapped to each disk. They all have the same weight and it's using 
>>>>> replica 2 and straw2. Does that look right ?
>>>>
>>>> Hmm, that doesn't look right.  This is what the CRUSH.straw2_reweight unit 
>>>> test is there to validate: that data only moves to or from the device whose 
>>>> weight changed.
>>>
>>> In the above, the bucket size changes: it has a new item. And the bucket size plays a role in bucket_straw2_choose because it loops on all items. In CRUSH.straw2_reweight only the weights change. I'm not entirely sure how that would explain the results I get though...
>>>
>>>> It also follows from the straw2 algorithm itself: each possible choice 
>>>> gets a 'straw' length derived only from that item's weight (and other 
>>>> fixed factors, like the item id and the bucket id), and we select the max 
>>>> across all items.  Two devices whose weights didn't change will have the 
>>>> same straw lengths, and the max between them will not change.  It's only 
>>>> possible that the changed item's straw length changed and wasn't max and 
>>>> now is (got longer) or was max and now isn't (got shorter).
>>>
>>> That's a crystal clear explanation, cool :-)
>>>
>>> Cheers
>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 02/13/2017 03:21 PM, Sage Weil wrote:
>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro 
>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the 
>>>>>>> problem and a tentative solution which he described at : 
>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>
>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also 
>>>>>>> means that the probability of each disk needs to take into account the 
>>>>>>> weight of all disks. Which means that whenever a disk is added / removed 
>>>>>>> or its weight is changed, this has an impact on the probability of all 
>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I 
>>>>>>> mistaken ?
>>>>>>
>>>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in 
>>>>>> the normal straw2 case, adding or removing a disk also changes the 
>>>>>> probabilities for everything else (e.g., removing one out of 10 identical 
>>>>>> disks changes the probability from 1/10 to 1/9).  The key property that 
>>>>>> straw2 *is* able to handle is that as long as the relative probabilities 
>>>>>> between two unmodified disks does not change, then straw2 will avoid 
>>>>>> moving any objects between them (i.e., all data movement is to or from 
>>>>>> the disk that is reweighted).
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>> 	http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's 
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>> 	https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the 
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that 
>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is 
>>>>>>>> pretty simple, although as with most probabilities it tends to make my 
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small 
>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a 
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.  For exmaple, 
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have 
>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>
>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> This is, on the second round, we should adjust the weights by the above so 
>>>>>>>> that we get the right distribution of second choices.  It turns out it 
>>>>>>>> works to adjust *all* weights like this to get hte conditional probability 
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work 
>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
>>>>>>>> current code, you get
>>>>>>>>
>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>
>>>>>>>> which is very close for the first replica (primary), but way off for the 
>>>>>>>> second.  With my hacky change,
>>>>>>>>
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>
>>>>>>>> which is quite close, but still skewing slightly high (by a big less than 
>>>>>>>> 1%).
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>
>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>> 2- figure out why it skews high
>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>
>>>>> -- 
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 14:21   ` Sage Weil
  2017-02-13 18:50     ` Loic Dachary
@ 2017-02-16 22:04     ` Pedro López-Adeva
  2017-02-22  7:52       ` Loic Dachary
  1 sibling, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-16 22:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

I have updated the algorithm to handle an arbitrary number of replicas
and arbitrary constraints.

Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf

(Note: GitHub's rendering of the notebook and the PDF is quite
deficient; I recommend downloading/cloning.)


In the following by policy I mean the concrete set of probabilities of
selecting the first replica, the second replica, etc...
In practical terms there are several problems:

- It's not practical for a high number of disks or replicas.

Possible solution: approximate the summation over all possible disk
selections with a Monte Carlo method. The algorithm would be: start with
a candidate solution, run a simulation, and update the probabilities
based on the results. Repeat until we are happy with the result (a rough
sketch of this loop follows the list below).

Other solution: cluster similar disks together.

- Since it's a non-linear optimization problem I'm not sure right now
about its convergence properties.
Does it converge to a global optimum? How fast does it converge?

Possible solution: the algorithm always converges, but it can converge
to a locally optimal policy. I see no escape except by carefully
designing the policy. All solutions to the problem are going to be
non-linear, since we must condition current probabilities on previous
disk selections.

- Although it can handle arbitrary constraints, it does so by rejecting
disk selections that violate at least one constraint.
This means that for bad policies it can spend all its time rejecting
invalid disk selection candidates.

Possible solution: the policy cannot be designed independently of the
constraints. I don't know what constraints are typical use cases, but
having a look should be the first step. The constraints must be an
input to the policy.
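
For concreteness, here is a rough sketch of the Monte Carlo loop from the
first point (a much cruder update rule than the notebook's, just to make the
idea concrete; the weights, sample size and square-root damping are arbitrary,
and I reuse a single probability vector for every replica, which is a
simplification):

import random

weights = [99, 99, 99, 99, 4]                   # made-up disk weights
target = [w / sum(weights) for w in weights]    # desired share of objects per disk
probs = list(target)                            # candidate policy, start from the raw shares

def simulate(probs, nrep=2, samples=20000):
    counts = [0] * len(probs)
    for _ in range(samples):
        chosen = []
        while len(chosen) < nrep:
            d = random.choices(range(len(probs)), probs)[0]
            if d not in chosen:                 # constraint: no two replicas on one disk
                chosen.append(d)
                counts[d] += 1
    total = sum(counts)
    return [c / total for c in counts]

for _ in range(10):
    observed = simulate(probs)
    # nudge each probability toward its target share, then renormalize
    probs = [p * (t / o) ** 0.5 for p, t, o in zip(probs, target, observed)]
    s = sum(probs)
    probs = [p / s for p in probs]

print([round(p, 4) for p in probs])

With replica 2 the adjusted probability of the small disk ends up a bit below
its raw share, which is the direction of the adjustment discussed earlier in
the thread.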


I hope it's of some use. Quite frankly I'm not a ceph user, I just
found the problem an interesting puzzle.
Anyway I will try to have a look at the CRUSH paper this weekend.


2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
> On Mon, 13 Feb 2017, Loic Dachary wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends and Pedro
>> López-Adeva Fernández-Layos came up with a well written analysis of the
>> problem and a tentative solution which he described at :
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also
>> means that the probability of each disk needs to take into account the
>> weight of all disks. Which means that whenever a disk is added / removed
>> or its weight is changed, this has an impact on the probability of all
>> disks in the cluster and objects are likely to move everywhere. Am I
>> mistaken ?
>
> Maybe (I haven't looked closely at the above yet).  But for comparison, in
> the normal straw2 case, adding or removing a disk also changes the
> probabilities for everything else (e.g., removing one out of 10 identical
> disks changes the probability from 1/10 to 1/9).  The key property that
> straw2 *is* able to handle is that as long as the relative probabilities
> between two unmodified disks does not change, then straw2 will avoid
> moving any objects between them (i.e., all data movement is to or from
> the disk that is reweighted).
>
> sage
>
>
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>> > This is a longstanding bug,
>> >
>> >     http://tracker.ceph.com/issues/15653
>> >
>> > that causes low-weighted devices to get more data than they should. Loic's
>> > recent activity resurrected discussion on the original PR
>> >
>> >     https://github.com/ceph/ceph/pull/10218
>> >
>> > but since it's closed and almost nobody will see it I'm moving the
>> > discussion here.
>> >
>> > The main news is that I have a simple adjustment for the weights that
>> > works (almost perfectly) for the 2nd round of placements.  The solution is
>> > pretty simple, although as with most probabilities it tends to make my
>> > brain hurt.
>> >
>> > The idea is that, on the second round, the original weight for the small
>> > OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>> > P(pick small | first pick not small).  Since P(a|b) (the probability of a
>> > given b) is P(a && b) / P(b),
>> >
>> >  P(pick small | first pick not small)
>> >  = P(pick small && first pick not small) / P(first pick not small)
>> >
>> > The last term is easy to calculate,
>> >
>> >  P(first pick not small) = (total_weight - small_weight) / total_weight
>> >
>> > and the && term is the distribution we're trying to produce.  For exmaple,
>> > if small has 1/10 the weight, then we should see 1/10th of the PGs have
>> > their second replica be the small OSD.  So
>> >
>> >  P(pick small && first pick not small) = small_weight / total_weight
>> >
>> > Putting those together,
>> >
>> >  P(pick small | first pick not small)
>> >  = P(pick small && first pick not small) / P(first pick not small)
>> >  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>> >  = small_weight / (total_weight - small_weight)
>> >
>> > This is, on the second round, we should adjust the weights by the above so
>> > that we get the right distribution of second choices.  It turns out it
>> > works to adjust *all* weights like this to get hte conditional probability
>> > that they weren't already chosen.
>> >
>> > I have a branch that hacks this into straw2 and it appears to work
>> > properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>> > current code, you get
>> >
>> > $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>> > rule 0 (data), x = 0..40000000, numrep = 2..2
>> > rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>> >   device 0:             19765965        [9899364,9866601]
>> >   device 1:             19768033        [9899444,9868589]
>> >   device 2:             19769938        [9901770,9868168]
>> >   device 3:             19766918        [9898851,9868067]
>> >   device 6:             929148  [400572,528576]
>> >
>> > which is very close for the first replica (primary), but way off for the
>> > second.  With my hacky change,
>> >
>> > rule 0 (data), x = 0..40000000, numrep = 2..2
>> > rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>> >   device 0:             19797315        [9899364,9897951]
>> >   device 1:             19799199        [9899444,9899755]
>> >   device 2:             19801016        [9901770,9899246]
>> >   device 3:             19797906        [9898851,9899055]
>> >   device 6:             804566  [400572,403994]
>> >
>> > which is quite close, but still skewing slightly high (by a big less than
>> > 1%).
>> >
>> > Next steps:
>> >
>> > 1- generalize this for >2 replicas
>> > 2- figure out why it skews high
>> > 3- make this work for multi-level hierarchical descent
>> >
>> > sage
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-13 14:53   ` Gregory Farnum
@ 2017-02-20  8:47     ` Loic Dachary
  2017-02-20 17:32       ` Gregory Farnum
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-20  8:47 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel



On 02/13/2017 03:53 PM, Gregory Farnum wrote:
> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
>> Hi,
>>
>> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?
> 
> Keep in mind that in the math presented, "all disks" for our purposes
> really means "all items within a CRUSH bucket" (at least, best I can
> tell). So if you reweight a disk, you have to recalculate weights
> within its bucket and within each parent bucket, but each bucket has a
> bounded size N so the calculation should remain feasible. I didn't
> step through the more complicated math at the end but it made
> intuitive sense as far as I went.

When crush chooses the second replica it ensures it does not land on the same host, rack etc. depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf versions) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host was selected, the probabilities of each disk should not need to be modified because there will never be a rejection at that level (i.e. no conditional probability).

If the failure domain is the host I think the crush map should be something like:

root:
   host1:
     disk1
     disk2
   host2:
     disk3
     disk4
   host3:
     disk5
     disk6

Introducing racks such as in:

root:
 rack0:
   host1:
     disk1
     disk2
   host2:
     disk3
     disk4
 rack1:
   host3:
     disk5
     disk6

Is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder. Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to.

Does that make sense ?
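
To make it concrete, here is a small sketch of what I have in mind with the
first (rack-less) map above (toy python, not crush; the host probabilities
per replica are placeholders for whatever the optimization produces):

import random

hosts = {"host1": ["disk1", "disk2"],
         "host2": ["disk3", "disk4"],
         "host3": ["disk5", "disk6"]}
disk_weight = {d: 1.0 for ds in hosts.values() for d in ds}    # made-up weights

def weighted_pick(weights):
    items = list(weights)
    return random.choices(items, [weights[i] for i in items])[0]

def place(host_probs_per_replica):
    placed_hosts, result = [], []
    for probs in host_probs_per_replica:      # one adjusted distribution per replica
        h = weighted_pick(probs)
        while h in placed_hosts:              # rejection only ever happens here
            h = weighted_pick(probs)
        placed_hosts.append(h)
        # inside the accepted host: plain disk weights, nothing to adjust
        result.append(weighted_pick({d: disk_weight[d] for d in hosts[h]}))
    return result

uniform = {"host1": 1 / 3, "host2": 1 / 3, "host3": 1 / 3}     # placeholder values
print(place([uniform, uniform]))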

> -Greg
> 
>>
>> Cheers
>>
>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>> This is a longstanding bug,
>>>
>>>       http://tracker.ceph.com/issues/15653
>>>
>>> that causes low-weighted devices to get more data than they should. Loic's
>>> recent activity resurrected discussion on the original PR
>>>
>>>       https://github.com/ceph/ceph/pull/10218
>>>
>>> but since it's closed and almost nobody will see it I'm moving the
>>> discussion here.
>>>
>>> The main news is that I have a simple adjustment for the weights that
>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>> pretty simple, although as with most probabilities it tends to make my
>>> brain hurt.
>>>
>>> The idea is that, on the second round, the original weight for the small
>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>> given b) is P(a && b) / P(b),
>>>
>>>  P(pick small | first pick not small)
>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>
>>> The last term is easy to calculate,
>>>
>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>
>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>> their second replica be the small OSD.  So
>>>
>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>
>>> Putting those together,
>>>
>>>  P(pick small | first pick not small)
>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>  = small_weight / (total_weight - small_weight)
>>>
>>> This is, on the second round, we should adjust the weights by the above so
>>> that we get the right distribution of second choices.  It turns out it
>>> works to adjust *all* weights like this to get hte conditional probability
>>> that they weren't already chosen.
>>>
>>> I have a branch that hacks this into straw2 and it appears to work
>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>> current code, you get
>>>
>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>   device 0:             19765965        [9899364,9866601]
>>>   device 1:             19768033        [9899444,9868589]
>>>   device 2:             19769938        [9901770,9868168]
>>>   device 3:             19766918        [9898851,9868067]
>>>   device 6:             929148  [400572,528576]
>>>
>>> which is very close for the first replica (primary), but way off for the
>>> second.  With my hacky change,
>>>
>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>   device 0:             19797315        [9899364,9897951]
>>>   device 1:             19799199        [9899444,9899755]
>>>   device 2:             19801016        [9901770,9899246]
>>>   device 3:             19797906        [9898851,9899055]
>>>   device 6:             804566  [400572,403994]
>>>
>>> which is quite close, but still skewing slightly high (by a big less than
>>> 1%).
>>>
>>> Next steps:
>>>
>>> 1- generalize this for >2 replicas
>>> 2- figure out why it skews high
>>> 3- make this work for multi-level hierarchical descent
>>>
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-20  8:47     ` Loic Dachary
@ 2017-02-20 17:32       ` Gregory Farnum
  2017-02-20 19:31         ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Gregory Farnum @ 2017-02-20 17:32 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

On Mon, Feb 20, 2017 at 12:47 AM, Loic Dachary <loic@dachary.org> wrote:
>
>
> On 02/13/2017 03:53 PM, Gregory Farnum wrote:
>> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi,
>>>
>>> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>
>>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?
>>
>> Keep in mind that in the math presented, "all disks" for our purposes
>> really means "all items within a CRUSH bucket" (at least, best I can
>> tell). So if you reweight a disk, you have to recalculate weights
>> within its bucket and within each parent bucket, but each bucket has a
>> bounded size N so the calculation should remain feasible. I didn't
>> step through the more complicated math at the end but it made
>> intuitive sense as far as I went.
>
> When crush chooses the second replica it ensures it does not land on the same host, rack etc. depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf versions) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host was selected, the probabilities of each disk should not need to be modified because there will never be a rejection at that level (i.e. no conditional probability).

Well, you'd have changed the number of disks, so you'd need to
recalculate within the host that got a new disk added. And then you'd
need to recalculate the host and its peer buckets, and if it was in a
rack then the rack and its peer buckets, and on up the chain.
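
Something like this toy walk is what I have in mind (a made-up tree, nothing
to do with the actual crush structures; recompute() stands in for whatever
per-bucket adjustment ends up being used):

parent = {"disk7": "host2", "host2": "rack0", "rack0": "root", "root": None}
children = {"host2": ["disk3", "disk4", "disk7"],
            "rack0": ["host1", "host2"],
            "root": ["rack0", "rack1"]}

def recompute(bucket):
    # recompute the adjusted weights of the changed child and its peers
    print("recompute", children[bucket], "inside", bucket)

def on_weight_change(item):
    b = parent[item]
    while b is not None:                     # host, then rack, then root
        recompute(b)
        b = parent[b]

on_weight_change("disk7")                    # e.g. disk7 just added to host2

Each bucket on that path has a bounded number of children, so the work stays
bounded.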

>
> If the failure domain is the host I think the crush map should be something like:
>
> root:
>    host1:
>      disk1
>      disk2
>    host2:
>      disk3
>      disk4
>    host3:
>      disk5
>      disk6
>
> Introducing racks such as in:
>
> root:
>  rack0:
>    host1:
>      disk1
>      disk2
>    host2:
>      disk3
>      disk4
>  rack1:
>    host3:
>      disk5
>      disk6
>
> Is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder.

Well, there's not much point if you're replicating across hosts, since
the rack layer is very unbalanced here. But that's essentially a
misconfiguration which is going to cause problems with any CRUSH-like
system.


> Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to.

I think maybe you're saying what I did before? "All disks" for our
purposes really means "all items within a CRUSH bucket". The racks are
CRUSH items within the root bucket.
-Greg

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-20 17:32       ` Gregory Farnum
@ 2017-02-20 19:31         ` Loic Dachary
  0 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-02-20 19:31 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel



On 02/20/2017 06:32 PM, Gregory Farnum wrote:
> On Mon, Feb 20, 2017 at 12:47 AM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>> On 02/13/2017 03:53 PM, Gregory Farnum wrote:
>>> On Mon, Feb 13, 2017 at 2:36 AM, Loic Dachary <loic@dachary.org> wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends and Pedro López-Adeva Fernández-Layos came up with a well written analysis of the problem and a tentative solution which he described at : https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also means that the probability of each disk needs to take into account the weight of all disks. Which means that whenever a disk is added / removed or its weight is changed, this has an impact on the probability of all disks in the cluster and objects are likely to move everywhere. Am I mistaken ?
>>>
>>> Keep in mind that in the math presented, "all disks" for our purposes
>>> really means "all items within a CRUSH bucket" (at least, best I can
>>> tell). So if you reweight a disk, you have to recalculate weights
>>> within its bucket and within each parent bucket, but each bucket has a
>>> bounded size N so the calculation should remain feasible. I didn't
>>> step through the more complicated math at the end but it made
>>> intuitive sense as far as I went.
>>
>> When crush chooses the second replica it ensures it does not land on the same host, rack etc. depending on the step CHOOSE* argument of the rule. When looking for the best weights (in the updated https://github.com/plafl/notebooks/blob/master/converted/replication.pdf versions) I think we would focus on the host weights (assuming the failure domain is the host) and not the disk weights. When drawing disks after the host was selected, the probabilities of each disk should not need to be modified because there will never be a rejection at that level (i.e. no conditional probability).
> 
> Well, you'd have changed the number of disks, so you'd need to
> recalculate within the host that got a new disk added. And then you'd
> need to recalculate the host and its peer buckets, and if it was in a
> rack then the rack and its peer buckets, and on up the chain.

I meant to say that you do not need to change the weight of the disks within other hosts. But you need to change the weight of all other hosts, not just the host in which a new disk was inserted/removed.

> 
>>
>> If the failure domain is the host I think the crush map should be something like:
>>
>> root:
>>    host1:
>>      disk1
>>      disk2
>>    host2:
>>      disk3
>>      disk4
>>    host3:
>>      disk5
>>      disk6
>>
>> Introducing racks such as in:
>>
>> root:
>>  rack0:
>>    host1:
>>      disk1
>>      disk2
>>    host2:
>>      disk3
>>      disk4
>>  rack1:
>>    host3:
>>      disk5
>>      disk6
>>
>> Is going to complicate the problem further, for no good reason other than a pretty display / architecture reminder.
> 
> Well, there's not much point if you're replicating across hosts, since
> the rack layer is very unbalanced here. But that's essentially a
> misconfiguration which is going to cause problems with any CRUSH-like
> system.
> 
> 
>> Since rejecting a second replica on host3 means it will land in rack0 instead of rack1, I think the probability distribution of the racks will need to be adjusted in the same way the probability distribution of the failure domain buckets needs to.
> 
> I think maybe you're saying what I did before? "All disks" for our
> purposes really means "all items within a CRUSH bucket". The racks are
> CRUSH items within the root bucket.
> -Greg
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-16 22:04     ` Pedro López-Adeva
@ 2017-02-22  7:52       ` Loic Dachary
  2017-02-22 11:26         ` Pedro López-Adeva
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-22  7:52 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
> I have updated the algorithm to handle an arbitrary number of replicas
> and arbitrary constraints.
> 
> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf

I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words. 

You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk; there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].

You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).

This is part one of your document, and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show, with a few examples, that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution, but less than 0.001 away from it.

[1] https://en.wikipedia.org/wiki/Loss_function
[2] https://en.wikipedia.org/wiki/Gradient
[3] https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant
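To illustrate the mechanics in miniature, here is my own toy sketch for 2 replicas and a handful of disks. It is not Pedro's formulation or code: it uses a single probability vector for both picks (the notebook has a per-pick policy), a plain quadratic loss, and a finite-difference gradient instead of the analytic Jacobian.

import numpy as np

def expected_load(p):
    """Expected replicas per disk when two distinct disks are drawn
    without replacement using per-pick probabilities p."""
    n = len(p)
    load = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            pair = p[i] * p[j] / (1.0 - p[i])  # P(first=i) * P(second=j | first=i)
            load[i] += pair
            load[j] += pair
    return load

def loss(p, weights):
    """Squared distance between the achieved load and the target load
    (each disk filled in proportion to its weight, 2 replicas total)."""
    target = 2.0 * weights / weights.sum()
    return ((expected_load(p) - target) ** 2).sum()

def optimize(weights, steps=2000, lr=0.05, eps=1e-6):
    """Plain gradient descent using a numerical gradient of the loss."""
    p = weights / weights.sum()            # start from the naive probabilities
    for _ in range(steps):
        grad = np.zeros(len(p))
        for k in range(len(p)):
            d = np.zeros(len(p))
            d[k] = eps
            grad[k] = (loss(p + d, weights) - loss(p - d, weights)) / (2 * eps)
        p = np.clip(p - lr * grad, 1e-9, None)
        p /= p.sum()                       # stay on the probability simplex
    return p

weights = np.array([99.0, 99.0, 99.0, 99.0, 4.0])
p = optimize(weights)
print(p)                 # the small disk's probability should drop below its naive 4/400 = 0.01
print(expected_load(p))  # should end up close to 2 * weights / weights.sum()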

From the above you can hopefully see how far off my understanding is. And I have one question below.

> (Note: GitHub's renderization of the notebook and the PDF is quite
> deficient, I recommend downloading/cloning)
> 
> 
> In the following by policy I mean the concrete set of probabilities of
> selecting the first replica, the second replica, etc...
> In practical terms there are several problems:
> 
> - It's not practical for a high number of disks or replicas.
> 
> Possible solution: approximate summation over all possible disk
> selections with a Monte Carlo method.
> the algorithm would be: we start with a candidate solution, we run a
> simulation and based on the results
> we update the probabilities. Repeat until we are happy with the result.
> 
> Other solution: cluster similar disks together.
> 
> - Since it's a non-linear optimization problem I'm not sure right now
> about it's convergence properties.
> Does it converge to a global optimum? How fast does it converge?
> 
> Possible solution: the algorithm always converges, but it can converge
> to a locally optimum policy. I see
> no escape except by carefully designing the policy. All solutions to
> the problem are going to be non linear
> since we must condition current probabilities on previous disk selections.
> 
> - Although it can handle arbitrary constraints it does so by rejecting
> disks selections that violate at least one constraint.
> This means that for bad policies it can spend all the time rejecting
> invalid disks selection candidates.
> 
> Possible solution: the policy cannot be designed independently of the
> constraints. I don't know what constraints
> are typical use cases but having a look should be the first step. The
> constraints must be an input to the policy.
> 
> 
> I hope it's of some use. Quite frankly I'm not a ceph user, I just
> found the problem an interesting puzzle.
> Anyway I will try to have a look at the CRUSH paper this weekend.

In Sage's paper[1], as well as in the Ceph implementation[2], minimizing data movement when a disk is added or removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks, and higher probabilities for bigger disks is used (a much simplified sketch follows the links below). 

[1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
[2] https://github.com/ceph/ceph/tree/master/src/crush
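For readers who have not looked at the code, here is a sketch of the "hashing plus higher probability for bigger disks" part: one flat bucket, no hierarchy and no retry logic, so it only approximates straw2 and is not the real implementation.

import hashlib
import math

def straw2_pick(obj, replica, disks, exclude=()):
    """Pick one disk for (obj, replica): every disk computes ln(u) / weight,
    where u is a uniform value hashed from (object, replica, disk), and the
    largest value wins.  Disks in `exclude` are skipped so that two replicas
    of the same object never share a disk."""
    best, best_straw = None, None
    for name, weight in disks.items():
        if name in exclude or weight <= 0:
            continue
        h = int(hashlib.sha1(f"{obj}:{replica}:{name}".encode()).hexdigest(), 16)
        u = (h % 2**32 + 1) / (2**32 + 2)     # deterministic uniform in (0, 1)
        straw = math.log(u) / weight           # straw2-style draw
        if best_straw is None or straw > best_straw:
            best, best_straw = name, straw
    return best

disks = {"disk1": 1.0, "disk2": 1.0, "disk3": 2.0}    # hypothetical weights
first = straw2_pick("obj42", 0, disks)
second = straw2_pick("obj42", 1, disks, exclude={first})
print(first, second)

Even in this toy you can see the property Sage describes below: changing one disk's weight only changes that disk's straws, so the relative order of the other disks is untouched and data only moves to or from the reweighted disk.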

Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.

        00     01     02     03     04     05     06     07     08     09     10 
00:      0     14     17     14     19     23     13     22     21     20   1800 
01:     12      0     11     13     19     19     15     10     16     17   1841 
02:     17     27      0     17     15     15     13     19     18     11   1813 
03:     14     17     15      0     23     11     20     15     23     17   1792 
04:     14     18     16     25      0     27     13      8     15     16   1771 
05:     19     16     22     25     13      0      9     19     21     21   1813 
06:     18     15     21     17     10     18      0     10     18     11   1873 
07:     13     17     22     13     16     17     14      0     25     12   1719 
08:     23     20     16     17     19     18     11     12      0     18   1830 
09:     14     20     15     17     12     16     17     11     13      0   1828 
10:      0      0      0      0      0      0      0      0      0      0      0 
before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0 
after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080 

About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need (a sketch of how such a matrix can be computed follows the link below).

[1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
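Here is a rough sketch of how such a matrix can be computed. The placement function is a stand-in (a rendezvous-style ranking by hash), not the crush code used in compare.c; the point is the before/after accounting.

import hashlib
from collections import Counter

def place(obj, disks, replicas=2):
    """Stand-in placement (not crush): rank disks by a per-(object, disk)
    hash and keep the first `replicas`.  Existing disks keep their hashes
    when a new disk is added, so most objects stay where they are."""
    ranked = sorted(disks, key=lambda d: hashlib.sha1(f"{obj}:{d}".encode()).hexdigest())
    return ranked[:replicas]

def movement_matrix(objects, before, after):
    """Count, per (from, to) pair of disks, how many replicas moved when
    the disk list changed from `before` to `after`."""
    moves = Counter()
    for obj in objects:
        old, new = set(place(obj, before)), set(place(obj, after))
        for src in old - new:
            for dst in new - old:
                moves[(src, dst)] += 1
    return moves

before = ["%02d" % i for i in range(10)]     # 10 equal disks...
after = before + ["10"]                      # ...plus one new disk
moves = movement_matrix(range(100000), before, after)
to_new = sum(n for (src, dst), n in moves.items() if dst == "10")
print(sum(moves.values()), "replicas moved in total,", to_new, "of them to the new disk")

You should see roughly 18,000 moves, i.e. about 1,800 from each existing disk, in the same ballpark as the last column above. With this idealized stand-in every move goes to the new disk; the small amount of movement between existing disks shown in the table is something the real crush mapping adds and this toy does not reproduce.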

Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does)?

Cheers

> 
> 
> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>> Hi,
>>>
>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>> problem and a tentative solution which he described at :
>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>
>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>> means that the probability of each disk needs to take in account the
>>> weight of all disks. Which means that whenever a disk is added / removed
>>> or its weight is changed, this has an impact on the probability of all
>>> disks in the cluster and objects are likely to move everywhere. Am I
>>> mistaken ?
>>
>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>> the normal straw2 case, adding or removing a disk also changes the
>> probabilities for everything else (e.g., removing one out of 10 identical
>> disks changes the probability from 1/10 to 1/9).  The key property that
>> straw2 *is* able to handle is that as long as the relative probabilities
>> between two unmodified disks does not change, then straw2 will avoid
>> moving any objects between them (i.e., all data movement is to or from
>> the disk that is reweighted).
>>
>> sage
>>
>>
>>>
>>> Cheers
>>>
>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>> This is a longstanding bug,
>>>>
>>>>     http://tracker.ceph.com/issues/15653
>>>>
>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>> recent activity resurrected discussion on the original PR
>>>>
>>>>     https://github.com/ceph/ceph/pull/10218
>>>>
>>>> but since it's closed and almost nobody will see it I'm moving the
>>>> discussion here.
>>>>
>>>> The main news is that I have a simple adjustment for the weights that
>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>> pretty simple, although as with most probabilities it tends to make my
>>>> brain hurt.
>>>>
>>>> The idea is that, on the second round, the original weight for the small
>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>> given b) is P(a && b) / P(b),
>>>>
>>>>  P(pick small | first pick not small)
>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>
>>>> The last term is easy to calculate,
>>>>
>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>
>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>> their second replica be the small OSD.  So
>>>>
>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>
>>>> Putting those together,
>>>>
>>>>  P(pick small | first pick not small)
>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>  = small_weight / (total_weight - small_weight)
>>>>
>>>> This is, on the second round, we should adjust the weights by the above so
>>>> that we get the right distribution of second choices.  It turns out it
>>>> works to adjust *all* weights like this to get hte conditional probability
>>>> that they weren't already chosen.
>>>>
>>>> I have a branch that hacks this into straw2 and it appears to work
>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>> current code, you get
>>>>
>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>   device 0:             19765965        [9899364,9866601]
>>>>   device 1:             19768033        [9899444,9868589]
>>>>   device 2:             19769938        [9901770,9868168]
>>>>   device 3:             19766918        [9898851,9868067]
>>>>   device 6:             929148  [400572,528576]
>>>>
>>>> which is very close for the first replica (primary), but way off for the
>>>> second.  With my hacky change,
>>>>
>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>   device 0:             19797315        [9899364,9897951]
>>>>   device 1:             19799199        [9899444,9899755]
>>>>   device 2:             19801016        [9901770,9899246]
>>>>   device 3:             19797906        [9898851,9899055]
>>>>   device 6:             804566  [400572,403994]
>>>>
>>>> which is quite close, but still skewing slightly high (by a big less than
>>>> 1%).
>>>>
>>>> Next steps:
>>>>
>>>> 1- generalize this for >2 replicas
>>>> 2- figure out why it skews high
>>>> 3- make this work for multi-level hierarchical descent
>>>>
>>>> sage
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-22  7:52       ` Loic Dachary
@ 2017-02-22 11:26         ` Pedro López-Adeva
  2017-02-22 11:38           ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-22 11:26 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi,

I think your description of my proposed solution is quite good.

I had a first look at Sage's paper but not ceph's implementation. My
plan is to finish reading the paper and make an implementation in
python that mimics ceph's algorithm more closely.

Regarding your question about data movement:

If I understood the paper correctly, what is happening right now is
that when weights change on the devices some of them will become
overloaded and the current algorithm will try to correct for that. But
this approach, I think, is independent of how we compute the weights
for each device. My point is that the current data movement pattern
will not be modified.

Could the data movement algorithm be improved? Maybe. I don't know.
Maybe by making the probabilities non-stationary, with the new disk
getting a very high probability at first and then decreasing it after
each replica placement until it stabilizes to its final value. But I'm
just guessing and I really don't know if this can be made to work in a
distributed manner, as is currently the case, nor how it would fit in
the current architecture. In any case it would be a problem at least
as hard as the current reweighting problem.

So, to summarize, my current plans:

- Have another look at the paper
- Make an implementation in python that imitates the current
algorithm more closely
- Make sure the new reweighting algorithm is fast and gives the desired results

I will give updates here when there are significant changes so
everyone can have a look and suggest improvements.

Cheers,
Pedro.

2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>> I have updated the algorithm to handle an arbitrary number of replicas
>> and arbitrary constraints.
>>
>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>
> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>
> You wrote a family of functions describing the desired outcome: equally filled disks when distributing objects replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk, there are two many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>
> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>
> This is part one of your document and in part two you focus on one constraints: no two replica on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less that 0.001 away from it.
>
> [1] https://en.wikipedia.org/wiki/Loss_function
> [2] https://en.wikipedia.org/wiki/Gradient
> [3] https://en.wikipedia.org/wiki/Jacobi_elliptic_functions#Jacobi_elliptic_functions_as_solutions_of_nonlinear_ordinary_differential_equations
>
> From the above you can hopefully see how far off my understanding is. And I have one question below.
>
>> (Note: GitHub's renderization of the notebook and the PDF is quite
>> deficient, I recommend downloading/cloning)
>>
>>
>> In the following by policy I mean the concrete set of probabilities of
>> selecting the first replica, the second replica, etc...
>> In practical terms there are several problems:
>>
>> - It's not practical for a high number of disks or replicas.
>>
>> Possible solution: approximate summation over all possible disk
>> selections with a Monte Carlo method.
>> the algorithm would be: we start with a candidate solution, we run a
>> simulation and based on the results
>> we update the probabilities. Repeat until we are happy with the result.
>>
>> Other solution: cluster similar disks together.
>>
>> - Since it's a non-linear optimization problem I'm not sure right now
>> about it's convergence properties.
>> Does it converge to a global optimum? How fast does it converge?
>>
>> Possible solution: the algorithm always converges, but it can converge
>> to a locally optimum policy. I see
>> no escape except by carefully designing the policy. All solutions to
>> the problem are going to be non linear
>> since we must condition current probabilities on previous disk selections.
>>
>> - Although it can handle arbitrary constraints it does so by rejecting
>> disks selections that violate at least one constraint.
>> This means that for bad policies it can spend all the time rejecting
>> invalid disks selection candidates.
>>
>> Possible solution: the policy cannot be designed independently of the
>> constraints. I don't know what constraints
>> are typical use cases but having a look should be the first step. The
>> constraints must be an input to the policy.
>>
>>
>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>> found the problem an interesting puzzle.
>> Anyway I will try to have a look at the CRUSH paper this weekend.
>
> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>
> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
> [2] https://github.com/ceph/ceph/tree/master/src/crush
>
> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>
>         00     01     02     03     04     05     06     07     08     09     10
> 00:      0     14     17     14     19     23     13     22     21     20   1800
> 01:     12      0     11     13     19     19     15     10     16     17   1841
> 02:     17     27      0     17     15     15     13     19     18     11   1813
> 03:     14     17     15      0     23     11     20     15     23     17   1792
> 04:     14     18     16     25      0     27     13      8     15     16   1771
> 05:     19     16     22     25     13      0      9     19     21     21   1813
> 06:     18     15     21     17     10     18      0     10     18     11   1873
> 07:     13     17     22     13     16     17     14      0     25     12   1719
> 08:     23     20     16     17     19     18     11     12      0     18   1830
> 09:     14     20     15     17     12     16     17     11     13      0   1828
> 10:      0      0      0      0      0      0      0      0      0      0      0
> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080
>
> About 1% of the data movement happens between existing disks and serve no useful purpose but the rest are objects moving from existing disks to the new one which is what we need.
>
> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>
> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>
> Cheers
>
>>
>>
>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>> Hi,
>>>>
>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>> problem and a tentative solution which he described at :
>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>> means that the probability of each disk needs to take in account the
>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>> or its weight is changed, this has an impact on the probability of all
>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>> mistaken ?
>>>
>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>>> the normal straw2 case, adding or removing a disk also changes the
>>> probabilities for everything else (e.g., removing one out of 10 identical
>>> disks changes the probability from 1/10 to 1/9).  The key property that
>>> straw2 *is* able to handle is that as long as the relative probabilities
>>> between two unmodified disks does not change, then straw2 will avoid
>>> moving any objects between them (i.e., all data movement is to or from
>>> the disk that is reweighted).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Cheers
>>>>
>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>> This is a longstanding bug,
>>>>>
>>>>>     http://tracker.ceph.com/issues/15653
>>>>>
>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>> recent activity resurrected discussion on the original PR
>>>>>
>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>
>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>> discussion here.
>>>>>
>>>>> The main news is that I have a simple adjustment for the weights that
>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>> brain hurt.
>>>>>
>>>>> The idea is that, on the second round, the original weight for the small
>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>> given b) is P(a && b) / P(b),
>>>>>
>>>>>  P(pick small | first pick not small)
>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>
>>>>> The last term is easy to calculate,
>>>>>
>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>
>>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>> their second replica be the small OSD.  So
>>>>>
>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>
>>>>> Putting those together,
>>>>>
>>>>>  P(pick small | first pick not small)
>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>  = small_weight / (total_weight - small_weight)
>>>>>
>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>> that we get the right distribution of second choices.  It turns out it
>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>> that they weren't already chosen.
>>>>>
>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>> current code, you get
>>>>>
>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>   device 6:             929148  [400572,528576]
>>>>>
>>>>> which is very close for the first replica (primary), but way off for the
>>>>> second.  With my hacky change,
>>>>>
>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>   device 6:             804566  [400572,403994]
>>>>>
>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>> 1%).
>>>>>
>>>>> Next steps:
>>>>>
>>>>> 1- generalize this for >2 replicas
>>>>> 2- figure out why it skews high
>>>>> 3- make this work for multi-level hierarchical descent
>>>>>
>>>>> sage
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-22 11:26         ` Pedro López-Adeva
@ 2017-02-22 11:38           ` Loic Dachary
  2017-02-22 11:46             ` Pedro López-Adeva
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-22 11:38 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel



On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
> Hi,
> 
> I think your description of my proposed solution is quite good.
> 
> I had a first look at Sage's paper but not ceph's implementation. My
> plan is to finish the paper and make an implementation in python that
> mimicks more closely ceph's algorithm.
> 
> Regarding your question about data movement:
> 
> If I understood the paper correctly what is happening right now is
> that when weights change on the devices some of them will become
> overloaded and the current algorithm will try to correct for that but
> this approach, I think, is independent of how we compute the weights
> for each device. My point is that the current data movement pattern
> will not be modified.
> 
> Could the data movement algorithm be improved? Maybe. I don't know.
> Maybe by making the probabilities non-stationary with the new disk
> getting at first very high probability and after each replica
> placement decrease it until it stabilizes to it's final value. But I'm
> just guessing and I really don't know if this can be made to work in a
> distributed manner as is currently the case and how would this fit in
> the current architecture. In any case it would be a problem as hard at
> least as the current reweighting problem.
> 
> So, to summarize, my current plans:
> 
> - Have another look at the paper
> - Make an implementation in python that imitates more closely the
> current algorithm

What if I provide you with a python module that wraps the current crush implementation (the C library) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.

> - Make sure the new reweighting algorithm is fast and gives the desired results
> 
> I will give updates here when there are significant changes so
> everyone can have a look and suggest improvements.
> 
> Cheers,
> Pedro.
> 
> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>> I have updated the algorithm to handle an arbitrary number of replicas
>>> and arbitrary constraints.
>>>
>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>
>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>
>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing objects replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk, there are two many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>
>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>
>> This is part one of your document and in part two you focus on one constraints: no two replica on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less that 0.001 away from it.
>>
>> [1] https://en.wikipedia.org/wiki/Loss_function
>> [2] https://en.wikipedia.org/wiki/Gradient
>> [3] https://en.wikipedia.org/wiki/Jacobi_elliptic_functions#Jacobi_elliptic_functions_as_solutions_of_nonlinear_ordinary_differential_equations
>>
>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>
>>> (Note: GitHub's renderization of the notebook and the PDF is quite
>>> deficient, I recommend downloading/cloning)
>>>
>>>
>>> In the following by policy I mean the concrete set of probabilities of
>>> selecting the first replica, the second replica, etc...
>>> In practical terms there are several problems:
>>>
>>> - It's not practical for a high number of disks or replicas.
>>>
>>> Possible solution: approximate summation over all possible disk
>>> selections with a Monte Carlo method.
>>> the algorithm would be: we start with a candidate solution, we run a
>>> simulation and based on the results
>>> we update the probabilities. Repeat until we are happy with the result.
>>>
>>> Other solution: cluster similar disks together.
>>>
>>> - Since it's a non-linear optimization problem I'm not sure right now
>>> about it's convergence properties.
>>> Does it converge to a global optimum? How fast does it converge?
>>>
>>> Possible solution: the algorithm always converges, but it can converge
>>> to a locally optimum policy. I see
>>> no escape except by carefully designing the policy. All solutions to
>>> the problem are going to be non linear
>>> since we must condition current probabilities on previous disk selections.
>>>
>>> - Although it can handle arbitrary constraints it does so by rejecting
>>> disks selections that violate at least one constraint.
>>> This means that for bad policies it can spend all the time rejecting
>>> invalid disks selection candidates.
>>>
>>> Possible solution: the policy cannot be designed independently of the
>>> constraints. I don't know what constraints
>>> are typical use cases but having a look should be the first step. The
>>> constraints must be an input to the policy.
>>>
>>>
>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>> found the problem an interesting puzzle.
>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>
>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>
>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>
>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>
>>         00     01     02     03     04     05     06     07     08     09     10
>> 00:      0     14     17     14     19     23     13     22     21     20   1800
>> 01:     12      0     11     13     19     19     15     10     16     17   1841
>> 02:     17     27      0     17     15     15     13     19     18     11   1813
>> 03:     14     17     15      0     23     11     20     15     23     17   1792
>> 04:     14     18     16     25      0     27     13      8     15     16   1771
>> 05:     19     16     22     25     13      0      9     19     21     21   1813
>> 06:     18     15     21     17     10     18      0     10     18     11   1873
>> 07:     13     17     22     13     16     17     14      0     25     12   1719
>> 08:     23     20     16     17     19     18     11     12      0     18   1830
>> 09:     14     20     15     17     12     16     17     11     13      0   1828
>> 10:      0      0      0      0      0      0      0      0      0      0      0
>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080
>>
>> About 1% of the data movement happens between existing disks and serve no useful purpose but the rest are objects moving from existing disks to the new one which is what we need.
>>
>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>
>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>
>> Cheers
>>
>>>
>>>
>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>> Hi,
>>>>>
>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>> problem and a tentative solution which he described at :
>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>
>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>> means that the probability of each disk needs to take in account the
>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>> or its weight is changed, this has an impact on the probability of all
>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>> mistaken ?
>>>>
>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>>>> the normal straw2 case, adding or removing a disk also changes the
>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>> disks changes the probability from 1/10 to 1/9).  The key property that
>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>> between two unmodified disks does not change, then straw2 will avoid
>>>> moving any objects between them (i.e., all data movement is to or from
>>>> the disk that is reweighted).
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>> This is a longstanding bug,
>>>>>>
>>>>>>     http://tracker.ceph.com/issues/15653
>>>>>>
>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>> recent activity resurrected discussion on the original PR
>>>>>>
>>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>>
>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>> discussion here.
>>>>>>
>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>> brain hurt.
>>>>>>
>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>> given b) is P(a && b) / P(b),
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>
>>>>>> The last term is easy to calculate,
>>>>>>
>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>
>>>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>> their second replica be the small OSD.  So
>>>>>>
>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>
>>>>>> Putting those together,
>>>>>>
>>>>>>  P(pick small | first pick not small)
>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>
>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>> that they weren't already chosen.
>>>>>>
>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>> current code, you get
>>>>>>
>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>   device 6:             929148  [400572,528576]
>>>>>>
>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>> second.  With my hacky change,
>>>>>>
>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>   device 6:             804566  [400572,403994]
>>>>>>
>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>> 1%).
>>>>>>
>>>>>> Next steps:
>>>>>>
>>>>>> 1- generalize this for >2 replicas
>>>>>> 2- figure out why it skews high
>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>
>>>>>> sage
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-22 11:38           ` Loic Dachary
@ 2017-02-22 11:46             ` Pedro López-Adeva
  2017-02-25  0:38               ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-22 11:46 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

That, for validation, would be great. I don't think I'm going to have
time to work on this before the weekend anyway.

2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>
>
> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>> Hi,
>>
>> I think your description of my proposed solution is quite good.
>>
>> I had a first look at Sage's paper but not ceph's implementation. My
>> plan is to finish the paper and make an implementation in python that
>> mimicks more closely ceph's algorithm.
>>
>> Regarding your question about data movement:
>>
>> If I understood the paper correctly what is happening right now is
>> that when weights change on the devices some of them will become
>> overloaded and the current algorithm will try to correct for that but
>> this approach, I think, is independent of how we compute the weights
>> for each device. My point is that the current data movement pattern
>> will not be modified.
>>
>> Could the data movement algorithm be improved? Maybe. I don't know.
>> Maybe by making the probabilities non-stationary with the new disk
>> getting at first very high probability and after each replica
>> placement decrease it until it stabilizes to it's final value. But I'm
>> just guessing and I really don't know if this can be made to work in a
>> distributed manner as is currently the case and how would this fit in
>> the current architecture. In any case it would be a problem as hard at
>> least as the current reweighting problem.
>>
>> So, to summarize, my current plans:
>>
>> - Have another look at the paper
>> - Make an implementation in python that imitates more closely the
>> current algorithm
>
> What about I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to ? I think it would be generaly useful for experimenting and worth the effort. I can have that ready this weekend.
>
>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>
>> I will give updates here when there are significant changes so
>> everyone can have a look and suggest improvements.
>>
>> Cheers,
>> Pedro.
>>
>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> Hi Pedro,
>>>
>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>> and arbitrary constraints.
>>>>
>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>
>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>
>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing objects replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk, there are two many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>
>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>
>>> This is part one of your document and in part two you focus on one constraints: no two replica on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less that 0.001 away from it.
>>>
>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>> [2] https://en.wikipedia.org/wiki/Gradient
>>> [3] https://en.wikipedia.org/wiki/Jacobi_elliptic_functions#Jacobi_elliptic_functions_as_solutions_of_nonlinear_ordinary_differential_equations
>>>
>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>
>>>> (Note: GitHub's renderization of the notebook and the PDF is quite
>>>> deficient, I recommend downloading/cloning)
>>>>
>>>>
>>>> In the following by policy I mean the concrete set of probabilities of
>>>> selecting the first replica, the second replica, etc...
>>>> In practical terms there are several problems:
>>>>
>>>> - It's not practical for a high number of disks or replicas.
>>>>
>>>> Possible solution: approximate summation over all possible disk
>>>> selections with a Monte Carlo method.
>>>> the algorithm would be: we start with a candidate solution, we run a
>>>> simulation and based on the results
>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>
>>>> Other solution: cluster similar disks together.
>>>>
>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>> about it's convergence properties.
>>>> Does it converge to a global optimum? How fast does it converge?
>>>>
>>>> Possible solution: the algorithm always converges, but it can converge
>>>> to a locally optimum policy. I see
>>>> no escape except by carefully designing the policy. All solutions to
>>>> the problem are going to be non linear
>>>> since we must condition current probabilities on previous disk selections.
>>>>
>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>> disks selections that violate at least one constraint.
>>>> This means that for bad policies it can spend all the time rejecting
>>>> invalid disks selection candidates.
>>>>
>>>> Possible solution: the policy cannot be designed independently of the
>>>> constraints. I don't know what constraints
>>>> are typical use cases but having a look should be the first step. The
>>>> constraints must be an input to the policy.
>>>>
>>>>
>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>> found the problem an interesting puzzle.
>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>
>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>
>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>
>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>
>>>         00     01     02     03     04     05     06     07     08     09     10
>>> 00:      0     14     17     14     19     23     13     22     21     20   1800
>>> 01:     12      0     11     13     19     19     15     10     16     17   1841
>>> 02:     17     27      0     17     15     15     13     19     18     11   1813
>>> 03:     14     17     15      0     23     11     20     15     23     17   1792
>>> 04:     14     18     16     25      0     27     13      8     15     16   1771
>>> 05:     19     16     22     25     13      0      9     19     21     21   1813
>>> 06:     18     15     21     17     10     18      0     10     18     11   1873
>>> 07:     13     17     22     13     16     17     14      0     25     12   1719
>>> 08:     23     20     16     17     19     18     11     12      0     18   1830
>>> 09:     14     20     15     17     12     16     17     11     13      0   1828
>>> 10:      0      0      0      0      0      0      0      0      0      0      0
>>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
>>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080
>>>
>>> About 1% of the data movement happens between existing disks and serve no useful purpose but the rest are objects moving from existing disks to the new one which is what we need.
>>>
>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>
>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>
>>> Cheers
>>>
>>>>
>>>>
>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>> problem and a tentative solution which he described at :
>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>
>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>> means that the probability of each disk needs to take in account the
>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>> mistaken ?
>>>>>
>>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>> disks changes the probability from 1/10 to 1/9).  The key property that
>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>> between two unmodified disks does not change, then straw2 will avoid
>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>> the disk that is reweighted).
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>> This is a longstanding bug,
>>>>>>>
>>>>>>>     http://tracker.ceph.com/issues/15653
>>>>>>>
>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>
>>>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>>>
>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>> discussion here.
>>>>>>>
>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>> brain hurt.
>>>>>>>
>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>>  P(pick small | first pick not small)
>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>
>>>>>>> The last term is easy to calculate,
>>>>>>>
>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>
>>>>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>> their second replica be the small OSD.  So
>>>>>>>
>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>
>>>>>>> Putting those together,
>>>>>>>
>>>>>>>  P(pick small | first pick not small)
>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>
>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>> that they weren't already chosen.
>>>>>>>
>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>> current code, you get
>>>>>>>
>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>
>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>> second.  With my hacky change,
>>>>>>>
>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>
>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>> 1%).
>>>>>>>
>>>>>>> Next steps:
>>>>>>>
>>>>>>> 1- generalize this for >2 replicas
>>>>>>> 2- figure out why it skews high
>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-22 11:46             ` Pedro López-Adeva
@ 2017-02-25  0:38               ` Loic Dachary
  2017-02-25  8:41                 ` Pedro López-Adeva
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-25  0:38 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
> That, for validation, would be great. Until weekend I don't think I'm
> going to have time to work on this anyway.

An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.

Cheers

> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>
>>
>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>> Hi,
>>>
>>> I think your description of my proposed solution is quite good.
>>>
>>> I had a first look at Sage's paper but not ceph's implementation. My
>>> plan is to finish the paper and make an implementation in python that
>>> mimicks more closely ceph's algorithm.
>>>
>>> Regarding your question about data movement:
>>>
>>> If I understood the paper correctly what is happening right now is
>>> that when weights change on the devices some of them will become
>>> overloaded and the current algorithm will try to correct for that but
>>> this approach, I think, is independent of how we compute the weights
>>> for each device. My point is that the current data movement pattern
>>> will not be modified.
>>>
>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>> Maybe by making the probabilities non-stationary with the new disk
>>> getting at first very high probability and after each replica
>>> placement decrease it until it stabilizes to it's final value. But I'm
>>> just guessing and I really don't know if this can be made to work in a
>>> distributed manner as is currently the case and how would this fit in
>>> the current architecture. In any case it would be a problem as hard at
>>> least as the current reweighting problem.
>>>
>>> So, to summarize, my current plans:
>>>
>>> - Have another look at the paper
>>> - Make an implementation in python that imitates more closely the
>>> current algorithm
>>
>> How about I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to? I think it would be generally useful for experimenting and worth the effort. I can have that ready this weekend.
>>
>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>
>>> I will give updates here when there are significant changes so
>>> everyone can have a look and suggest improvements.
>>>
>>> Cheers,
>>> Pedro.
>>>
>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>> and arbitrary constraints.
>>>>>
>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>
>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me, I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>
>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing object replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk: there are too many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>
>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>
>>>> This is part one of your document, and in part two you focus on one constraint: no two replicas on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less than 0.001 away from it.
>>>>
>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>> [3] https://en.wikipedia.org/wiki/Jacobi_elliptic_functions#Jacobi_elliptic_functions_as_solutions_of_nonlinear_ordinary_differential_equations
>>>>
>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>
>>>>> (Note: GitHub's rendering of the notebook and the PDF is quite
>>>>> deficient, I recommend downloading/cloning)
>>>>>
>>>>>
>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>> selecting the first replica, the second replica, etc...
>>>>> In practical terms there are several problems:
>>>>>
>>>>> - It's not practical for a high number of disks or replicas.
>>>>>
>>>>> Possible solution: approximate summation over all possible disk
>>>>> selections with a Monte Carlo method.
>>>>> The algorithm would be: we start with a candidate solution, we run a
>>>>> simulation and based on the results
>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>
>>>>> Other solution: cluster similar disks together.
>>>>>
>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>> about its convergence properties.
>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>
>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>> to a locally optimum policy. I see
>>>>> no escape except by carefully designing the policy. All solutions to
>>>>> the problem are going to be non-linear
>>>>> since we must condition current probabilities on previous disk selections.
>>>>>
>>>>> - Although it can handle arbitrary constraints, it does so by rejecting
>>>>> disk selections that violate at least one constraint.
>>>>> This means that for bad policies it can spend all the time rejecting
>>>>> invalid disk selection candidates.
>>>>>
>>>>> Possible solution: the policy cannot be designed independently of the
>>>>> constraints. I don't know what constraints
>>>>> are typical use cases but having a look should be the first step. The
>>>>> constraints must be an input to the policy.
>>>>>
>>>>>
>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>> found the problem an interesting puzzle.
>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>
>>>> In Sage's paper[1] as well as in the Ceph implementation[2], minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks, and higher probabilities for bigger disks is used.
>>>>
>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>
>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>
>>>>         00     01     02     03     04     05     06     07     08     09     10
>>>> 00:      0     14     17     14     19     23     13     22     21     20   1800
>>>> 01:     12      0     11     13     19     19     15     10     16     17   1841
>>>> 02:     17     27      0     17     15     15     13     19     18     11   1813
>>>> 03:     14     17     15      0     23     11     20     15     23     17   1792
>>>> 04:     14     18     16     25      0     27     13      8     15     16   1771
>>>> 05:     19     16     22     25     13      0      9     19     21     21   1813
>>>> 06:     18     15     21     17     10     18      0     10     18     11   1873
>>>> 07:     13     17     22     13     16     17     14      0     25     12   1719
>>>> 08:     23     20     16     17     19     18     11     12      0     18   1830
>>>> 09:     14     20     15     17     12     16     17     11     13      0   1828
>>>> 10:      0      0      0      0      0      0      0      0      0      0      0
>>>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
>>>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080
>>>>
>>>> About 1% of the data movement happens between existing disks and serves no useful purpose, but the rest is objects moving from existing disks to the new one, which is what we need.
>>>>
>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>
>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does)?
>>>>
>>>> Cheers
>>>>
>>>>>
>>>>>
>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>> problem and a tentative solution which he described at :
>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>
>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>> means that the probability of each disk needs to take into account the
>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>> mistaken?
>>>>>>
>>>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>> disks changes the probability from 1/10 to 1/9).  The key property that
>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>> between two unmodified disks do not change, then straw2 will avoid
>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>> the disk that is reweighted).
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-25  0:38               ` Loic Dachary
@ 2017-02-25  8:41                 ` Pedro López-Adeva
  2017-02-25  9:02                   ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-02-25  8:41 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Great! Installed without problem and ran the example OK. I will
convert what I already have to use the library and continue from
there.
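
In case it helps make the plan concrete, the loop I have in mind is roughly the toy version below. It is pure python with no crush dependency yet: a naive weighted draw without replacement stands in for straw2, and the conversion will consist of replacing pick() with calls to the module. It is only a sketch of the simulate-and-update idea, not the notebook code:

import random

def pick(weights, n_rep, rng):
    # Naive weighted sampling without replacement, standing in for the
    # real placement algorithm; it reproduces the multipick skew.
    chosen, avail = [], list(range(len(weights)))
    for _ in range(n_rep):
        total = sum(weights[i] for i in avail)
        r = rng.uniform(0, total)
        acc = 0.0
        for i in avail:
            acc += weights[i]
            if r <= acc:
                chosen.append(i)
                avail.remove(i)
                break
    return chosen

def reweight(target, n_rep=2, rounds=20, n_objects=20000, step=1.0):
    # Monte Carlo version of the optimization: simulate placement with
    # the current internal weights, compare per-device utilization with
    # the target shares, nudge the weights, and repeat.
    rng = random.Random(42)
    weights = list(map(float, target))
    total = float(sum(target))
    for _ in range(rounds):
        counts = [0] * len(target)
        for _ in range(n_objects):
            for d in pick(weights, n_rep, rng):
                counts[d] += 1
        placed = float(sum(counts))
        for i, t in enumerate(target):
            observed = max(counts[i], 1) / placed
            wanted = t / total
            weights[i] *= (wanted / observed) ** step
    return weights

print(reweight([99, 99, 99, 99, 4]))

With the [99, 99, 99, 99, 4] example the small device ends up with an internal weight noticeably below 4, which is the same direction as Sage's conditional probability adjustment for the later picks.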


2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>> That, for validation, would be great. Until weekend I don't think I'm
>> going to have time to work on this anyway.
>
> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-25  8:41                 ` Pedro López-Adeva
@ 2017-02-25  9:02                   ` Loic Dachary
  2017-03-02  9:43                     ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-02-25  9:02 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel



On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
> Great! Installed without problem and ran the example OK. I will
> convert what I already have to use the library and continue from
> there.

Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
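
And if you want something like crushtool --test --show-utilization from python, the counting is simple enough; here is an untested sketch that assumes map() behaves as documented, the rest being plain bookkeeping:

from collections import Counter

def utilization(crush, rule, replication_count=2, n_values=100000):
    # One Counter per replica slot: per_slot[0] counts primaries,
    # per_slot[1] counts second replicas, and so on.
    per_slot = [Counter() for _ in range(replication_count)]
    for value in range(n_values):
        devices = crush.map(rule=rule, value=value,
                            replication_count=replication_count)
        for slot, device in enumerate(devices):
            per_slot[slot][device] += 1
    return per_slot

# first, second = utilization(c, "data")  # with `c` a parsed Crush object
# print(first.most_common())
# print(second.most_common())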

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-02-25  9:02                   ` Loic Dachary
@ 2017-03-02  9:43                     ` Loic Dachary
  2017-03-02  9:58                       ` Pedro López-Adeva
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-02  9:43 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

There is a new version of python-crush at https://pypi.python.org/pypi/crush which changes the layout of the crushmap, and the documentation was updated accordingly at http://crush.readthedocs.io/. Sorry for the inconvenience.

Cheers

On 02/25/2017 10:02 AM, Loic Dachary wrote:
> 
> 
> On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
>> Great! Installed without problem and ran the example OK. I will
>> convert what I already have to use the library and continue from
>> there.
> 
> Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
> 
> Cheers
> 
>>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>>
>>>>>>>>>>>     http://tracker.ceph.com/issues/15653
>>>>>>>>>>>
>>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>>
>>>>>>>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>>
>>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>>> discussion here.
>>>>>>>>>>>
>>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>>> brain hurt.
>>>>>>>>>>>
>>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>>
>>>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>
>>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>>
>>>>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>>
>>>>>>>>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>>>>
>>>>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>>
>>>>>>>>>>> Putting those together,
>>>>>>>>>>>
>>>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>>>>
>>>>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>>>>> that they weren't already chosen.
>>>>>>>>>>>
>>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>>> current code, you get
>>>>>>>>>>>
>>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>>>>
>>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>>> second.  With my hacky change,
>>>>>>>>>>>
>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>>>>
>>>>>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>>>>>> 1%).
>>>>>>>>>>>
>>>>>>>>>>> Next steps:
>>>>>>>>>>>
>>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-02  9:43                     ` Loic Dachary
@ 2017-03-02  9:58                       ` Pedro López-Adeva
  2017-03-02 10:31                         ` Loic Dachary
  2017-03-07 23:06                         ` Sage Weil
  0 siblings, 2 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-02  9:58 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi,

I will have a look. BTW, I have not progressed that much, but I have
been thinking about it. In order to adapt the previous algorithm in
the python notebook I need to replace the iteration over all possible
device permutations with an iteration over all the possible selections
that crush would make. That is the main thing I need to work on.
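
For illustration, a minimal sketch of that substitution, assuming the
picks behave like weighted draws without replacement (a toy model, the
names are made up, this is not crush code):

from itertools import permutations

def selection_probability(selection, weights):
    # probability that sequential weighted picks without replacement
    # produce exactly this ordered selection of devices
    p, remaining = 1.0, float(sum(weights))
    for d in selection:
        p *= weights[d] / remaining
        remaining -= weights[d]
    return p

def iter_selections(weights, num_rep):
    # iterate over the ordered selections of num_rep distinct devices
    # instead of over all device permutations
    for sel in permutations(range(len(weights)), num_rep):
        yield sel, selection_probability(sel, weights)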

The other thing is, of course, that the weights change for each
replica. That is, they cannot really be fixed in the crush map. So the
algorithm inside libcrush, not only the weights in the map, needs to
be changed. The weights in the crush map should then, maybe, reflect
the desired usage frequencies. Or maybe each replica should have its
own crush map, but then the information about the previous selection
should be passed to the next replica placement run so it avoids
selecting the same device again.

I have a question also. Is there any significant difference between
the device selection algorithm description in the paper and its final
implementation?

Cheers,
Pedro.

2017-03-02 10:43 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> There is a new version of python-crush at https://pypi.python.org/pypi/crush which changes the layout of the crushmap and the documentation was updated accordingly at http://crush.readthedocs.io/. Sorry for the inconvenience.
>
> Cheers
>
> On 02/25/2017 10:02 AM, Loic Dachary wrote:
>>
>>
>> On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
>>> Great! Installed without problem and ran the example OK. I will
>>> convert what I already have to use the library and continue from
>>> there.
>>
>> Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
>>
>> Cheers
>>
>>>
>>> 2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>>>>> That, for validation, would be great. Until weekend I don't think I'm
>>>>> going to have time to work on this anyway.
>>>>
>>>> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>>>>
>>>> Cheers
>>>>
>>>>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>
>>>>>>
>>>>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think your description of my proposed solution is quite good.
>>>>>>>
>>>>>>> I had a first look at Sage's paper but not ceph's implementation. My
>>>>>>> plan is to finish the paper and make an implementation in python that
>>>>>>> mimicks more closely ceph's algorithm.
>>>>>>>
>>>>>>> Regarding your question about data movement:
>>>>>>>
>>>>>>> If I understood the paper correctly what is happening right now is
>>>>>>> that when weights change on the devices some of them will become
>>>>>>> overloaded and the current algorithm will try to correct for that but
>>>>>>> this approach, I think, is independent of how we compute the weights
>>>>>>> for each device. My point is that the current data movement pattern
>>>>>>> will not be modified.
>>>>>>>
>>>>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>>>>> Maybe by making the probabilities non-stationary with the new disk
>>>>>>> getting at first very high probability and after each replica
>>>>>>> placement decrease it until it stabilizes to it's final value. But I'm
>>>>>>> just guessing and I really don't know if this can be made to work in a
>>>>>>> distributed manner as is currently the case and how would this fit in
>>>>>>> the current architecture. In any case it would be a problem as hard at
>>>>>>> least as the current reweighting problem.
>>>>>>>
>>>>>>> So, to summarize, my current plans:
>>>>>>>
>>>>>>> - Have another look at the paper
>>>>>>> - Make an implementation in python that imitates more closely the
>>>>>>> current algorithm
>>>>>>
>>>>>> What about I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to ? I think it would be generaly useful for experimenting and worth the effort. I can have that ready this weekend.
>>>>>>
>>>>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>>>>
>>>>>>> I will give updates here when there are significant changes so
>>>>>>> everyone can have a look and suggest improvements.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Pedro.
>>>>>>>
>>>>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>>>>> and arbitrary constraints.
>>>>>>>>>
>>>>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>>>>
>>>>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>>>>
>>>>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing objects replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk, there are two many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>>>>
>>>>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>>>>
>>>>>>>> This is part one of your document and in part two you focus on one constraints: no two replica on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less that 0.001 away from it.
>>>>>>>>
>>>>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>>>>> [3] https://en.wikipedia.org/wiki/Jacobi_elliptic_functions#Jacobi_elliptic_functions_as_solutions_of_nonlinear_ordinary_differential_equations
>>>>>>>>
>>>>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>>>>
>>>>>>>>> (Note: GitHub's renderization of the notebook and the PDF is quite
>>>>>>>>> deficient, I recommend downloading/cloning)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>>>>> selecting the first replica, the second replica, etc...
>>>>>>>>> In practical terms there are several problems:
>>>>>>>>>
>>>>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>>>>
>>>>>>>>> Possible solution: approximate summation over all possible disk
>>>>>>>>> selections with a Monte Carlo method.
>>>>>>>>> the algorithm would be: we start with a candidate solution, we run a
>>>>>>>>> simulation and based on the results
>>>>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>>>>
>>>>>>>>> Other solution: cluster similar disks together.
>>>>>>>>>
>>>>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>>>>> about it's convergence properties.
>>>>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>>>>
>>>>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>>>>> to a locally optimum policy. I see
>>>>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>>>>> the problem are going to be non linear
>>>>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>>>>
>>>>>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>>>>>> disks selections that violate at least one constraint.
>>>>>>>>> This means that for bad policies it can spend all the time rejecting
>>>>>>>>> invalid disks selection candidates.
>>>>>>>>>
>>>>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>>>>> constraints. I don't know what constraints
>>>>>>>>> are typical use cases but having a look should be the first step. The
>>>>>>>>> constraints must be an input to the policy.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>>>>> found the problem an interesting puzzle.
>>>>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>>>>>
>>>>>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>>>>>>
>>>>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>>>>
>>>>>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>>>>
>>>>>>>>         00     01     02     03     04     05     06     07     08     09     10
>>>>>>>> 00:      0     14     17     14     19     23     13     22     21     20   1800
>>>>>>>> 01:     12      0     11     13     19     19     15     10     16     17   1841
>>>>>>>> 02:     17     27      0     17     15     15     13     19     18     11   1813
>>>>>>>> 03:     14     17     15      0     23     11     20     15     23     17   1792
>>>>>>>> 04:     14     18     16     25      0     27     13      8     15     16   1771
>>>>>>>> 05:     19     16     22     25     13      0      9     19     21     21   1813
>>>>>>>> 06:     18     15     21     17     10     18      0     10     18     11   1873
>>>>>>>> 07:     13     17     22     13     16     17     14      0     25     12   1719
>>>>>>>> 08:     23     20     16     17     19     18     11     12      0     18   1830
>>>>>>>> 09:     14     20     15     17     12     16     17     11     13      0   1828
>>>>>>>> 10:      0      0      0      0      0      0      0      0      0      0      0
>>>>>>>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
>>>>>>>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080
>>>>>>>>
>>>>>>>> About 1% of the data movement happens between existing disks and serve no useful purpose but the rest are objects moving from existing disks to the new one which is what we need.
>>>>>>>>
>>>>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>>>>>
>>>>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>
>>>>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>>>>> means that the probability of each disk needs to take in account the
>>>>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>>>>> mistaken ?
>>>>>>>>>>
>>>>>>>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>>>>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>>>>> disks changes the probability from 1/10 to 1/9).  The key property that
>>>>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>>>>> between two unmodified disks does not change, then straw2 will avoid
>>>>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>>>>> the disk that is reweighted).
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>>>
>>>>>>>>>>>>     http://tracker.ceph.com/issues/15653
>>>>>>>>>>>>
>>>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>>>
>>>>>>>>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>>>
>>>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>>>> discussion here.
>>>>>>>>>>>>
>>>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>>>> brain hurt.
>>>>>>>>>>>>
>>>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>>>
>>>>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>
>>>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>>>
>>>>>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>>>
>>>>>>>>>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>>>>>
>>>>>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>>>
>>>>>>>>>>>> Putting those together,
>>>>>>>>>>>>
>>>>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>>>>>
>>>>>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>>>>>> that they weren't already chosen.
>>>>>>>>>>>>
>>>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>>>> current code, you get
>>>>>>>>>>>>
>>>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>>>>>
>>>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>>>> second.  With my hacky change,
>>>>>>>>>>>>
>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>>>>>
>>>>>>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>>>>>>> 1%).
>>>>>>>>>>>>
>>>>>>>>>>>> Next steps:
>>>>>>>>>>>>
>>>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-02  9:58                       ` Pedro López-Adeva
@ 2017-03-02 10:31                         ` Loic Dachary
  2017-03-07 23:06                         ` Sage Weil
  1 sibling, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-02 10:31 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel



On 03/02/2017 10:58 AM, Pedro López-Adeva wrote:
> Hi,
> 
> I will have a look. BTW, I have not progressed that much but I have
> been thinking about it. In order to adapt the previous algorithm in
> the python notebook I need to substitute the iteration over all
> possible devices permutations to iteration over all the possible
> selections that crush would make. That is the main thing I need to
> work on.

That should be easy.

> The other thing is of course that weights change for each replica.
> That is, they cannot be really fixed in the crush map. 

Do you mean that the weights for the replicas cannot be pre-calculated and stored in the crushmap before it is used for the actual object mapping ?

> So the
> algorithm inside libcrush, not only the weights in the map, need to be
> changed. The weights in the crush map should reflect then, maybe, the
> desired usage frequencies. Or maybe each replica should have their own
> crush map, but then the information about the previous selection
> should be passed to the next replica placement run so it avoids
> selecting the same one again.
> 
> I have a question also. Is there any significant difference between
> the device selection algorithm description in the paper and its final
> implementation?

The implementation of "Algorithm 1" is crush_do_rule and, although it looks the same, it is different[1]. Since the devil is in the details, I would refer to the current code. It is a little difficult to read because it contains parts that are only required for backward compatibility. You can assume vary_r == 1, stable == 1, chooseleaf_descend_once == 0, local_fallback_retries == 0 and ignore all the parts that are not in that code path.

[1] crush_do_rule http://libcrush.org/main/libcrush/blob/master/crush/mapper.c#L852

> Cheers,
> Pedro.
> 
> 2017-03-02 10:43 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> There is a new version of python-crush at https://pypi.python.org/pypi/crush which changes the layout of the crushmap and the documentation was updated accordingly at http://crush.readthedocs.io/. Sorry for the inconvenience.
>>
>> Cheers
>>
>> On 02/25/2017 10:02 AM, Loic Dachary wrote:
>>>
>>>
>>> On 02/25/2017 09:41 AM, Pedro López-Adeva wrote:
>>>> Great! Installed without problem and ran the example OK. I will
>>>> convert what I already have to use the library and continue from
>>>> there.
>>>
>>> Cool :-) http://crush.readthedocs.io/en/latest/api.html is a complete reference of the crushmap structure, let me know if something is missing.
>>>
>>> Cheers
>>>
>>>>
>>>> 2017-02-25 1:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> On 02/22/2017 12:46 PM, Pedro López-Adeva wrote:
>>>>>> That, for validation, would be great. Until weekend I don't think I'm
>>>>>> going to have time to work on this anyway.
>>>>>
>>>>> An initial version of the module is ready and documented at http://crush.readthedocs.io/en/latest/.
>>>>>
>>>>> Cheers
>>>>>
>>>>>> 2017-02-22 12:38 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>
>>>>>>>
>>>>>>> On 02/22/2017 12:26 PM, Pedro López-Adeva wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I think your description of my proposed solution is quite good.
>>>>>>>>
>>>>>>>> I had a first look at Sage's paper but not ceph's implementation. My
>>>>>>>> plan is to finish the paper and make an implementation in python that
>>>>>>>> mimicks more closely ceph's algorithm.
>>>>>>>>
>>>>>>>> Regarding your question about data movement:
>>>>>>>>
>>>>>>>> If I understood the paper correctly what is happening right now is
>>>>>>>> that when weights change on the devices some of them will become
>>>>>>>> overloaded and the current algorithm will try to correct for that but
>>>>>>>> this approach, I think, is independent of how we compute the weights
>>>>>>>> for each device. My point is that the current data movement pattern
>>>>>>>> will not be modified.
>>>>>>>>
>>>>>>>> Could the data movement algorithm be improved? Maybe. I don't know.
>>>>>>>> Maybe by making the probabilities non-stationary with the new disk
>>>>>>>> getting at first very high probability and after each replica
>>>>>>>> placement decrease it until it stabilizes to it's final value. But I'm
>>>>>>>> just guessing and I really don't know if this can be made to work in a
>>>>>>>> distributed manner as is currently the case and how would this fit in
>>>>>>>> the current architecture. In any case it would be a problem as hard at
>>>>>>>> least as the current reweighting problem.
>>>>>>>>
>>>>>>>> So, to summarize, my current plans:
>>>>>>>>
>>>>>>>> - Have another look at the paper
>>>>>>>> - Make an implementation in python that imitates more closely the
>>>>>>>> current algorithm
>>>>>>>
>>>>>>> What about I provide you with a python module that includes the current crush implementation (wrapping the C library into a python module) so you don't have to ? I think it would be generaly useful for experimenting and worth the effort. I can have that ready this weekend.
>>>>>>>
>>>>>>>> - Make sure the new reweighting algorithm is fast and gives the desired results
>>>>>>>>
>>>>>>>> I will give updates here when there are significant changes so
>>>>>>>> everyone can have a look and suggest improvements.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Pedro.
>>>>>>>>
>>>>>>>> 2017-02-22 8:52 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> On 02/16/2017 11:04 PM, Pedro López-Adeva wrote:
>>>>>>>>>> I have updated the algorithm to handle an arbitrary number of replicas
>>>>>>>>>> and arbitrary constraints.
>>>>>>>>>>
>>>>>>>>>> Notebook: https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>> PDF: https://github.com/plafl/notebooks/blob/master/converted/replication.pdf
>>>>>>>>>
>>>>>>>>> I'm very impressed :-) Thanks to friends who helped with the maths parts that were unknown to me I think I now get the spirit of the solution you found. Here it is, in my own words.
>>>>>>>>>
>>>>>>>>> You wrote a family of functions describing the desired outcome: equally filled disks when distributing objects replicas with a constraint. It's not a formula we can use to figure out which probability to assign to each disk, there are two many unknowns. But you also proposed a function to measure, for a given set of probabilities, how far from the best probabilities they are. That's the loss function[1].
>>>>>>>>>
>>>>>>>>> You implemented an abstract python interface to look for the best solution, using this loss function. Trying things at random would take way too much time. Instead you use the gradient[2] of the function to figure out in which direction the values should be modified (that's where the jacobian[3] helps).
>>>>>>>>>
>>>>>>>>> This is part one of your document and in part two you focus on one constraints: no two replica on the same disk. And with an implementation of the abstract interface you show with a few examples that after iterating a number of times you get a set of probabilities that are close enough to the solution. Not the ideal solution but less that 0.001 away from it.
>>>>>>>>>
>>>>>>>>> [1] https://en.wikipedia.org/wiki/Loss_function
>>>>>>>>> [2] https://en.wikipedia.org/wiki/Gradient
>>>>>>>>> [3] https://en.wikipedia.org/wiki/Jacobi_elliptic_functions#Jacobi_elliptic_functions_as_solutions_of_nonlinear_ordinary_differential_equations
>>>>>>>>>
>>>>>>>>> From the above you can hopefully see how far off my understanding is. And I have one question below.
>>>>>>>>>
>>>>>>>>>> (Note: GitHub's renderization of the notebook and the PDF is quite
>>>>>>>>>> deficient, I recommend downloading/cloning)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In the following by policy I mean the concrete set of probabilities of
>>>>>>>>>> selecting the first replica, the second replica, etc...
>>>>>>>>>> In practical terms there are several problems:
>>>>>>>>>>
>>>>>>>>>> - It's not practical for a high number of disks or replicas.
>>>>>>>>>>
>>>>>>>>>> Possible solution: approximate summation over all possible disk
>>>>>>>>>> selections with a Monte Carlo method.
>>>>>>>>>> the algorithm would be: we start with a candidate solution, we run a
>>>>>>>>>> simulation and based on the results
>>>>>>>>>> we update the probabilities. Repeat until we are happy with the result.
>>>>>>>>>>
>>>>>>>>>> Other solution: cluster similar disks together.
>>>>>>>>>>
>>>>>>>>>> - Since it's a non-linear optimization problem I'm not sure right now
>>>>>>>>>> about it's convergence properties.
>>>>>>>>>> Does it converge to a global optimum? How fast does it converge?
>>>>>>>>>>
>>>>>>>>>> Possible solution: the algorithm always converges, but it can converge
>>>>>>>>>> to a locally optimum policy. I see
>>>>>>>>>> no escape except by carefully designing the policy. All solutions to
>>>>>>>>>> the problem are going to be non linear
>>>>>>>>>> since we must condition current probabilities on previous disk selections.
>>>>>>>>>>
>>>>>>>>>> - Although it can handle arbitrary constraints it does so by rejecting
>>>>>>>>>> disks selections that violate at least one constraint.
>>>>>>>>>> This means that for bad policies it can spend all the time rejecting
>>>>>>>>>> invalid disks selection candidates.
>>>>>>>>>>
>>>>>>>>>> Possible solution: the policy cannot be designed independently of the
>>>>>>>>>> constraints. I don't know what constraints
>>>>>>>>>> are typical use cases but having a look should be the first step. The
>>>>>>>>>> constraints must be an input to the policy.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I hope it's of some use. Quite frankly I'm not a ceph user, I just
>>>>>>>>>> found the problem an interesting puzzle.
>>>>>>>>>> Anyway I will try to have a look at the CRUSH paper this weekend.
>>>>>>>>>
>>>>>>>>> In Sage's paper[1] as well as in the Ceph implementation[2] minimizing data movement when a disk is added / removed is an important goal. When looking for a disk to place an object, a mixture of hashing, recursive exploration of a hierarchy describing the racks/hosts/disks and higher probabilities for bigger disks are used.
>>>>>>>>>
>>>>>>>>> [1] http://www.crss.ucsc.edu/media/papers/weil-sc06.pdf
>>>>>>>>> [2] https://github.com/ceph/ceph/tree/master/src/crush
>>>>>>>>>
>>>>>>>>> Here is an example[1] showing how data move around with the current implementation when adding one disk to a 10 disk host (all disks have the same probability of being chosen but no two copies of the same object can be on the same disk) with 100,000 objects and replica 2. The first line reads like this: 14 objects moved from disk 00 to disk 01, 17 objects moved from disk 00 to disk 02 ... 1800 objects moved from disk 00 to disk 10. The "before:" line shows how many objects were in each disk before the new one was added, the "after:" line shows the distribution after the disk was added and objects moved from the existing disks to the new disk.
>>>>>>>>>
>>>>>>>>>         00     01     02     03     04     05     06     07     08     09     10
>>>>>>>>> 00:      0     14     17     14     19     23     13     22     21     20   1800
>>>>>>>>> 01:     12      0     11     13     19     19     15     10     16     17   1841
>>>>>>>>> 02:     17     27      0     17     15     15     13     19     18     11   1813
>>>>>>>>> 03:     14     17     15      0     23     11     20     15     23     17   1792
>>>>>>>>> 04:     14     18     16     25      0     27     13      8     15     16   1771
>>>>>>>>> 05:     19     16     22     25     13      0      9     19     21     21   1813
>>>>>>>>> 06:     18     15     21     17     10     18      0     10     18     11   1873
>>>>>>>>> 07:     13     17     22     13     16     17     14      0     25     12   1719
>>>>>>>>> 08:     23     20     16     17     19     18     11     12      0     18   1830
>>>>>>>>> 09:     14     20     15     17     12     16     17     11     13      0   1828
>>>>>>>>> 10:      0      0      0      0      0      0      0      0      0      0      0
>>>>>>>>> before:  20164  19990  19863  19959  19977  20004  19926  20133  20041  19943      0
>>>>>>>>> after:   18345  18181  18053  18170  18200  18190  18040  18391  18227  18123  18080
>>>>>>>>>
>>>>>>>>> About 1% of the data movement happens between existing disks and serve no useful purpose but the rest are objects moving from existing disks to the new one which is what we need.
>>>>>>>>>
>>>>>>>>> [1] http://libcrush.org/dachary/libcrush/blob/wip-sheepdog/compare.c
>>>>>>>>>
>>>>>>>>> Would it be possible to somehow reconcile the two goals: equally filled disks (which your solution does) and minimizing data movement (which crush does) ?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2017-02-13 15:21 GMT+01:00 Sage Weil <sweil@redhat.com>:
>>>>>>>>>>> On Mon, 13 Feb 2017, Loic Dachary wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Dan van der Ster reached out to colleagues and friends and Pedro
>>>>>>>>>>>> López-Adeva Fernández-Layos came up with a well written analysis of the
>>>>>>>>>>>> problem and a tentative solution which he described at :
>>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>>
>>>>>>>>>>>> Unless I'm reading the document incorrectly (very possible ;) it also
>>>>>>>>>>>> means that the probability of each disk needs to take in account the
>>>>>>>>>>>> weight of all disks. Which means that whenever a disk is added / removed
>>>>>>>>>>>> or its weight is changed, this has an impact on the probability of all
>>>>>>>>>>>> disks in the cluster and objects are likely to move everywhere. Am I
>>>>>>>>>>>> mistaken ?
>>>>>>>>>>>
>>>>>>>>>>> Maybe (I haven't looked closely at the above yet).  But for comparison, in
>>>>>>>>>>> the normal straw2 case, adding or removing a disk also changes the
>>>>>>>>>>> probabilities for everything else (e.g., removing one out of 10 identical
>>>>>>>>>>> disks changes the probability from 1/10 to 1/9).  The key property that
>>>>>>>>>>> straw2 *is* able to handle is that as long as the relative probabilities
>>>>>>>>>>> between two unmodified disks does not change, then straw2 will avoid
>>>>>>>>>>> moving any objects between them (i.e., all data movement is to or from
>>>>>>>>>>> the disk that is reweighted).
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>>>>>>> This is a longstanding bug,
>>>>>>>>>>>>>
>>>>>>>>>>>>>     http://tracker.ceph.com/issues/15653
>>>>>>>>>>>>>
>>>>>>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>>>>>>
>>>>>>>>>>>>>     https://github.com/ceph/ceph/pull/10218
>>>>>>>>>>>>>
>>>>>>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>>>>>>> discussion here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>>>>>>> brain hurt.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>>>>>>>
>>>>>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The last term is easy to calculate,
>>>>>>>>>>>>>
>>>>>>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>>>>>>
>>>>>>>>>>>>> and the && term is the distribution we're trying to produce.  For exmaple,
>>>>>>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>>>>>>
>>>>>>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>>>>>>
>>>>>>>>>>>>> Putting those together,
>>>>>>>>>>>>>
>>>>>>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>>>>>>> that they weren't already chosen.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>>>>>>>> current code, you get
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>>>>>>
>>>>>>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>>>>>>> second.  With my hacky change,
>>>>>>>>>>>>>
>>>>>>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>>>>>>
>>>>>>>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>>>>>>>> 1%).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Next steps:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>>>>>>> 2- figure out why it skews high
>>>>>>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>>>>>>
>>>>>>>>>>>>> sage
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-02  9:58                       ` Pedro López-Adeva
  2017-03-02 10:31                         ` Loic Dachary
@ 2017-03-07 23:06                         ` Sage Weil
  2017-03-09  8:47                           ` Pedro López-Adeva
  1 sibling, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-07 23:06 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Loic Dachary, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2112 bytes --]

Hi Pedro,

Thanks for taking a look at this!  It's a frustrating problem and we 
haven't made much headway.

On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> Hi,
> 
> I will have a look. BTW, I have not progressed that much but I have
> been thinking about it. In order to adapt the previous algorithm in
> the python notebook I need to substitute the iteration over all
> possible devices permutations to iteration over all the possible
> selections that crush would make. That is the main thing I need to
> work on.
> 
> The other thing is of course that weights change for each replica.
> That is, they cannot be really fixed in the crush map. So the
> algorithm inside libcrush, not only the weights in the map, need to be
> changed. The weights in the crush map should reflect then, maybe, the
> desired usage frequencies. Or maybe each replica should have their own
> crush map, but then the information about the previous selection
> should be passed to the next replica placement run so it avoids
> selecting the same one again.

My suspicion is that the best solution here (whatever that means!) 
leaves the CRUSH weights intact with the desired distribution, and 
then generates a set of derivative weights--probably one set for each 
round/replica/rank.

One nice property of this is that once the support is added to encode 
multiple sets of weights, the algorithm used to generate them is free to 
change and evolve independently.  (In most cases any change in
CRUSH's mapping behavior is difficult to roll out because all
parties participating in the cluster have to support any new behavior 
before it is enabled or used.)
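
As a rough sketch of what that could look like (a toy straw2-style
model with a made-up hash, not the actual libcrush code; weight_sets
is the hypothetical per-rank table):

import hashlib, math

def u(x, item, r):
    # deterministic stand-in for crush's hash, value in (0, 1]
    h = hashlib.sha256(f"{x}-{item}-{r}".encode()).hexdigest()
    return (int(h[:8], 16) + 1) / 2**32

def straw2_choose(x, items, weights, r):
    # straw2-like draw: every item gets a "straw" and the longest wins
    return max((i for i in items if weights[i] > 0),
               key=lambda i: math.log(u(x, i, r)) / weights[i])

def map_x(x, items, weight_sets, num_rep, max_retries=50):
    # weight_sets[rank] is the derivative weight vector for that rank;
    # the CRUSH weights themselves stay untouched in the map
    out = []
    for rank in range(num_rep):
        for attempt in range(max_retries):
            item = straw2_choose(x, items, weight_sets[rank],
                                 rank + attempt * num_rep)
            if item not in out:
                out.append(item)
                break
    return out

weight_sets[0] could simply be the CRUSH weights themselves, and the
later vectors whatever the offline computation produces.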

> I have a question also. Is there any significant difference between
> the device selection algorithm description in the paper and its final
> implementation?

The main difference is that the "retry_bucket" behavior was found to be a bad 
idea; any collision or failed()/overload() case triggers the 
retry_descent.

There are other changes, of course, but I don't think they'll impact any 
solution we come up with here (or at least any solution can be suitably 
adapted)!

sage

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-07 23:06                         ` Sage Weil
@ 2017-03-09  8:47                           ` Pedro López-Adeva
  2017-03-18  9:21                             ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-09  8:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel

Great, thanks for the clarifications.
I also think that the most natural way is to keep just a set of
weights in the CRUSH map and update them inside the algorithm.

I'll keep working on it.


2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
> Hi Pedro,
>
> Thanks for taking a look at this!  It's a frustrating problem and we
> haven't made much headway.
>
> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> Hi,
>>
>> I will have a look. BTW, I have not progressed that much but I have
>> been thinking about it. In order to adapt the previous algorithm in
>> the python notebook I need to substitute the iteration over all
>> possible devices permutations to iteration over all the possible
>> selections that crush would make. That is the main thing I need to
>> work on.
>>
>> The other thing is of course that weights change for each replica.
>> That is, they cannot be really fixed in the crush map. So the
>> algorithm inside libcrush, not only the weights in the map, need to be
>> changed. The weights in the crush map should reflect then, maybe, the
>> desired usage frequencies. Or maybe each replica should have their own
>> crush map, but then the information about the previous selection
>> should be passed to the next replica placement run so it avoids
>> selecting the same one again.
>
> My suspicion is that the best solution here (whatever that means!)
> leaves the CRUSH weights intact with the desired distribution, and
> then generates a set of derivative weights--probably one set for each
> round/replica/rank.
>
> One nice property of this is that once the support is added to encode
> multiple sets of weights, the algorithm used to generate them is free to
> change and evolve independently.  (In most cases any change is
> CRUSH's mapping behavior is difficult to roll out because all
> parties participating in the cluster have to support any new behavior
> before it is enabled or used.)
>
>> I have a question also. Is there any significant difference between
>> the device selection algorithm description in the paper and its final
>> implementation?
>
> The main difference is the "retry_bucket" behavior was found to be a bad
> idea; any collision or failed()/overload() case triggers the
> retry_descent.
>
> There are other changes, of course, but I don't think they'll impact any
> solution we come with here (or at least any solution can be suitably
> adapted)!
>
> sage

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-09  8:47                           ` Pedro López-Adeva
@ 2017-03-18  9:21                             ` Loic Dachary
  2017-03-19 22:31                               ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-18  9:21 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

I'm going to experiment with what you did at

https://github.com/plafl/notebooks/blob/master/replication.ipynb

and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
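
As a rough sketch, such a comparison could look like the toy function
below; it only assumes two callables that map an object with the old
and the new crushmap, not any particular python-crush API:

def compare_mappings(map_before, map_after, num_objects, new_devices=()):
    # count objects whose device set changed, and how many of those
    # moves happen between existing devices only (movement that serves
    # no useful purpose when the goal is to fill the new devices)
    new = set(new_devices)
    moved = between_existing = 0
    for x in range(num_objects):
        before, after = set(map_before(x)), set(map_after(x))
        if before == after:
            continue
        moved += 1
        if not ((after - before) & new):
            between_existing += 1
    return moved, between_existing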

Cheers

On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> Great, thanks for the clarifications.
> I also think that the most natural way is to keep just a set of
> weights in the CRUSH map and update them inside the algorithm.
> 
> I keep working on it.
> 
> 
> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>> Hi Pedro,
>>
>> Thanks for taking a look at this!  It's a frustrating problem and we
>> haven't made much headway.
>>
>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> Hi,
>>>
>>> I will have a look. BTW, I have not progressed that much but I have
>>> been thinking about it. In order to adapt the previous algorithm in
>>> the python notebook I need to substitute the iteration over all
>>> possible devices permutations to iteration over all the possible
>>> selections that crush would make. That is the main thing I need to
>>> work on.
>>>
>>> The other thing is of course that weights change for each replica.
>>> That is, they cannot be really fixed in the crush map. So the
>>> algorithm inside libcrush, not only the weights in the map, need to be
>>> changed. The weights in the crush map should reflect then, maybe, the
>>> desired usage frequencies. Or maybe each replica should have their own
>>> crush map, but then the information about the previous selection
>>> should be passed to the next replica placement run so it avoids
>>> selecting the same one again.
>>
>> My suspicion is that the best solution here (whatever that means!)
>> leaves the CRUSH weights intact with the desired distribution, and
>> then generates a set of derivative weights--probably one set for each
>> round/replica/rank.
>>
>> One nice property of this is that once the support is added to encode
>> multiple sets of weights, the algorithm used to generate them is free to
>> change and evolve independently.  (In most cases any change is
>> CRUSH's mapping behavior is difficult to roll out because all
>> parties participating in the cluster have to support any new behavior
>> before it is enabled or used.)
>>
>>> I have a question also. Is there any significant difference between
>>> the device selection algorithm description in the paper and its final
>>> implementation?
>>
>> The main difference is the "retry_bucket" behavior was found to be a bad
>> idea; any collision or failed()/overload() case triggers the
>> retry_descent.
>>
>> There are other changes, of course, but I don't think they'll impact any
>> solution we come with here (or at least any solution can be suitably
>> adapted)!
>>
>> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-18  9:21                             ` Loic Dachary
@ 2017-03-19 22:31                               ` Loic Dachary
  2017-03-20 10:49                                 ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-19 22:31 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?

Cheers

On 03/18/2017 10:21 AM, Loic Dachary wrote:
> Hi Pedro,
> 
> I'm going to experiment with what you did at
> 
> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> 
> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
> 
> Cheers
> 
> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> Great, thanks for the clarifications.
>> I also think that the most natural way is to keep just a set of
>> weights in the CRUSH map and update them inside the algorithm.
>>
>> I keep working on it.
>>
>>
>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>> Hi Pedro,
>>>
>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>> haven't made much headway.
>>>
>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>> Hi,
>>>>
>>>> I will have a look. BTW, I have not progressed that much but I have
>>>> been thinking about it. In order to adapt the previous algorithm in
>>>> the python notebook I need to substitute the iteration over all
>>>> possible devices permutations to iteration over all the possible
>>>> selections that crush would make. That is the main thing I need to
>>>> work on.
>>>>
>>>> The other thing is of course that weights change for each replica.
>>>> That is, they cannot be really fixed in the crush map. So the
>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>> desired usage frequencies. Or maybe each replica should have their own
>>>> crush map, but then the information about the previous selection
>>>> should be passed to the next replica placement run so it avoids
>>>> selecting the same one again.
>>>
>>> My suspicion is that the best solution here (whatever that means!)
>>> leaves the CRUSH weights intact with the desired distribution, and
>>> then generates a set of derivative weights--probably one set for each
>>> round/replica/rank.
>>>
>>> One nice property of this is that once the support is added to encode
>>> multiple sets of weights, the algorithm used to generate them is free to
>>> change and evolve independently.  (In most cases any change is
>>> CRUSH's mapping behavior is difficult to roll out because all
>>> parties participating in the cluster have to support any new behavior
>>> before it is enabled or used.)
>>>
>>>> I have a question also. Is there any significant difference between
>>>> the device selection algorithm description in the paper and its final
>>>> implementation?
>>>
>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>> idea; any collision or failed()/overload() case triggers the
>>> retry_descent.
>>>
>>> There are other changes, of course, but I don't think they'll impact any
>>> solution we come with here (or at least any solution can be suitably
>>> adapted)!
>>>
>>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-19 22:31                               ` Loic Dachary
@ 2017-03-20 10:49                                 ` Loic Dachary
  2017-03-23 11:49                                   ` Pedro López-Adeva
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-20 10:49 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi,

I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?

Thanks !
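
In case it helps to read the numbers below (they come from the real
libcrush hack driven by a variation of your notebook code), the hack
boils down to the following toy model. This is illustrative Python
only, with made-up names, not the libcrush change itself:

- BEGIN CODE -
# Toy model of the two-weight hack: the first replica is drawn with
# one weight vector, the remaining replicas with another, and a device
# is never chosen twice for the same object.
import random
from collections import Counter

def simulate(first_weights, rest_weights, num_rep, samples):
    devices = list(range(len(first_weights)))
    use = Counter()
    for _ in range(samples):
        chosen = []
        for r in range(num_rep):
            w = first_weights if r == 0 else rest_weights
            cand = [d for d in devices if d not in chosen]
            chosen.append(random.choices(cand, [w[d] for d in cand])[0])
        use.update(chosen)
    return use  # use[d] / (samples * num_rep) ~ observed frequency of d
- END CODE -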

Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
------------------------------------------------------------------------
Before: All replicas on each hard drive
Expected vs actual use (20000 samples)
 disk 0: 1.39e-01 1.12e-01
 disk 1: 1.11e-01 1.10e-01
 disk 2: 8.33e-02 1.13e-01
 disk 3: 1.39e-01 1.11e-01
 disk 4: 1.11e-01 1.11e-01
 disk 5: 8.33e-02 1.11e-01
 disk 6: 1.39e-01 1.12e-01
 disk 7: 1.11e-01 1.12e-01
 disk 8: 8.33e-02 1.10e-01
it=    1 jac norm=1.59e-01 loss=5.27e-03
it=    2 jac norm=1.55e-01 loss=5.03e-03
...
it=  212 jac norm=1.02e-03 loss=2.41e-07
it=  213 jac norm=1.00e-03 loss=2.31e-07
Converged to desired accuracy :)
After: All replicas on each hard drive
Expected vs actual use (20000 samples)
 disk 0: 1.39e-01 1.42e-01
 disk 1: 1.11e-01 1.09e-01
 disk 2: 8.33e-02 8.37e-02
 disk 3: 1.39e-01 1.40e-01
 disk 4: 1.11e-01 1.13e-01
 disk 5: 8.33e-02 8.08e-02
 disk 6: 1.39e-01 1.38e-01
 disk 7: 1.11e-01 1.09e-01
 disk 8: 8.33e-02 8.48e-02


Simulation: R=2 devices capacity [10 10 10 10  1]
------------------------------------------------------------------------
Before: All replicas on each hard drive
Expected vs actual use (20000 samples)
 disk 0: 2.44e-01 2.36e-01
 disk 1: 2.44e-01 2.38e-01
 disk 2: 2.44e-01 2.34e-01
 disk 3: 2.44e-01 2.38e-01
 disk 4: 2.44e-02 5.37e-02
it=    1 jac norm=2.43e-01 loss=2.98e-03
it=    2 jac norm=2.28e-01 loss=2.47e-03
...
it=   37 jac norm=1.28e-03 loss=3.48e-08
it=   38 jac norm=1.07e-03 loss=2.42e-08
Converged to desired accuracy :)
After: All replicas on each hard drive
Expected vs actual use (20000 samples)
 disk 0: 2.44e-01 2.46e-01
 disk 1: 2.44e-01 2.44e-01
 disk 2: 2.44e-01 2.41e-01
 disk 3: 2.44e-01 2.45e-01
 disk 4: 2.44e-02 2.33e-02


[1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
[2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68

On 03/19/2017 11:31 PM, Loic Dachary wrote:
> Hi Pedro,
> 
> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
> 
> Cheers
> 
> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> Hi Pedro,
>>
>> I'm going to experiment with what you did at
>>
>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>
>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>
>> Cheers
>>
>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> Great, thanks for the clarifications.
>>> I also think that the most natural way is to keep just a set of
>>> weights in the CRUSH map and update them inside the algorithm.
>>>
>>> I keep working on it.
>>>
>>>
>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>> Hi Pedro,
>>>>
>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>> haven't made much headway.
>>>>
>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>> Hi,
>>>>>
>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>> the python notebook I need to substitute the iteration over all
>>>>> possible devices permutations to iteration over all the possible
>>>>> selections that crush would make. That is the main thing I need to
>>>>> work on.
>>>>>
>>>>> The other thing is of course that weights change for each replica.
>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>> crush map, but then the information about the previous selection
>>>>> should be passed to the next replica placement run so it avoids
>>>>> selecting the same one again.
>>>>
>>>> My suspicion is that the best solution here (whatever that means!)
>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>> then generates a set of derivative weights--probably one set for each
>>>> round/replica/rank.
>>>>
>>>> One nice property of this is that once the support is added to encode
>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>> change and evolve independently.  (In most cases any change is
>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>> parties participating in the cluster have to support any new behavior
>>>> before it is enabled or used.)
>>>>
>>>>> I have a question also. Is there any significant difference between
>>>>> the device selection algorithm description in the paper and its final
>>>>> implementation?
>>>>
>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>> idea; any collision or failed()/overload() case triggers the
>>>> retry_descent.
>>>>
>>>> There are other changes, of course, but I don't think they'll impact any
>>>> solution we come with here (or at least any solution can be suitably
>>>> adapted)!
>>>>
>>>> sage
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-20 10:49                                 ` Loic Dachary
@ 2017-03-23 11:49                                   ` Pedro López-Adeva
  2017-03-23 14:13                                     ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-23 11:49 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

Hi Loic,

From what I see everything seems OK. The interesting thing would be to
test it on some complex mapping. The reason is that "CrushPolicyFamily"
is right now modeling just a single straw bucket, not the full CRUSH
algorithm. That's the work that remains to be done. The only way to
avoid reimplementing the CRUSH algorithm and computing its gradient
would be to treat CRUSH as a black box and do without the exact
gradient, either by using a gradient-free optimization method or by
estimating the gradient.



2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi,
>
> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>
> Thanks !
>
> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
> ------------------------------------------------------------------------
> Before: All replicas on each hard drive
> Expected vs actual use (20000 samples)
>  disk 0: 1.39e-01 1.12e-01
>  disk 1: 1.11e-01 1.10e-01
>  disk 2: 8.33e-02 1.13e-01
>  disk 3: 1.39e-01 1.11e-01
>  disk 4: 1.11e-01 1.11e-01
>  disk 5: 8.33e-02 1.11e-01
>  disk 6: 1.39e-01 1.12e-01
>  disk 7: 1.11e-01 1.12e-01
>  disk 8: 8.33e-02 1.10e-01
> it=    1 jac norm=1.59e-01 loss=5.27e-03
> it=    2 jac norm=1.55e-01 loss=5.03e-03
> ...
> it=  212 jac norm=1.02e-03 loss=2.41e-07
> it=  213 jac norm=1.00e-03 loss=2.31e-07
> Converged to desired accuracy :)
> After: All replicas on each hard drive
> Expected vs actual use (20000 samples)
>  disk 0: 1.39e-01 1.42e-01
>  disk 1: 1.11e-01 1.09e-01
>  disk 2: 8.33e-02 8.37e-02
>  disk 3: 1.39e-01 1.40e-01
>  disk 4: 1.11e-01 1.13e-01
>  disk 5: 8.33e-02 8.08e-02
>  disk 6: 1.39e-01 1.38e-01
>  disk 7: 1.11e-01 1.09e-01
>  disk 8: 8.33e-02 8.48e-02
>
>
> Simulation: R=2 devices capacity [10 10 10 10  1]
> ------------------------------------------------------------------------
> Before: All replicas on each hard drive
> Expected vs actual use (20000 samples)
>  disk 0: 2.44e-01 2.36e-01
>  disk 1: 2.44e-01 2.38e-01
>  disk 2: 2.44e-01 2.34e-01
>  disk 3: 2.44e-01 2.38e-01
>  disk 4: 2.44e-02 5.37e-02
> it=    1 jac norm=2.43e-01 loss=2.98e-03
> it=    2 jac norm=2.28e-01 loss=2.47e-03
> ...
> it=   37 jac norm=1.28e-03 loss=3.48e-08
> it=   38 jac norm=1.07e-03 loss=2.42e-08
> Converged to desired accuracy :)
> After: All replicas on each hard drive
> Expected vs actual use (20000 samples)
>  disk 0: 2.44e-01 2.46e-01
>  disk 1: 2.44e-01 2.44e-01
>  disk 2: 2.44e-01 2.41e-01
>  disk 3: 2.44e-01 2.45e-01
>  disk 4: 2.44e-02 2.33e-02
>
>
> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>
> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> Hi Pedro,
>>
>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>
>> Cheers
>>
>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> Hi Pedro,
>>>
>>> I'm going to experiment with what you did at
>>>
>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>
>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>
>>> Cheers
>>>
>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>> Great, thanks for the clarifications.
>>>> I also think that the most natural way is to keep just a set of
>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>
>>>> I keep working on it.
>>>>
>>>>
>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>> Hi Pedro,
>>>>>
>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>> haven't made much headway.
>>>>>
>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>> the python notebook I need to substitute the iteration over all
>>>>>> possible devices permutations to iteration over all the possible
>>>>>> selections that crush would make. That is the main thing I need to
>>>>>> work on.
>>>>>>
>>>>>> The other thing is of course that weights change for each replica.
>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>> crush map, but then the information about the previous selection
>>>>>> should be passed to the next replica placement run so it avoids
>>>>>> selecting the same one again.
>>>>>
>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>> then generates a set of derivative weights--probably one set for each
>>>>> round/replica/rank.
>>>>>
>>>>> One nice property of this is that once the support is added to encode
>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>> change and evolve independently.  (In most cases any change is
>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>> parties participating in the cluster have to support any new behavior
>>>>> before it is enabled or used.)
>>>>>
>>>>>> I have a question also. Is there any significant difference between
>>>>>> the device selection algorithm description in the paper and its final
>>>>>> implementation?
>>>>>
>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>> idea; any collision or failed()/overload() case triggers the
>>>>> retry_descent.
>>>>>
>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>> solution we come with here (or at least any solution can be suitably
>>>>> adapted)!
>>>>>
>>>>> sage
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-23 11:49                                   ` Pedro López-Adeva
@ 2017-03-23 14:13                                     ` Loic Dachary
  2017-03-23 15:32                                       ` Pedro López-Adeva
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-03-23 14:13 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> Hi Loic,
> 
> From what I see everything seems OK.

Cool. I'll keep going in this direction then !

> The interesting thing would be to
> test on some complex mapping. The reason is that "CrushPolicyFamily"
> is right now modeling just a single straw bucket not the full CRUSH
> algorithm. 

A number of use cases use a single straw bucket, maybe the majority of them. Even though that does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts, and by using the CrushPolicyFamily we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights, but I think we can ignore that because crush will only recurse to place one object within a given host.
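
To illustrate the reduction (a sketch with made-up names, not actual crush code): the items of the bucket become the hosts, each weighted by the sum of its disks, and we pick R distinct hosts from that flat bucket.

- BEGIN CODE -
# Sketch of the reduction: a "one replica per host" rule collapses to
# one flat bucket whose items are the hosts, each weighted by the sum
# of its disks; the choice of a disk inside a host is ignored here.
import random

hosts = {'host-a': [10, 8, 6], 'host-b': [10, 8, 6], 'host-c': [10, 8, 6]}
bucket = {h: sum(disks) for h, disks in hosts.items()}

def pick_hosts(bucket, R):
    chosen = []
    for _ in range(R):
        cand = [h for h in bucket if h not in chosen]
        chosen.append(random.choices(cand, [bucket[h] for h in cand])[0])
    return chosen
- END CODE -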

> That's the work that remains to be done. The only way that
> would avoid reimplementing the CRUSH algorithm and computing the
> gradient would be treating CRUSH as a black box and eliminating the
> necessity of computing the gradient either by using a gradient-free
> optimization method or making an estimation of the gradient.

By gradient-free optimization you mean simulated annealing or Monte Carlo ?

Cheers

> 
> 
> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi,
>>
>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>
>> Thanks !
>>
>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>> ------------------------------------------------------------------------
>> Before: All replicas on each hard drive
>> Expected vs actual use (20000 samples)
>>  disk 0: 1.39e-01 1.12e-01
>>  disk 1: 1.11e-01 1.10e-01
>>  disk 2: 8.33e-02 1.13e-01
>>  disk 3: 1.39e-01 1.11e-01
>>  disk 4: 1.11e-01 1.11e-01
>>  disk 5: 8.33e-02 1.11e-01
>>  disk 6: 1.39e-01 1.12e-01
>>  disk 7: 1.11e-01 1.12e-01
>>  disk 8: 8.33e-02 1.10e-01
>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>> ...
>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>> Converged to desired accuracy :)
>> After: All replicas on each hard drive
>> Expected vs actual use (20000 samples)
>>  disk 0: 1.39e-01 1.42e-01
>>  disk 1: 1.11e-01 1.09e-01
>>  disk 2: 8.33e-02 8.37e-02
>>  disk 3: 1.39e-01 1.40e-01
>>  disk 4: 1.11e-01 1.13e-01
>>  disk 5: 8.33e-02 8.08e-02
>>  disk 6: 1.39e-01 1.38e-01
>>  disk 7: 1.11e-01 1.09e-01
>>  disk 8: 8.33e-02 8.48e-02
>>
>>
>> Simulation: R=2 devices capacity [10 10 10 10  1]
>> ------------------------------------------------------------------------
>> Before: All replicas on each hard drive
>> Expected vs actual use (20000 samples)
>>  disk 0: 2.44e-01 2.36e-01
>>  disk 1: 2.44e-01 2.38e-01
>>  disk 2: 2.44e-01 2.34e-01
>>  disk 3: 2.44e-01 2.38e-01
>>  disk 4: 2.44e-02 5.37e-02
>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>> ...
>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>> Converged to desired accuracy :)
>> After: All replicas on each hard drive
>> Expected vs actual use (20000 samples)
>>  disk 0: 2.44e-01 2.46e-01
>>  disk 1: 2.44e-01 2.44e-01
>>  disk 2: 2.44e-01 2.41e-01
>>  disk 3: 2.44e-01 2.45e-01
>>  disk 4: 2.44e-02 2.33e-02
>>
>>
>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>
>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> Hi Pedro,
>>>
>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>
>>> Cheers
>>>
>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>> Hi Pedro,
>>>>
>>>> I'm going to experiment with what you did at
>>>>
>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>
>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>
>>>> Cheers
>>>>
>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>> Great, thanks for the clarifications.
>>>>> I also think that the most natural way is to keep just a set of
>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>
>>>>> I keep working on it.
>>>>>
>>>>>
>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>> haven't made much headway.
>>>>>>
>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>> work on.
>>>>>>>
>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>> crush map, but then the information about the previous selection
>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>> selecting the same one again.
>>>>>>
>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>> round/replica/rank.
>>>>>>
>>>>>> One nice property of this is that once the support is added to encode
>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>> change and evolve independently.  (In most cases any change is
>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>> parties participating in the cluster have to support any new behavior
>>>>>> before it is enabled or used.)
>>>>>>
>>>>>>> I have a question also. Is there any significant difference between
>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>> implementation?
>>>>>>
>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>> retry_descent.
>>>>>>
>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>> solution we come with here (or at least any solution can be suitably
>>>>>> adapted)!
>>>>>>
>>>>>> sage
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-23 14:13                                     ` Loic Dachary
@ 2017-03-23 15:32                                       ` Pedro López-Adeva
  2017-03-23 16:18                                         ` Loic Dachary
                                                           ` (3 more replies)
  0 siblings, 4 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-23 15:32 UTC (permalink / raw)
  To: Loic Dachary; +Cc: ceph-devel

There are a lot of gradient-free methods. I will try first to run the
ones available using just scipy
(https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
Some of them don't require the gradient and some of them can estimate
it. The reason to go without the gradient is to run the CRUSH
algorithm as a black box. In that case this would be the pseudo-code:

- BEGIN CODE -
import scipy.optimize

def build_target(desired_freqs):
    def target(weights):
        # run a simulation of CRUSH for a number of objects
        sim_freqs = run_crush(weights)
        # Kullback-Leibler divergence between desired frequencies and
        # current ones
        return loss(sim_freqs, desired_freqs)
    return target

# x0 is the starting point of the search, e.g. the current crush weights
result = scipy.optimize.minimize(build_target(desired_freqs), x0)
weights = result.x
- END CODE -

The tricky thing here is that this procedure can be slow if the
simulation (run_crush) needs to place a lot of objects to get accurate
simulated frequencies. This is especially true if the minimize method
attempts to approximate the gradient using finite differences, since it
will evaluate the target function a number of times proportional to
the number of weights. Apart from the ones in scipy I would also try
optimization methods that aim to perform as few evaluations as
possible, for example HyperOpt
(http://hyperopt.github.io/hyperopt/), which by the way takes into
account that the target function can be noisy.
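
For example (just a sketch that reuses build_target and the
placeholders from the pseudo-code above; x0 would again be the current
weight vector), a gradient-free run and a finite-difference run only
differ in the method argument:

- BEGIN CODE -
from scipy.optimize import minimize

target = build_target(desired_freqs)

# never needs a gradient
res_free = minimize(target, x0, method='Nelder-Mead')

# gradient estimated by finite differences: roughly len(x0) extra
# evaluations of target per iteration
res_fd = minimize(target, x0, method='L-BFGS-B')

print(res_free.x, res_free.fun)
- END CODE -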

This black box approximation is simple to implement and makes the
computer do all the work instead of us.
I think it is worth trying even if it's not the final approach,
because if this approximation works then we know that a more
elaborate one that computes the gradient of the CRUSH algorithm will
work for sure.

I can try this black box approximation this weekend, not on the real
CRUSH algorithm but with the simple implementation I did in Python. If
it works it's just a matter of substituting one simulation for
another and seeing what happens.

2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>> Hi Loic,
>>
> >> From what I see everything seems OK.
>
> Cool. I'll keep going in this direction then !
>
>> The interesting thing would be to
>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>> is right now modeling just a single straw bucket not the full CRUSH
>> algorithm.
>
> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>
>> That's the work that remains to be done. The only way that
>> would avoid reimplementing the CRUSH algorithm and computing the
>> gradient would be treating CRUSH as a black box and eliminating the
>> necessity of computing the gradient either by using a gradient-free
>> optimization method or making an estimation of the gradient.
>
> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>
> Cheers
>
>>
>>
>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> Hi,
>>>
>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>
>>> Thanks !
>>>
>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>> ------------------------------------------------------------------------
>>> Before: All replicas on each hard drive
>>> Expected vs actual use (20000 samples)
>>>  disk 0: 1.39e-01 1.12e-01
>>>  disk 1: 1.11e-01 1.10e-01
>>>  disk 2: 8.33e-02 1.13e-01
>>>  disk 3: 1.39e-01 1.11e-01
>>>  disk 4: 1.11e-01 1.11e-01
>>>  disk 5: 8.33e-02 1.11e-01
>>>  disk 6: 1.39e-01 1.12e-01
>>>  disk 7: 1.11e-01 1.12e-01
>>>  disk 8: 8.33e-02 1.10e-01
>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>> ...
>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>> Converged to desired accuracy :)
>>> After: All replicas on each hard drive
>>> Expected vs actual use (20000 samples)
>>>  disk 0: 1.39e-01 1.42e-01
>>>  disk 1: 1.11e-01 1.09e-01
>>>  disk 2: 8.33e-02 8.37e-02
>>>  disk 3: 1.39e-01 1.40e-01
>>>  disk 4: 1.11e-01 1.13e-01
>>>  disk 5: 8.33e-02 8.08e-02
>>>  disk 6: 1.39e-01 1.38e-01
>>>  disk 7: 1.11e-01 1.09e-01
>>>  disk 8: 8.33e-02 8.48e-02
>>>
>>>
>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>> ------------------------------------------------------------------------
>>> Before: All replicas on each hard drive
>>> Expected vs actual use (20000 samples)
>>>  disk 0: 2.44e-01 2.36e-01
>>>  disk 1: 2.44e-01 2.38e-01
>>>  disk 2: 2.44e-01 2.34e-01
>>>  disk 3: 2.44e-01 2.38e-01
>>>  disk 4: 2.44e-02 5.37e-02
>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>> ...
>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>> Converged to desired accuracy :)
>>> After: All replicas on each hard drive
>>> Expected vs actual use (20000 samples)
>>>  disk 0: 2.44e-01 2.46e-01
>>>  disk 1: 2.44e-01 2.44e-01
>>>  disk 2: 2.44e-01 2.41e-01
>>>  disk 3: 2.44e-01 2.45e-01
>>>  disk 4: 2.44e-02 2.33e-02
>>>
>>>
>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>
>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>> Hi Pedro,
>>>>
>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>
>>>> Cheers
>>>>
>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>> Hi Pedro,
>>>>>
>>>>> I'm going to experiment with what you did at
>>>>>
>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>
>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>> Great, thanks for the clarifications.
>>>>>> I also think that the most natural way is to keep just a set of
>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>
>>>>>> I keep working on it.
>>>>>>
>>>>>>
>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>> haven't made much headway.
>>>>>>>
>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>> work on.
>>>>>>>>
>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>> selecting the same one again.
>>>>>>>
>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>> round/replica/rank.
>>>>>>>
>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>> change and evolve independently.  (In most cases any change is
>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>> before it is enabled or used.)
>>>>>>>
>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>> implementation?
>>>>>>>
>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>> retry_descent.
>>>>>>>
>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>> solution we come with here (or at least any solution can be suitably
>>>>>>> adapted)!
>>>>>>>
>>>>>>> sage
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-23 15:32                                       ` Pedro López-Adeva
@ 2017-03-23 16:18                                         ` Loic Dachary
  2017-03-25 18:42                                         ` Sage Weil
                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-23 16:18 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel



On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
> 
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies and
> current ones
>         return loss(sim_freqs, desired_freqs)
>    return target
> 
> weights = scipy.optimize.minimize(build_target(desired_freqs))
> - END CODE -
> 
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is true specially if the minimize method
> attempts to approximate the gradient using finite differences since it
> will evaluate the target function a number of times proportional to
> the number of weights). Apart from the ones in scipy I would try also
> optimization methods that try to perform as few evaluations as
> possible like for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
> 
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worthy to try even if
> it's not the final one because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
> 
> I can try this black box approximation this weekend not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation with
> another and see what happens.

Great! And I'll do whatever is needed to adapt what you did to use crush.

Cheers

> 
> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> Hi Loic,
>>>
>>> From what I see everything seems OK.
>>
>> Cool. I'll keep going in this direction then !
>>
>>> The interesting thing would be to
>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>> is right now modeling just a single straw bucket not the full CRUSH
>>> algorithm.
>>
>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>
>>> That's the work that remains to be done. The only way that
>>> would avoid reimplementing the CRUSH algorithm and computing the
>>> gradient would be treating CRUSH as a black box and eliminating the
>>> necessity of computing the gradient either by using a gradient-free
>>> optimization method or making an estimation of the gradient.
>>
>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>
>> Cheers
>>
>>>
>>>
>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi,
>>>>
>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>
>>>> Thanks !
>>>>
>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 1.39e-01 1.12e-01
>>>>  disk 1: 1.11e-01 1.10e-01
>>>>  disk 2: 8.33e-02 1.13e-01
>>>>  disk 3: 1.39e-01 1.11e-01
>>>>  disk 4: 1.11e-01 1.11e-01
>>>>  disk 5: 8.33e-02 1.11e-01
>>>>  disk 6: 1.39e-01 1.12e-01
>>>>  disk 7: 1.11e-01 1.12e-01
>>>>  disk 8: 8.33e-02 1.10e-01
>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>> ...
>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 1.39e-01 1.42e-01
>>>>  disk 1: 1.11e-01 1.09e-01
>>>>  disk 2: 8.33e-02 8.37e-02
>>>>  disk 3: 1.39e-01 1.40e-01
>>>>  disk 4: 1.11e-01 1.13e-01
>>>>  disk 5: 8.33e-02 8.08e-02
>>>>  disk 6: 1.39e-01 1.38e-01
>>>>  disk 7: 1.11e-01 1.09e-01
>>>>  disk 8: 8.33e-02 8.48e-02
>>>>
>>>>
>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 2.44e-01 2.36e-01
>>>>  disk 1: 2.44e-01 2.38e-01
>>>>  disk 2: 2.44e-01 2.34e-01
>>>>  disk 3: 2.44e-01 2.38e-01
>>>>  disk 4: 2.44e-02 5.37e-02
>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>> ...
>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 2.44e-01 2.46e-01
>>>>  disk 1: 2.44e-01 2.44e-01
>>>>  disk 2: 2.44e-01 2.41e-01
>>>>  disk 3: 2.44e-01 2.45e-01
>>>>  disk 4: 2.44e-02 2.33e-02
>>>>
>>>>
>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>
>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>> Hi Pedro,
>>>>>
>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I'm going to experiment with what you did at
>>>>>>
>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>
>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>> Great, thanks for the clarifications.
>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>
>>>>>>> I keep working on it.
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>> haven't made much headway.
>>>>>>>>
>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>> work on.
>>>>>>>>>
>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>> selecting the same one again.
>>>>>>>>
>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>> round/replica/rank.
>>>>>>>>
>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>> change and evolve independently.  (In most cases any change is
>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>> before it is enabled or used.)
>>>>>>>>
>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>> implementation?
>>>>>>>>
>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>> retry_descent.
>>>>>>>>
>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>> solution we come with here (or at least any solution can be suitably
>>>>>>>> adapted)!
>>>>>>>>
>>>>>>>> sage
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-23 15:32                                       ` Pedro López-Adeva
  2017-03-23 16:18                                         ` Loic Dachary
@ 2017-03-25 18:42                                         ` Sage Weil
       [not found]                                           ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
  2017-04-11 15:22                                         ` Loic Dachary
  2017-04-22 16:51                                         ` Loic Dachary
  3 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-25 18:42 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Loic Dachary, ceph-devel

Hi Pedro, Loic,

For what it's worth, my intuition here (which has had a mixed record as 
far as CRUSH goes) is that this is the most promising path forward.

Thinking ahead a few steps, and confirming that I'm following the
discussion so far, if you're able to get black (or white) box gradient
descent to work, then this will give us a set of weights for each item in
the tree for each selection round, derived from the tree structure and
original (target) weights.  That would basically give us a map of item id
(bucket id or leaf item id) to weight for each round.  i.e.,

 map<int, map<int, float>> weight_by_position;  // position -> item -> weight

where the 0 round would (I think?) match the target weights, and each 
round after that would skew low-weighted items lower to some degree.  
Right?
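
To make that concrete, here is a rough sketch (Python with invented
names, and straw2 reduced to a plain floating-point draw) of how such a
per-position weight table might be consulted during selection:

- BEGIN CODE -
import math, random

def straw2_draw(weights):
    # simplified straw2: straw = ln(u) / w with u in (0, 1]; the largest
    # value wins, which picks each item with probability proportional
    # to its weight
    best_item, best_straw = None, None
    for item, w in weights.items():
        if w <= 0:
            continue
        u = 1.0 - random.random()          # uniform in (0, 1]
        straw = math.log(u) / w
        if best_straw is None or straw > best_straw:
            best_item, best_straw = item, straw
    return best_item

def choose_replicas(items, weight_by_position, num_rep):
    # weight_by_position[r][item] mirrors the map<int, map<int, float>>
    # above: one weight per item for each selection round r
    result = []
    for r in range(num_rep):
        candidates = {i: weight_by_position[r][i]
                      for i in items if i not in result}
        result.append(straw2_draw(candidates))
    return result
- END CODE -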

The next question I have is: does this generalize from the single-bucket 
case to the hierarchy?  I.e., if I have a "tree" (single bucket) like

3.1
 |_____________
 |   \    \    \
1.0  1.0  1.0  .1

it clearly works, but when we have a multi-level tree like


8.4
 |____________________________________
 |                 \                  \
3.1                3.1                2.2
 |_____________     |_____________     |_____________
 |   \    \    \    |   \    \    \    |   \    \    \
1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1   

and the second round weights skew the small .1 leaves lower, can we
continue to build the summed-weight hierarchy, such that the weights at
the higher levels are adjusted appropriately to give us the right
probabilities of descending into those subtrees?  I'm not sure if that
logically follows from the above or if my intuition is oversimplifying
things.
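
Concretely, the construction I have in mind is something like the
sketch below (illustrative only; whether it actually yields the right
descent probabilities is exactly the open question):

- BEGIN CODE -
# A bucket's weight at position r is the sum of its children's weights
# at position r, recursing down to the per-position leaf weights.
# e.g. tree = {'root': ['rack1', 'rack2'], 'rack1': ['osd0', 'osd1'], ...}
def weight_at(node, tree, leaf_weight_by_position, r):
    if node not in tree:                      # leaf device
        return leaf_weight_by_position[r][node]
    return sum(weight_at(child, tree, leaf_weight_by_position, r)
               for child in tree[node])
- END CODE -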

If this *is* how we think this will shake out, then I'm wondering if we
should go ahead and build this weight matrix into CRUSH sooner rather
than later (i.e., for luminous).  As with the explicit remappings, the
hard part is all done offline, and the adjustments to the CRUSH mapping
calculation itself (storing and making use of the adjusted weights for
each round of placement) are relatively straightforward.  And the sooner
this is incorporated into a release, the sooner real users will be able to
roll out code to all clients and start making use of it.

Thanks again for looking at this problem!  I'm excited that we may be 
closing in on a real solution!

sage





On Thu, 23 Mar 2017, Pedro López-Adeva wrote:

> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
> 
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies and
> current ones
>         return loss(sim_freqs, desired_freqs)
>    return target
> 
> weights = scipy.optimize.minimize(build_target(desired_freqs))
> - END CODE -
> 
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is true specially if the minimize method
> attempts to approximate the gradient using finite differences since it
> will evaluate the target function a number of times proportional to
> the number of weights). Apart from the ones in scipy I would try also
> optimization methods that try to perform as few evaluations as
> possible like for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
> 
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worthy to try even if
> it's not the final one because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
> 
> I can try this black box approximation this weekend not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation with
> another and see what happens.
> 
> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> > Hi Pedro,
> >
> > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> Hi Loic,
> >>
> >> From what I see everything seems OK.
> >
> > Cool. I'll keep going in this direction then !
> >
> >> The interesting thing would be to
> >> test on some complex mapping. The reason is that "CrushPolicyFamily"
> >> is right now modeling just a single straw bucket not the full CRUSH
> >> algorithm.
> >
> > A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
> >
> >> That's the work that remains to be done. The only way that
> >> would avoid reimplementing the CRUSH algorithm and computing the
> >> gradient would be treating CRUSH as a black box and eliminating the
> >> necessity of computing the gradient either by using a gradient-free
> >> optimization method or making an estimation of the gradient.
> >
> > By gradient-free optimization you mean simulated annealing or Monte Carlo ?
> >
> > Cheers
> >
> >>
> >>
> >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >>> Hi,
> >>>
> >>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
> >>>
> >>> Thanks !
> >>>
> >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
> >>> ------------------------------------------------------------------------
> >>> Before: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>>  disk 0: 1.39e-01 1.12e-01
> >>>  disk 1: 1.11e-01 1.10e-01
> >>>  disk 2: 8.33e-02 1.13e-01
> >>>  disk 3: 1.39e-01 1.11e-01
> >>>  disk 4: 1.11e-01 1.11e-01
> >>>  disk 5: 8.33e-02 1.11e-01
> >>>  disk 6: 1.39e-01 1.12e-01
> >>>  disk 7: 1.11e-01 1.12e-01
> >>>  disk 8: 8.33e-02 1.10e-01
> >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
> >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
> >>> ...
> >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
> >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
> >>> Converged to desired accuracy :)
> >>> After: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>>  disk 0: 1.39e-01 1.42e-01
> >>>  disk 1: 1.11e-01 1.09e-01
> >>>  disk 2: 8.33e-02 8.37e-02
> >>>  disk 3: 1.39e-01 1.40e-01
> >>>  disk 4: 1.11e-01 1.13e-01
> >>>  disk 5: 8.33e-02 8.08e-02
> >>>  disk 6: 1.39e-01 1.38e-01
> >>>  disk 7: 1.11e-01 1.09e-01
> >>>  disk 8: 8.33e-02 8.48e-02
> >>>
> >>>
> >>> Simulation: R=2 devices capacity [10 10 10 10  1]
> >>> ------------------------------------------------------------------------
> >>> Before: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>>  disk 0: 2.44e-01 2.36e-01
> >>>  disk 1: 2.44e-01 2.38e-01
> >>>  disk 2: 2.44e-01 2.34e-01
> >>>  disk 3: 2.44e-01 2.38e-01
> >>>  disk 4: 2.44e-02 5.37e-02
> >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
> >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
> >>> ...
> >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
> >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
> >>> Converged to desired accuracy :)
> >>> After: All replicas on each hard drive
> >>> Expected vs actual use (20000 samples)
> >>>  disk 0: 2.44e-01 2.46e-01
> >>>  disk 1: 2.44e-01 2.44e-01
> >>>  disk 2: 2.44e-01 2.41e-01
> >>>  disk 3: 2.44e-01 2.45e-01
> >>>  disk 4: 2.44e-02 2.33e-02
> >>>
> >>>
> >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >>>
> >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >>>> Hi Pedro,
> >>>>
> >>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
> >>>>
> >>>> Cheers
> >>>>
> >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >>>>> Hi Pedro,
> >>>>>
> >>>>> I'm going to experiment with what you did at
> >>>>>
> >>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>>>>
> >>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >>>>>> Great, thanks for the clarifications.
> >>>>>> I also think that the most natural way is to keep just a set of
> >>>>>> weights in the CRUSH map and update them inside the algorithm.
> >>>>>>
> >>>>>> I keep working on it.
> >>>>>>
> >>>>>>
> >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
> >>>>>>> Hi Pedro,
> >>>>>>>
> >>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
> >>>>>>> haven't made much headway.
> >>>>>>>
> >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I will have a look. BTW, I have not progressed that much but I have
> >>>>>>>> been thinking about it. In order to adapt the previous algorithm in
> >>>>>>>> the python notebook I need to substitute the iteration over all
> >>>>>>>> possible devices permutations to iteration over all the possible
> >>>>>>>> selections that crush would make. That is the main thing I need to
> >>>>>>>> work on.
> >>>>>>>>
> >>>>>>>> The other thing is of course that weights change for each replica.
> >>>>>>>> That is, they cannot be really fixed in the crush map. So the
> >>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
> >>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
> >>>>>>>> desired usage frequencies. Or maybe each replica should have their own
> >>>>>>>> crush map, but then the information about the previous selection
> >>>>>>>> should be passed to the next replica placement run so it avoids
> >>>>>>>> selecting the same one again.
> >>>>>>>
> >>>>>>> My suspicion is that the best solution here (whatever that means!)
> >>>>>>> leaves the CRUSH weights intact with the desired distribution, and
> >>>>>>> then generates a set of derivative weights--probably one set for each
> >>>>>>> round/replica/rank.
> >>>>>>>
> >>>>>>> One nice property of this is that once the support is added to encode
> >>>>>>> multiple sets of weights, the algorithm used to generate them is free to
> >>>>>>> change and evolve independently.  (In most cases any change is
> >>>>>>> CRUSH's mapping behavior is difficult to roll out because all
> >>>>>>> parties participating in the cluster have to support any new behavior
> >>>>>>> before it is enabled or used.)
> >>>>>>>
> >>>>>>>> I have a question also. Is there any significant difference between
> >>>>>>>> the device selection algorithm description in the paper and its final
> >>>>>>>> implementation?
> >>>>>>>
> >>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
> >>>>>>> idea; any collision or failed()/overload() case triggers the
> >>>>>>> retry_descent.
> >>>>>>>
> >>>>>>> There are other changes, of course, but I don't think they'll impact any
> >>>>>>> solution we come with here (or at least any solution can be suitably
> >>>>>>> adapted)!
> >>>>>>>
> >>>>>>> sage
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>>>> the body of a message to majordomo@vger.kernel.org
> >>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>> --
> >>> Loïc Dachary, Artisan Logiciel Libre
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
       [not found]                                           ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
@ 2017-03-27  2:33                                             ` Sage Weil
  2017-03-27  6:45                                               ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-27  2:33 UTC (permalink / raw)
  To: Adam Kupczyk; +Cc: Pedro López-Adeva, Loic Dachary, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 20718 bytes --]

On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> Hello Sage, Loic, Pedro,
> 
> 
> I am certain that almost perfect mapping can be achieved by
> substituting weights from crush map with slightly modified weights.
> By perfect mapping I mean we get on each OSD number of PGs exactly
> proportional to weights specified in crush map.
> 
> 1. Example
> Lets think of PGs of single object pool.
> We have OSDs with following weights:
> [10, 10, 10, 5, 5]
> 
> Ideally, we would like following distribution of 200PG x 3 copies = 600
> PGcopies :
> [150, 150, 150, 75, 75]
> 
> However, because crush simulates random process we have:
> [143, 152, 158, 71, 76]
> 
> We could have obtained perfect distribution had we used weights like this:
> [10.2, 9.9, 9.6, 5.2, 4.9]
> 
> 
> 2. Obtaining perfect mapping weights from OSD capacity weights
> 
> When we apply crush for the first time, distribution of PGs comes as random.
> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> 
> But CRUSH is not random proces at all, it behaves in numerically stable way.
> Specifically, if we increase weight on one node, we will get more PGs on
> this node and less on every other node:
> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> 
> Now, finding ideal weights can be done by any numerical minimization method,
> for example NLMS.
> 
> 
> 3. The proposal
> For each pool, from initial weights given in crush map perfect weights will
> be derived.
> This weights will be used to calculate PG distribution. This of course will
> be close to perfect.
> 
> 3a: Downside when OSD is out
> When an OSD is out, missing PG copies will be replicated elsewhere.
> Because now weights deviate from OSD capacity, some OSDs will statistically
> get more copies then they should.
> This unevenness in distribution is proportional to scale of deviation of
> calculated weights to capacity weights.
> 
> 3b: Upside
> This all can be achieved without changes to crush.

Yes!

And no.  You're totally right--we should use an offline optimization to 
tweak the crush input weights to get a better balance.  It won't be robust 
to changes to the cluster, but we can incrementally optimize after that 
happens to converge on something better.
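
As a sketch of what that offline loop could look like -- count_pgs() here
is a hypothetical stand-in for however we count the PG copies CRUSH
assigns for a candidate set of weights:

- BEGIN CODE -
import numpy as np

def tweak_weights(capacity_weights, count_pgs, iterations=20):
    # capacity_weights: the target (capacity) weights from the crush map
    # count_pgs(weights): hypothetical helper returning per-OSD PG copy counts
    target = np.asarray(capacity_weights, dtype=float)
    share = target / target.sum()          # desired fraction of PG copies
    weights = target.copy()
    for _ in range(iterations):
        counts = np.asarray(count_pgs(weights), dtype=float)
        observed = counts / counts.sum()
        # dampened multiplicative step: raise weights where we got too few
        # PGs, lower them where we got too many
        weights *= (share / np.maximum(observed, 1e-9)) ** 0.5
    return weights
- END CODE -

Fed the [10, 10, 10, 5, 5] example above, the intent is to converge on
something like the [10.2, 9.9, 9.6, 5.2, 4.9] you give, and re-running it
after the cluster changes gives the incremental re-optimization.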

The problem with doing this with current versions of Ceph is that we lose 
the original "input" or "target" weights (i.e., the actual size of 
the OSD) that we want to converge on.  This is one reason why we haven't 
done something like this before.

In luminous we *could* work around this by storing those canonical 
weights outside of crush (probably somewhere ugly) and maintaining 
backward compatibility with older clients using the existing CRUSH 
behavior.

OR, (and this is my preferred route), if the multi-pick anomaly approach 
that Pedro is working on works out, we'll want to extend the CRUSH map to 
include a set of derivative weights used for actual placement calculations 
instead of the canonical target weights, and we can do what you're 
proposing *and* solve the multipick problem with one change in the crush 
map and algorithm.  (Actually choosing those derivative weights will 
be an offline process that can both improve the balance for the inputs we 
care about *and* adjust them based on the position to fix the skew issue 
for replicas.)  This doesn't help pre-luminous clients, but I think the 
end solution will be simpler and more elegant...
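
To make the shape of that concrete, here is a toy sketch (nothing like the
actual libcrush code, and the position-1 numbers are made up) of a
straw-style choice that consults a per-position weight table when one
exists and falls back to the target weights otherwise:

- BEGIN CODE -
import hashlib

def straw_choose(candidates, weights, pg, position):
    # weighted max of u**(1/w): heavier items win proportionally more often
    best, best_straw = None, -1.0
    for item in candidates:
        h = hashlib.sha256(f"{pg}:{position}:{item}".encode()).digest()
        u = (int.from_bytes(h[:8], "big") + 1) / 2**64   # pseudo-uniform in (0, 1]
        straw = u ** (1.0 / weights[item])
        if straw > best_straw:
            best, best_straw = item, straw
    return best

# position -> item -> weight; position 0 carries the target weights and the
# later positions carry the derived (skewed) weights
weight_by_position = {
    0: {0: 10, 1: 10, 2: 10, 3: 5, 4: 5},
    1: {0: 10, 1: 10, 2: 10, 3: 4.6, 4: 4.6},
}

def place(pg, num_rep, target_weights):
    out = []
    for pos in range(num_rep):
        w = weight_by_position.get(pos, target_weights)
        remaining = [i for i in w if i not in out]   # skip items already chosen
        out.append(straw_choose(remaining, w, pg, pos))
    return out

print(place(pg=42, num_rep=2, target_weights=weight_by_position[0]))
- END CODE -

The encoding of the table in the crush map and the collision/retry
handling are the real work; the point is only that the per-round lookup is
cheap and the hard part stays offline.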

What do you think?

sage


> 4. Extra
> Some time ago I made such change to perfectly balance Thomson-Reuters
> cluster.
> It succeeded.
> A solution was not accepted, because modification of OSD weights were higher
> then 50%, which was caused by fact that different placement rules operated
> on different sets of OSDs, and those sets were not disjointed.


> 
> Best regards,
> Adam
> 
> 
> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>       Hi Pedro, Loic,
> 
>       For what it's worth, my intuition here (which has had a mixed
>       record as
>       far as CRUSH goes) is that this is the most promising path
>       forward.
> 
>       Thinking ahead a few steps, and confirming that I'm following
>       the
>       discussion so far, if you're able to do get black (or white) box
>       gradient
>       descent to work, then this will give us a set of weights for
>       each item in
>       the tree for each selection round, derived from the tree
>       structure and
>       original (target) weights.  That would basically give us a map
>       of item id
>       (bucket id or leave item id) to weight for each round.  i.e.,
> 
>        map<int, map<int, float>> weight_by_position;  // position ->
>       item -> weight
> 
>       where the 0 round would (I think?) match the target weights, and
>       each
>       round after that would skew low-weighted items lower to some
>       degree.
>       Right?
> 
>       The next question I have is: does this generalize from the
>       single-bucket
>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>       like
> 
>       3.1
>        |_____________
>        |   \    \    \
>       1.0  1.0  1.0  .1
> 
>       it clearly works, but when we have a multi-level tree like
> 
> 
>       8.4
>        |____________________________________
>        |                 \                  \
>       3.1                3.1                2.2
>        |_____________     |_____________     |_____________
>        |   \    \    \    |   \    \    \    |   \    \    \
>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
> 
>       and the second round weights skew the small .1 leaves lower, can
>       we
>       continue to build the summed-weight hierarchy, such that the
>       adjusted
>       weights at the higher level are appropriately adjusted to give
>       us the
>       right probabilities of descending into those trees?  I'm not
>       sure if that
>       logically follows from the above or if my intuition is
>       oversimplifying
>       things.
> 
>       If this *is* how we think this will shake out, then I'm
>       wondering if we
>       should go ahead and build this weigh matrix into CRUSH sooner
>       rather
>       than later (i.e., for luminous).  As with the explicit
>       remappings, the
>       hard part is all done offline, and the adjustments to the CRUSH
>       mapping
>       calculation itself (storing and making use of the adjusted
>       weights for
>       each round of placement) are relatively straightforward.  And
>       the sooner
>       this is incorporated into a release the sooner real users will
>       be able to
>       roll out code to all clients and start making us of it.
> 
>       Thanks again for looking at this problem!  I'm excited that we
>       may be
>       closing in on a real solution!
> 
>       sage
> 
> 
> 
> 
> 
>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> 
>       > There are lot of gradient-free methods. I will try first to
>       run the
>       > ones available using just scipy
>       >
>       (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>       > Some of them don't require the gradient and some of them can
>       estimate
>       > it. The reason to go without the gradient is to run the CRUSH
>       > algorithm as a black box. In that case this would be the
>       pseudo-code:
>       >
>       > - BEGIN CODE -
>       > def build_target(desired_freqs):
>       >     def target(weights):
>       >         # run a simulation of CRUSH for a number of objects
>       >         sim_freqs = run_crush(weights)
>       >         # Kullback-Leibler divergence between desired
>       frequencies and
>       > current ones
>       >         return loss(sim_freqs, desired_freqs)
>       >    return target
>       >
>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>       > - END CODE -
>       >
>       > The tricky thing here is that this procedure can be slow if
>       the
>       > simulation (run_crush) needs to place a lot of objects to get
>       accurate
>       > simulated frequencies. This is true specially if the minimize
>       method
>       > attempts to approximate the gradient using finite differences
>       since it
>       > will evaluate the target function a number of times
>       proportional to
>       > the number of weights). Apart from the ones in scipy I would
>       try also
>       > optimization methods that try to perform as few evaluations as
>       > possible like for example HyperOpt
>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>       into
>       > account that the target function can be noisy.
>       >
>       > This black box approximation is simple to implement and makes
>       the
>       > computer do all the work instead of us.
>       > I think that this black box approximation is worthy to try
>       even if
>       > it's not the final one because if this approximation works
>       then we
>       > know that a more elaborate one that computes the gradient of
>       the CRUSH
>       > algorithm will work for sure.
>       >
>       > I can try this black box approximation this weekend not on the
>       real
>       > CRUSH algorithm but with the simple implementation I did in
>       python. If
>       > it works it's just a matter of substituting one simulation
>       with
>       > another and see what happens.
>       >
>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>       > > Hi Pedro,
>       > >
>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>       > >> Hi Loic,
>       > >>
>       > >> From what I see everything seems OK.
>       > >
>       > > Cool. I'll keep going in this direction then !
>       > >
>       > >> The interesting thing would be to
>       > >> test on some complex mapping. The reason is that
>       "CrushPolicyFamily"
>       > >> is right now modeling just a single straw bucket not the
>       full CRUSH
>       > >> algorithm.
>       > >
>       > > A number of use cases use a single straw bucket, maybe the
>       majority of them. Even though it does not reflect the full range
>       of what crush can offer, it could be useful. To be more
>       specific, a crush map that states "place objects so that there
>       is at most one replica per host" or "one replica per rack" is
>       common. Such a crushmap can be reduced to a single straw bucket
>       that contains all the hosts and by using the CrushPolicyFamily,
>       we can change the weights of each host to fix the probabilities.
>       The hosts themselves contain disks with varying weights but I
>       think we can ignore that because crush will only recurse to
>       place one object within a given host.
>       > >
>       > >> That's the work that remains to be done. The only way that
>       > >> would avoid reimplementing the CRUSH algorithm and
>       computing the
>       > >> gradient would be treating CRUSH as a black box and
>       eliminating the
>       > >> necessity of computing the gradient either by using a
>       gradient-free
>       > >> optimization method or making an estimation of the
>       gradient.
>       > >
>       > > By gradient-free optimization you mean simulated annealing
>       or Monte Carlo ?
>       > >
>       > > Cheers
>       > >
>       > >>
>       > >>
>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>       > >>> Hi,
>       > >>>
>       > >>> I modified the crush library to accept two weights (one
>       for the first disk, the other for the remaining disks)[1]. This
>       really is a hack for experimentation purposes only ;-) I was
>       able to run a variation of your code[2] and got the following
>       results which are encouraging. Do you think what I did is
>       sensible ? Or is there a problem I don't see ?
>       > >>>
>       > >>> Thanks !
>       > >>>
>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8 
>       6]
>       > >>>
>       ------------------------------------------------------------------------
>       > >>> Before: All replicas on each hard drive
>       > >>> Expected vs actual use (20000 samples)
>       > >>>  disk 0: 1.39e-01 1.12e-01
>       > >>>  disk 1: 1.11e-01 1.10e-01
>       > >>>  disk 2: 8.33e-02 1.13e-01
>       > >>>  disk 3: 1.39e-01 1.11e-01
>       > >>>  disk 4: 1.11e-01 1.11e-01
>       > >>>  disk 5: 8.33e-02 1.11e-01
>       > >>>  disk 6: 1.39e-01 1.12e-01
>       > >>>  disk 7: 1.11e-01 1.12e-01
>       > >>>  disk 8: 8.33e-02 1.10e-01
>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>       > >>> ...
>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>       > >>> Converged to desired accuracy :)
>       > >>> After: All replicas on each hard drive
>       > >>> Expected vs actual use (20000 samples)
>       > >>>  disk 0: 1.39e-01 1.42e-01
>       > >>>  disk 1: 1.11e-01 1.09e-01
>       > >>>  disk 2: 8.33e-02 8.37e-02
>       > >>>  disk 3: 1.39e-01 1.40e-01
>       > >>>  disk 4: 1.11e-01 1.13e-01
>       > >>>  disk 5: 8.33e-02 8.08e-02
>       > >>>  disk 6: 1.39e-01 1.38e-01
>       > >>>  disk 7: 1.11e-01 1.09e-01
>       > >>>  disk 8: 8.33e-02 8.48e-02
>       > >>>
>       > >>>
>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>       > >>>
>       ------------------------------------------------------------------------
>       > >>> Before: All replicas on each hard drive
>       > >>> Expected vs actual use (20000 samples)
>       > >>>  disk 0: 2.44e-01 2.36e-01
>       > >>>  disk 1: 2.44e-01 2.38e-01
>       > >>>  disk 2: 2.44e-01 2.34e-01
>       > >>>  disk 3: 2.44e-01 2.38e-01
>       > >>>  disk 4: 2.44e-02 5.37e-02
>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>       > >>> ...
>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>       > >>> Converged to desired accuracy :)
>       > >>> After: All replicas on each hard drive
>       > >>> Expected vs actual use (20000 samples)
>       > >>>  disk 0: 2.44e-01 2.46e-01
>       > >>>  disk 1: 2.44e-01 2.44e-01
>       > >>>  disk 2: 2.44e-01 2.41e-01
>       > >>>  disk 3: 2.44e-01 2.45e-01
>       > >>>  disk 4: 2.44e-02 2.33e-02
>       > >>>
>       > >>>
>       > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>       > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>       > >>>
>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>       > >>>> Hi Pedro,
>       > >>>>
>       > >>>> It looks like trying to experiment with crush won't work
>       as expected because crush does not distinguish the probability
>       of selecting the first device from the probability of selecting
>       the second or third device. Am I mistaken ?
>       > >>>>
>       > >>>> Cheers
>       > >>>>
>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>       > >>>>> Hi Pedro,
>       > >>>>>
>       > >>>>> I'm going to experiment with what you did at
>       > >>>>>
>       > >>>>>
>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>       > >>>>>
>       > >>>>> and the latest python-crush published today. A
>       comparison function was added that will help measure the data
>       movement. I'm hoping we can release an offline tool based on
>       your solution. Please let me know if I should wait before diving
>       into this, in case you have unpublished drafts or new ideas.
>       > >>>>>
>       > >>>>> Cheers
>       > >>>>>
>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>       > >>>>>> Great, thanks for the clarifications.
>       > >>>>>> I also think that the most natural way is to keep just
>       a set of
>       > >>>>>> weights in the CRUSH map and update them inside the
>       algorithm.
>       > >>>>>>
>       > >>>>>> I keep working on it.
>       > >>>>>>
>       > >>>>>>
>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>       <sage@newdream.net>:
>       > >>>>>>> Hi Pedro,
>       > >>>>>>>
>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>       problem and we
>       > >>>>>>> haven't made much headway.
>       > >>>>>>>
>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>       > >>>>>>>> Hi,
>       > >>>>>>>>
>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>       much but I have
>       > >>>>>>>> been thinking about it. In order to adapt the
>       previous algorithm in
>       > >>>>>>>> the python notebook I need to substitute the
>       iteration over all
>       > >>>>>>>> possible devices permutations to iteration over all
>       the possible
>       > >>>>>>>> selections that crush would make. That is the main
>       thing I need to
>       > >>>>>>>> work on.
>       > >>>>>>>>
>       > >>>>>>>> The other thing is of course that weights change for
>       each replica.
>       > >>>>>>>> That is, they cannot be really fixed in the crush
>       map. So the
>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>       the map, need to be
>       > >>>>>>>> changed. The weights in the crush map should reflect
>       then, maybe, the
>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>       should have their own
>       > >>>>>>>> crush map, but then the information about the
>       previous selection
>       > >>>>>>>> should be passed to the next replica placement run so
>       it avoids
>       > >>>>>>>> selecting the same one again.
>       > >>>>>>>
>       > >>>>>>> My suspicion is that the best solution here (whatever
>       that means!)
>       > >>>>>>> leaves the CRUSH weights intact with the desired
>       distribution, and
>       > >>>>>>> then generates a set of derivative weights--probably
>       one set for each
>       > >>>>>>> round/replica/rank.
>       > >>>>>>>
>       > >>>>>>> One nice property of this is that once the support is
>       added to encode
>       > >>>>>>> multiple sets of weights, the algorithm used to
>       generate them is free to
>       > >>>>>>> change and evolve independently.  (In most cases any
>       change is
>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>       because all
>       > >>>>>>> parties participating in the cluster have to support
>       any new behavior
>       > >>>>>>> before it is enabled or used.)
>       > >>>>>>>
>       > >>>>>>>> I have a question also. Is there any significant
>       difference between
>       > >>>>>>>> the device selection algorithm description in the
>       paper and its final
>       > >>>>>>>> implementation?
>       > >>>>>>>
>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>       found to be a bad
>       > >>>>>>> idea; any collision or failed()/overload() case
>       triggers the
>       > >>>>>>> retry_descent.
>       > >>>>>>>
>       > >>>>>>> There are other changes, of course, but I don't think
>       they'll impact any
>       > >>>>>>> solution we come with here (or at least any solution
>       can be suitably
>       > >>>>>>> adapted)!
>       > >>>>>>>
>       > >>>>>>> sage
>       > >>>>>> --
>       > >>>>>> To unsubscribe from this list: send the line
>       "unsubscribe ceph-devel" in
>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>       > >>>>>> More majordomo info at 
>       http://vger.kernel.org/majordomo-info.html
>       > >>>>>>
>       > >>>>>
>       > >>>>
>       > >>>
>       > >>> --
>       > >>> Loïc Dachary, Artisan Logiciel Libre
>       > >> --
>       > >> To unsubscribe from this list: send the line "unsubscribe
>       ceph-devel" in
>       > >> the body of a message to majordomo@vger.kernel.org
>       > >> More majordomo info at 
>       http://vger.kernel.org/majordomo-info.html
>       > >>
>       > >
>       > > --
>       > > Loïc Dachary, Artisan Logiciel Libre
>       > --
>       > To unsubscribe from this list: send the line "unsubscribe
>       ceph-devel" in
>       > the body of a message to majordomo@vger.kernel.org
>       > More majordomo info at 
>       http://vger.kernel.org/majordomo-info.html
>       >
>       >
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-27  2:33                                             ` Sage Weil
@ 2017-03-27  6:45                                               ` Loic Dachary
       [not found]                                                 ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
  2017-03-27 13:24                                                 ` Sage Weil
  0 siblings, 2 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-27  6:45 UTC (permalink / raw)
  To: Sage Weil, Adam Kupczyk; +Cc: Ceph Development



On 03/27/2017 04:33 AM, Sage Weil wrote:
> On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>> Hello Sage, Loic, Pedro,
>>
>>
>> I am certain that almost perfect mapping can be achieved by
>> substituting weights from crush map with slightly modified weights.
>> By perfect mapping I mean we get on each OSD number of PGs exactly
>> proportional to weights specified in crush map.
>>
>> 1. Example
>> Lets think of PGs of single object pool.
>> We have OSDs with following weights:
>> [10, 10, 10, 5, 5]
>>
>> Ideally, we would like following distribution of 200PG x 3 copies = 600
>> PGcopies :
>> [150, 150, 150, 75, 75]
>>
>> However, because crush simulates random process we have:
>> [143, 152, 158, 71, 76]
>>
>> We could have obtained perfect distribution had we used weights like this:
>> [10.2, 9.9, 9.6, 5.2, 4.9]
>>
>>
>> 2. Obtaining perfect mapping weights from OSD capacity weights
>>
>> When we apply crush for the first time, distribution of PGs comes as random.
>> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>
>> But CRUSH is not random proces at all, it behaves in numerically stable way.
>> Specifically, if we increase weight on one node, we will get more PGs on
>> this node and less on every other node:
>> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>
>> Now, finding ideal weights can be done by any numerical minimization method,
>> for example NLMS.
>>
>>
>> 3. The proposal
>> For each pool, from initial weights given in crush map perfect weights will
>> be derived.
>> This weights will be used to calculate PG distribution. This of course will
>> be close to perfect.
>>
>> 3a: Downside when OSD is out
>> When an OSD is out, missing PG copies will be replicated elsewhere.
>> Because now weights deviate from OSD capacity, some OSDs will statistically
>> get more copies then they should.
>> This unevenness in distribution is proportional to scale of deviation of
>> calculated weights to capacity weights.
>>
>> 3b: Upside
>> This all can be achieved without changes to crush.
> 
> Yes!
> 
> And no.  You're totally right--we should use an offline optimization to 
> tweak the crush input weights to get a better balance.  It won't be robust 
> to changes to the cluster, but we can incrementally optimize after that 
> happens to converge on something better.
> 
> The problem with doing this with current versions of Ceph is that we lose 
> the original "input" or "target" weights (i.e., the actual size of 
> the OSD) that we want to converge on.  This is one reason why we haven't 
> done something like this before.
> 
> In luminous we *could* work around this by storing those canonical 
> weights outside of crush using something (probably?) ugly and 
> maintain backward compatibility with older clients using existing 
> CRUSH behavior.

These canonical weights could be stored in crush by creating dedicated buckets. For instance, a root-canonical bucket could be created to store the canonical weights of the root bucket. The sysadmin needs to be aware of the difference, know to add a new device to the host01-canonical bucket instead of the host01 bucket, and run an offline tool to keep the two buckets in sync and compute the placement weights derived from the weights representing the device capacity.

It is a little bit ugly ;-)
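
A sketch of the sync step that offline tool would do, assuming the
decompiled buckets are available as plain {item: weight} dicts and
derive() is whatever optimization we end up using:

- BEGIN CODE -
def sync_canonical_buckets(buckets, derive):
    # buckets: name -> {item: weight}; '<name>-canonical' holds the device
    # capacities, '<name>' holds the weights actually used for placement
    for name, canonical in list(buckets.items()):
        if not name.endswith('-canonical'):
            continue
        placement = name[:-len('-canonical')]
        # recompute the placement weights from the capacity weights
        buckets[placement] = derive(canonical)
    return buckets
- END CODE -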

> OR, (and this is my preferred route), if the multi-pick anomaly approach 
> that Pedro is working on works out, we'll want to extend the CRUSH map to 
> include a set of derivative weights used for actual placement calculations 
> instead of the canonical target weights, and we can do what you're 
> proposing *and* solve the multipick problem with one change in the crush 
> map and algorithm.  (Actually choosing those derivative weights will 
> be an offline process that can both improve the balance for the inputs we 
> care about *and* adjust them based on the position to fix the skew issue 
> for replicas.)  This doesn't help pre-luminous clients, but I think the 
> end solution will be simpler and more elegant...
> 
> What do you think?
> 
> sage
> 
> 
>> 4. Extra
>> Some time ago I made such change to perfectly balance Thomson-Reuters
>> cluster.
>> It succeeded.
>> A solution was not accepted, because modification of OSD weights were higher
>> then 50%, which was caused by fact that different placement rules operated
>> on different sets of OSDs, and those sets were not disjointed.
> 
> 
>>
>> Best regards,
>> Adam
>>
>>
>> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>       Hi Pedro, Loic,
>>
>>       For what it's worth, my intuition here (which has had a mixed
>>       record as
>>       far as CRUSH goes) is that this is the most promising path
>>       forward.
>>
>>       Thinking ahead a few steps, and confirming that I'm following
>>       the
>>       discussion so far, if you're able to do get black (or white) box
>>       gradient
>>       descent to work, then this will give us a set of weights for
>>       each item in
>>       the tree for each selection round, derived from the tree
>>       structure and
>>       original (target) weights.  That would basically give us a map
>>       of item id
>>       (bucket id or leave item id) to weight for each round.  i.e.,
>>
>>        map<int, map<int, float>> weight_by_position;  // position ->
>>       item -> weight
>>
>>       where the 0 round would (I think?) match the target weights, and
>>       each
>>       round after that would skew low-weighted items lower to some
>>       degree.
>>       Right?
>>
>>       The next question I have is: does this generalize from the
>>       single-bucket
>>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>>       like
>>
>>       3.1
>>        |_____________
>>        |   \    \    \
>>       1.0  1.0  1.0  .1
>>
>>       it clearly works, but when we have a multi-level tree like
>>
>>
>>       8.4
>>        |____________________________________
>>        |                 \                  \
>>       3.1                3.1                2.2
>>        |_____________     |_____________     |_____________
>>        |   \    \    \    |   \    \    \    |   \    \    \
>>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>>
>>       and the second round weights skew the small .1 leaves lower, can
>>       we
>>       continue to build the summed-weight hierarchy, such that the
>>       adjusted
>>       weights at the higher level are appropriately adjusted to give
>>       us the
>>       right probabilities of descending into those trees?  I'm not
>>       sure if that
>>       logically follows from the above or if my intuition is
>>       oversimplifying
>>       things.
>>
>>       If this *is* how we think this will shake out, then I'm
>>       wondering if we
>>       should go ahead and build this weigh matrix into CRUSH sooner
>>       rather
>>       than later (i.e., for luminous).  As with the explicit
>>       remappings, the
>>       hard part is all done offline, and the adjustments to the CRUSH
>>       mapping
>>       calculation itself (storing and making use of the adjusted
>>       weights for
>>       each round of placement) are relatively straightforward.  And
>>       the sooner
>>       this is incorporated into a release the sooner real users will
>>       be able to
>>       roll out code to all clients and start making us of it.
>>
>>       Thanks again for looking at this problem!  I'm excited that we
>>       may be
>>       closing in on a real solution!
>>
>>       sage
>>
>>
>>
>>
>>
>>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>
>>       > There are lot of gradient-free methods. I will try first to
>>       run the
>>       > ones available using just scipy
>>       >
>>       (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>       > Some of them don't require the gradient and some of them can
>>       estimate
>>       > it. The reason to go without the gradient is to run the CRUSH
>>       > algorithm as a black box. In that case this would be the
>>       pseudo-code:
>>       >
>>       > - BEGIN CODE -
>>       > def build_target(desired_freqs):
>>       >     def target(weights):
>>       >         # run a simulation of CRUSH for a number of objects
>>       >         sim_freqs = run_crush(weights)
>>       >         # Kullback-Leibler divergence between desired
>>       frequencies and
>>       > current ones
>>       >         return loss(sim_freqs, desired_freqs)
>>       >    return target
>>       >
>>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>       > - END CODE -
>>       >
>>       > The tricky thing here is that this procedure can be slow if
>>       the
>>       > simulation (run_crush) needs to place a lot of objects to get
>>       accurate
>>       > simulated frequencies. This is true specially if the minimize
>>       method
>>       > attempts to approximate the gradient using finite differences
>>       since it
>>       > will evaluate the target function a number of times
>>       proportional to
>>       > the number of weights). Apart from the ones in scipy I would
>>       try also
>>       > optimization methods that try to perform as few evaluations as
>>       > possible like for example HyperOpt
>>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>       into
>>       > account that the target function can be noisy.
>>       >
>>       > This black box approximation is simple to implement and makes
>>       the
>>       > computer do all the work instead of us.
>>       > I think that this black box approximation is worthy to try
>>       even if
>>       > it's not the final one because if this approximation works
>>       then we
>>       > know that a more elaborate one that computes the gradient of
>>       the CRUSH
>>       > algorithm will work for sure.
>>       >
>>       > I can try this black box approximation this weekend not on the
>>       real
>>       > CRUSH algorithm but with the simple implementation I did in
>>       python. If
>>       > it works it's just a matter of substituting one simulation
>>       with
>>       > another and see what happens.
>>       >
>>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>       > > Hi Pedro,
>>       > >
>>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>       > >> Hi Loic,
>>       > >>
>>       > >> From what I see everything seems OK.
>>       > >
>>       > > Cool. I'll keep going in this direction then !
>>       > >
>>       > >> The interesting thing would be to
>>       > >> test on some complex mapping. The reason is that
>>       "CrushPolicyFamily"
>>       > >> is right now modeling just a single straw bucket not the
>>       full CRUSH
>>       > >> algorithm.
>>       > >
>>       > > A number of use cases use a single straw bucket, maybe the
>>       majority of them. Even though it does not reflect the full range
>>       of what crush can offer, it could be useful. To be more
>>       specific, a crush map that states "place objects so that there
>>       is at most one replica per host" or "one replica per rack" is
>>       common. Such a crushmap can be reduced to a single straw bucket
>>       that contains all the hosts and by using the CrushPolicyFamily,
>>       we can change the weights of each host to fix the probabilities.
>>       The hosts themselves contain disks with varying weights but I
>>       think we can ignore that because crush will only recurse to
>>       place one object within a given host.
>>       > >
>>       > >> That's the work that remains to be done. The only way that
>>       > >> would avoid reimplementing the CRUSH algorithm and
>>       computing the
>>       > >> gradient would be treating CRUSH as a black box and
>>       eliminating the
>>       > >> necessity of computing the gradient either by using a
>>       gradient-free
>>       > >> optimization method or making an estimation of the
>>       gradient.
>>       > >
>>       > > By gradient-free optimization you mean simulated annealing
>>       or Monte Carlo ?
>>       > >
>>       > > Cheers
>>       > >
>>       > >>
>>       > >>
>>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>       > >>> Hi,
>>       > >>>
>>       > >>> I modified the crush library to accept two weights (one
>>       for the first disk, the other for the remaining disks)[1]. This
>>       really is a hack for experimentation purposes only ;-) I was
>>       able to run a variation of your code[2] and got the following
>>       results which are encouraging. Do you think what I did is
>>       sensible ? Or is there a problem I don't see ?
>>       > >>>
>>       > >>> Thanks !
>>       > >>>
>>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8 
>>       6]
>>       > >>>
>>       ------------------------------------------------------------------------
>>       > >>> Before: All replicas on each hard drive
>>       > >>> Expected vs actual use (20000 samples)
>>       > >>>  disk 0: 1.39e-01 1.12e-01
>>       > >>>  disk 1: 1.11e-01 1.10e-01
>>       > >>>  disk 2: 8.33e-02 1.13e-01
>>       > >>>  disk 3: 1.39e-01 1.11e-01
>>       > >>>  disk 4: 1.11e-01 1.11e-01
>>       > >>>  disk 5: 8.33e-02 1.11e-01
>>       > >>>  disk 6: 1.39e-01 1.12e-01
>>       > >>>  disk 7: 1.11e-01 1.12e-01
>>       > >>>  disk 8: 8.33e-02 1.10e-01
>>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>       > >>> ...
>>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>       > >>> Converged to desired accuracy :)
>>       > >>> After: All replicas on each hard drive
>>       > >>> Expected vs actual use (20000 samples)
>>       > >>>  disk 0: 1.39e-01 1.42e-01
>>       > >>>  disk 1: 1.11e-01 1.09e-01
>>       > >>>  disk 2: 8.33e-02 8.37e-02
>>       > >>>  disk 3: 1.39e-01 1.40e-01
>>       > >>>  disk 4: 1.11e-01 1.13e-01
>>       > >>>  disk 5: 8.33e-02 8.08e-02
>>       > >>>  disk 6: 1.39e-01 1.38e-01
>>       > >>>  disk 7: 1.11e-01 1.09e-01
>>       > >>>  disk 8: 8.33e-02 8.48e-02
>>       > >>>
>>       > >>>
>>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>       > >>>
>>       ------------------------------------------------------------------------
>>       > >>> Before: All replicas on each hard drive
>>       > >>> Expected vs actual use (20000 samples)
>>       > >>>  disk 0: 2.44e-01 2.36e-01
>>       > >>>  disk 1: 2.44e-01 2.38e-01
>>       > >>>  disk 2: 2.44e-01 2.34e-01
>>       > >>>  disk 3: 2.44e-01 2.38e-01
>>       > >>>  disk 4: 2.44e-02 5.37e-02
>>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>       > >>> ...
>>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>       > >>> Converged to desired accuracy :)
>>       > >>> After: All replicas on each hard drive
>>       > >>> Expected vs actual use (20000 samples)
>>       > >>>  disk 0: 2.44e-01 2.46e-01
>>       > >>>  disk 1: 2.44e-01 2.44e-01
>>       > >>>  disk 2: 2.44e-01 2.41e-01
>>       > >>>  disk 3: 2.44e-01 2.45e-01
>>       > >>>  disk 4: 2.44e-02 2.33e-02
>>       > >>>
>>       > >>>
>>       > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>       > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>       > >>>
>>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>       > >>>> Hi Pedro,
>>       > >>>>
>>       > >>>> It looks like trying to experiment with crush won't work
>>       as expected because crush does not distinguish the probability
>>       of selecting the first device from the probability of selecting
>>       the second or third device. Am I mistaken ?
>>       > >>>>
>>       > >>>> Cheers
>>       > >>>>
>>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>       > >>>>> Hi Pedro,
>>       > >>>>>
>>       > >>>>> I'm going to experiment with what you did at
>>       > >>>>>
>>       > >>>>>
>>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>       > >>>>>
>>       > >>>>> and the latest python-crush published today. A
>>       comparison function was added that will help measure the data
>>       movement. I'm hoping we can release an offline tool based on
>>       your solution. Please let me know if I should wait before diving
>>       into this, in case you have unpublished drafts or new ideas.
>>       > >>>>>
>>       > >>>>> Cheers
>>       > >>>>>
>>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>       > >>>>>> Great, thanks for the clarifications.
>>       > >>>>>> I also think that the most natural way is to keep just
>>       a set of
>>       > >>>>>> weights in the CRUSH map and update them inside the
>>       algorithm.
>>       > >>>>>>
>>       > >>>>>> I keep working on it.
>>       > >>>>>>
>>       > >>>>>>
>>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>       <sage@newdream.net>:
>>       > >>>>>>> Hi Pedro,
>>       > >>>>>>>
>>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>>       problem and we
>>       > >>>>>>> haven't made much headway.
>>       > >>>>>>>
>>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>       > >>>>>>>> Hi,
>>       > >>>>>>>>
>>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>>       much but I have
>>       > >>>>>>>> been thinking about it. In order to adapt the
>>       previous algorithm in
>>       > >>>>>>>> the python notebook I need to substitute the
>>       iteration over all
>>       > >>>>>>>> possible devices permutations to iteration over all
>>       the possible
>>       > >>>>>>>> selections that crush would make. That is the main
>>       thing I need to
>>       > >>>>>>>> work on.
>>       > >>>>>>>>
>>       > >>>>>>>> The other thing is of course that weights change for
>>       each replica.
>>       > >>>>>>>> That is, they cannot be really fixed in the crush
>>       map. So the
>>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>>       the map, need to be
>>       > >>>>>>>> changed. The weights in the crush map should reflect
>>       then, maybe, the
>>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>>       should have their own
>>       > >>>>>>>> crush map, but then the information about the
>>       previous selection
>>       > >>>>>>>> should be passed to the next replica placement run so
>>       it avoids
>>       > >>>>>>>> selecting the same one again.
>>       > >>>>>>>
>>       > >>>>>>> My suspicion is that the best solution here (whatever
>>       that means!)
>>       > >>>>>>> leaves the CRUSH weights intact with the desired
>>       distribution, and
>>       > >>>>>>> then generates a set of derivative weights--probably
>>       one set for each
>>       > >>>>>>> round/replica/rank.
>>       > >>>>>>>
>>       > >>>>>>> One nice property of this is that once the support is
>>       added to encode
>>       > >>>>>>> multiple sets of weights, the algorithm used to
>>       generate them is free to
>>       > >>>>>>> change and evolve independently.  (In most cases any
>>       change is
>>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>       because all
>>       > >>>>>>> parties participating in the cluster have to support
>>       any new behavior
>>       > >>>>>>> before it is enabled or used.)
>>       > >>>>>>>
>>       > >>>>>>>> I have a question also. Is there any significant
>>       difference between
>>       > >>>>>>>> the device selection algorithm description in the
>>       paper and its final
>>       > >>>>>>>> implementation?
>>       > >>>>>>>
>>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>>       found to be a bad
>>       > >>>>>>> idea; any collision or failed()/overload() case
>>       triggers the
>>       > >>>>>>> retry_descent.
>>       > >>>>>>>
>>       > >>>>>>> There are other changes, of course, but I don't think
>>       they'll impact any
>>       > >>>>>>> solution we come with here (or at least any solution
>>       can be suitably
>>       > >>>>>>> adapted)!
>>       > >>>>>>>
>>       > >>>>>>> sage
>>       > >>>>>> --
>>       > >>>>>> To unsubscribe from this list: send the line
>>       "unsubscribe ceph-devel" in
>>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>>       > >>>>>> More majordomo info at 
>>       http://vger.kernel.org/majordomo-info.html
>>       > >>>>>>
>>       > >>>>>
>>       > >>>>
>>       > >>>
>>       > >>> --
>>       > >>> Loïc Dachary, Artisan Logiciel Libre
>>       > >> --
>>       > >> To unsubscribe from this list: send the line "unsubscribe
>>       ceph-devel" in
>>       > >> the body of a message to majordomo@vger.kernel.org
>>       > >> More majordomo info at 
>>       http://vger.kernel.org/majordomo-info.html
>>       > >>
>>       > >
>>       > > --
>>       > > Loïc Dachary, Artisan Logiciel Libre
>>       > --
>>       > To unsubscribe from this list: send the line "unsubscribe
>>       ceph-devel" in
>>       > the body of a message to majordomo@vger.kernel.org
>>       > More majordomo info at 
>>       http://vger.kernel.org/majordomo-info.html
>>       >
>>       >
>>
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
       [not found]                                                 ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
@ 2017-03-27  9:27                                                   ` Adam Kupczyk
  2017-03-27 10:29                                                     ` Loic Dachary
                                                                       ` (2 more replies)
  0 siblings, 3 replies; 70+ messages in thread
From: Adam Kupczyk @ 2017-03-27  9:27 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, Ceph Development

Hi,

My understanding is that the optimal tweaked weights will depend on:
1) pool_id, because of rjenkins(pool_id) in crush
2) the number of placement groups and the replication factor, as they
determine the number of samples

Therefore the tweaked weights should be a property of the instantiated
pool, not of the crush placement definition.

If the tweaked weights are to be part of the crush definition, then for
each created pool we need a separate list of weights.
Is it possible to provide clients with different weights depending on
which pool they want to operate on?
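
For illustration only (all names hypothetical), the shape I have in mind
is one derived weight vector per instantiated pool rather than one per
crush rule:

- BEGIN CODE -
from collections import namedtuple

Pool = namedtuple('Pool', 'id pg_num size')

def optimize_weights(crush_map, pool_id, pg_num, size):
    # placeholder for the offline optimizer; pool_id matters because
    # rjenkins(pool_id) feeds the hash, and pg_num * size fixes the
    # number of samples being balanced
    raise NotImplementedError

def weights_by_pool(crush_map, pools):
    # one tweaked weight vector per instantiated pool
    return {p.id: optimize_weights(crush_map, p.id, p.pg_num, p.size)
            for p in pools}
- END CODE -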

Best regards,
Adam

On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines
> amount of samples
>
> Therefore tweaked weights should rather be property of instantialized pool,
> not crush placement definition.
>
> If tweaked weights are to be part of crush definition, than for each created
> pool we need to have separate list of weights.
> Is it possible to provide clients with different weights depending on on
> which pool they want to operate?
>
> Best regards,
> Adam
>
>
> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>>
>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>> >> Hello Sage, Loic, Pedro,
>> >>
>> >>
>> >> I am certain that almost perfect mapping can be achieved by
>> >> substituting weights from crush map with slightly modified weights.
>> >> By perfect mapping I mean we get on each OSD number of PGs exactly
>> >> proportional to weights specified in crush map.
>> >>
>> >> 1. Example
>> >> Lets think of PGs of single object pool.
>> >> We have OSDs with following weights:
>> >> [10, 10, 10, 5, 5]
>> >>
>> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
>> >> PGcopies :
>> >> [150, 150, 150, 75, 75]
>> >>
>> >> However, because crush simulates random process we have:
>> >> [143, 152, 158, 71, 76]
>> >>
>> >> We could have obtained perfect distribution had we used weights like
>> >> this:
>> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>> >>
>> >>
>> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>> >>
>> >> When we apply crush for the first time, distribution of PGs comes as
>> >> random.
>> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>> >>
>> >> But CRUSH is not random proces at all, it behaves in numerically stable
>> >> way.
>> >> Specifically, if we increase weight on one node, we will get more PGs
>> >> on
>> >> this node and less on every other node:
>> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>> >>
>> >> Now, finding ideal weights can be done by any numerical minimization
>> >> method,
>> >> for example NLMS.
>> >>
>> >>
>> >> 3. The proposal
>> >> For each pool, from initial weights given in crush map perfect weights
>> >> will
>> >> be derived.
>> >> This weights will be used to calculate PG distribution. This of course
>> >> will
>> >> be close to perfect.
>> >>
>> >> 3a: Downside when OSD is out
>> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>> >> Because now weights deviate from OSD capacity, some OSDs will
>> >> statistically
>> >> get more copies then they should.
>> >> This unevenness in distribution is proportional to scale of deviation
>> >> of
>> >> calculated weights to capacity weights.
>> >>
>> >> 3b: Upside
>> >> This all can be achieved without changes to crush.
>> >
>> > Yes!
>> >
>> > And no.  You're totally right--we should use an offline optimization to
>> > tweak the crush input weights to get a better balance.  It won't be
>> > robust
>> > to changes to the cluster, but we can incrementally optimize after that
>> > happens to converge on something better.
>> >
>> > The problem with doing this with current versions of Ceph is that we
>> > lose
>> > the original "input" or "target" weights (i.e., the actual size of
>> > the OSD) that we want to converge on.  This is one reason why we haven't
>> > done something like this before.
>> >
>> > In luminous we *could* work around this by storing those canonical
>> > weights outside of crush using something (probably?) ugly and
>> > maintain backward compatibility with older clients using existing
>> > CRUSH behavior.
>>
>> These canonical weights could be stored in crush by creating dedicated
>> buckets. For instance the root-canonical bucket could be created to store
>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>> the difference and know to add a new device in the host01-canonical bucket
>> instead of the host01 bucket. And to run an offline tool to keep the two
>> buckets in sync and compute the weight to use for placement derived from the
>> weights representing the device capacity.
>>
>> It is a little bit ugly ;-)
>>
>> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>> > that Pedro is working on works out, we'll want to extend the CRUSH map
>> > to
>> > include a set of derivative weights used for actual placement
>> > calculations
>> > instead of the canonical target weights, and we can do what you're
>> > proposing *and* solve the multipick problem with one change in the crush
>> > map and algorithm.  (Actually choosing those derivative weights will
>> > be an offline process that can both improve the balance for the inputs
>> > we
>> > care about *and* adjust them based on the position to fix the skew issue
>> > for replicas.)  This doesn't help pre-luminous clients, but I think the
>> > end solution will be simpler and more elegant...
>> >
>> > What do you think?
>> >
>> > sage
>> >
>> >
>> >> 4. Extra
>> >> Some time ago I made such change to perfectly balance Thomson-Reuters
>> >> cluster.
>> >> It succeeded.
>> >> A solution was not accepted, because modification of OSD weights were
>> >> higher
>> >> then 50%, which was caused by fact that different placement rules
>> >> operated
>> >> on different sets of OSDs, and those sets were not disjointed.
>> >
>> >
>> >>
>> >> Best regards,
>> >> Adam
>> >>
>> >>
>> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>> >>       Hi Pedro, Loic,
>> >>
>> >>       For what it's worth, my intuition here (which has had a mixed
>> >>       record as
>> >>       far as CRUSH goes) is that this is the most promising path
>> >>       forward.
>> >>
>> >>       Thinking ahead a few steps, and confirming that I'm following
>> >>       the
>> >>       discussion so far, if you're able to get black (or white) box
>> >>       gradient
>> >>       descent to work, then this will give us a set of weights for
>> >>       each item in
>> >>       the tree for each selection round, derived from the tree
>> >>       structure and
>> >>       original (target) weights.  That would basically give us a map
>> >>       of item id
>> >>       (bucket id or leaf item id) to weight for each round.  i.e.,
>> >>
>> >>        map<int, map<int, float>> weight_by_position;  // position ->
>> >>       item -> weight
>> >>
>> >>       where the 0 round would (I think?) match the target weights, and
>> >>       each
>> >>       round after that would skew low-weighted items lower to some
>> >>       degree.
>> >>       Right?
>> >>
>> >>       The next question I have is: does this generalize from the
>> >>       single-bucket
>> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>> >>       like
>> >>
>> >>       3.1
>> >>        |_____________
>> >>        |   \    \    \
>> >>       1.0  1.0  1.0  .1
>> >>
>> >>       it clearly works, but when we have a multi-level tree like
>> >>
>> >>
>> >>       8.4
>> >>        |____________________________________
>> >>        |                 \                  \
>> >>       3.1                3.1                2.2
>> >>        |_____________     |_____________     |_____________
>> >>        |   \    \    \    |   \    \    \    |   \    \    \
>> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>> >>
>> >>       and the second round weights skew the small .1 leaves lower, can
>> >>       we
>> >>       continue to build the summed-weight hierarchy, such that the
>> >>       adjusted
>> >>       weights at the higher level are appropriately adjusted to give
>> >>       us the
>> >>       right probabilities of descending into those trees?  I'm not
>> >>       sure if that
>> >>       logically follows from the above or if my intuition is
>> >>       oversimplifying
>> >>       things.
>> >>
>> >>       If this *is* how we think this will shake out, then I'm
>> >>       wondering if we
>> >>       should go ahead and build this weight matrix into CRUSH sooner
>> >>       rather
>> >>       than later (i.e., for luminous).  As with the explicit
>> >>       remappings, the
>> >>       hard part is all done offline, and the adjustments to the CRUSH
>> >>       mapping
>> >>       calculation itself (storing and making use of the adjusted
>> >>       weights for
>> >>       each round of placement) are relatively straightforward.  And
>> >>       the sooner
>> >>       this is incorporated into a release the sooner real users will
>> >>       be able to
>> >>       roll out code to all clients and start making use of it.
>> >>
>> >>       Thanks again for looking at this problem!  I'm excited that we
>> >>       may be
>> >>       closing in on a real solution!
>> >>
>> >>       sage
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>> >>
>> >>       > There are a lot of gradient-free methods. I will try first to
>> >>       run the
>> >>       > ones available using just scipy
>> >>       >
>> >>
>> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>> >>       > Some of them don't require the gradient and some of them can
>> >>       estimate
>> >>       > it. The reason to go without the gradient is to run the CRUSH
>> >>       > algorithm as a black box. In that case this would be the
>> >>       pseudo-code:
>> >>       >
>> >>       > - BEGIN CODE -
>> >>       > def build_target(desired_freqs):
>> >>       >     def target(weights):
>> >>       >         # run a simulation of CRUSH for a number of objects
>> >>       >         sim_freqs = run_crush(weights)
>> >>       >         # Kullback-Leibler divergence between desired
>> >>       frequencies and
>> >>       > current ones
>> >>       >         return loss(sim_freqs, desired_freqs)
>> >>       >    return target
>> >>       >
>> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>> >>       > - END CODE -
>> >>       >
>> >>       > The tricky thing here is that this procedure can be slow if
>> >>       the
>> >>       > simulation (run_crush) needs to place a lot of objects to get
>> >>       accurate
>> >>       > simulated frequencies. This is true especially if the minimize
>> >>       method
>> >>       > attempts to approximate the gradient using finite differences
>> >>       since it
>> >>       > will evaluate the target function a number of times
>> >>       proportional to
>> >>       > the number of weights. Apart from the ones in scipy I would
>> >>       try also
>> >>       > optimization methods that try to perform as few evaluations as
>> >>       > possible like for example HyperOpt
>> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>> >>       into
>> >>       > account that the target function can be noisy.
>> >>       >
>> >>       > This black box approximation is simple to implement and makes
>> >>       the
>> >>       > computer do all the work instead of us.
>> >>       > I think that this black box approximation is worth trying
>> >>       even if
>> >>       > it's not the final one because if this approximation works
>> >>       then we
>> >>       > know that a more elaborate one that computes the gradient of
>> >>       the CRUSH
>> >>       > algorithm will work for sure.
>> >>       >
>> >>       > I can try this black box approximation this weekend not on the
>> >>       real
>> >>       > CRUSH algorithm but with the simple implementation I did in
>> >>       python. If
>> >>       > it works it's just a matter of substituting one simulation
>> >>       with
>> >>       > another and see what happens.
>> >>       >
>> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >>       > > Hi Pedro,
>> >>       > >
>> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>> >>       > >> Hi Loic,
>> >>       > >>
>> >>       > >> From what I see everything seems OK.
>> >>       > >
>> >>       > > Cool. I'll keep going in this direction then !
>> >>       > >
>> >>       > >> The interesting thing would be to
>> >>       > >> test on some complex mapping. The reason is that
>> >>       "CrushPolicyFamily"
>> >>       > >> is right now modeling just a single straw bucket not the
>> >>       full CRUSH
>> >>       > >> algorithm.
>> >>       > >
>> >>       > > A number of use cases use a single straw bucket, maybe the
>> >>       majority of them. Even though it does not reflect the full range
>> >>       of what crush can offer, it could be useful. To be more
>> >>       specific, a crush map that states "place objects so that there
>> >>       is at most one replica per host" or "one replica per rack" is
>> >>       common. Such a crushmap can be reduced to a single straw bucket
>> >>       that contains all the hosts and by using the CrushPolicyFamily,
>> >>       we can change the weights of each host to fix the probabilities.
>> >>       The hosts themselves contain disks with varying weights but I
>> >>       think we can ignore that because crush will only recurse to
>> >>       place one object within a given host.
>> >>       > >
>> >>       > >> That's the work that remains to be done. The only way that
>> >>       > >> would avoid reimplementing the CRUSH algorithm and
>> >>       computing the
>> >>       > >> gradient would be treating CRUSH as a black box and
>> >>       eliminating the
>> >>       > >> necessity of computing the gradient either by using a
>> >>       gradient-free
>> >>       > >> optimization method or making an estimation of the
>> >>       gradient.
>> >>       > >
>> >>       > > By gradient-free optimization you mean simulated annealing
>> >>       or Monte Carlo ?
>> >>       > >
>> >>       > > Cheers
>> >>       > >
>> >>       > >>
>> >>       > >>
>> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >>       > >>> Hi,
>> >>       > >>>
>> >>       > >>> I modified the crush library to accept two weights (one
>> >>       for the first disk, the other for the remaining disks)[1]. This
>> >>       really is a hack for experimentation purposes only ;-) I was
>> >>       able to run a variation of your code[2] and got the following
>> >>       results which are encouraging. Do you think what I did is
>> >>       sensible ? Or is there a problem I don't see ?
>> >>       > >>>
>> >>       > >>> Thanks !
>> >>       > >>>
>> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
>> >>       6]
>> >>       > >>>
>> >>
>> >> ------------------------------------------------------------------------
>> >>       > >>> Before: All replicas on each hard drive
>> >>       > >>> Expected vs actual use (20000 samples)
>> >>       > >>>  disk 0: 1.39e-01 1.12e-01
>> >>       > >>>  disk 1: 1.11e-01 1.10e-01
>> >>       > >>>  disk 2: 8.33e-02 1.13e-01
>> >>       > >>>  disk 3: 1.39e-01 1.11e-01
>> >>       > >>>  disk 4: 1.11e-01 1.11e-01
>> >>       > >>>  disk 5: 8.33e-02 1.11e-01
>> >>       > >>>  disk 6: 1.39e-01 1.12e-01
>> >>       > >>>  disk 7: 1.11e-01 1.12e-01
>> >>       > >>>  disk 8: 8.33e-02 1.10e-01
>> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>> >>       > >>> ...
>> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>> >>       > >>> Converged to desired accuracy :)
>> >>       > >>> After: All replicas on each hard drive
>> >>       > >>> Expected vs actual use (20000 samples)
>> >>       > >>>  disk 0: 1.39e-01 1.42e-01
>> >>       > >>>  disk 1: 1.11e-01 1.09e-01
>> >>       > >>>  disk 2: 8.33e-02 8.37e-02
>> >>       > >>>  disk 3: 1.39e-01 1.40e-01
>> >>       > >>>  disk 4: 1.11e-01 1.13e-01
>> >>       > >>>  disk 5: 8.33e-02 8.08e-02
>> >>       > >>>  disk 6: 1.39e-01 1.38e-01
>> >>       > >>>  disk 7: 1.11e-01 1.09e-01
>> >>       > >>>  disk 8: 8.33e-02 8.48e-02
>> >>       > >>>
>> >>       > >>>
>> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>> >>       > >>>
>> >>
>> >> ------------------------------------------------------------------------
>> >>       > >>> Before: All replicas on each hard drive
>> >>       > >>> Expected vs actual use (20000 samples)
>> >>       > >>>  disk 0: 2.44e-01 2.36e-01
>> >>       > >>>  disk 1: 2.44e-01 2.38e-01
>> >>       > >>>  disk 2: 2.44e-01 2.34e-01
>> >>       > >>>  disk 3: 2.44e-01 2.38e-01
>> >>       > >>>  disk 4: 2.44e-02 5.37e-02
>> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>> >>       > >>> ...
>> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>> >>       > >>> Converged to desired accuracy :)
>> >>       > >>> After: All replicas on each hard drive
>> >>       > >>> Expected vs actual use (20000 samples)
>> >>       > >>>  disk 0: 2.44e-01 2.46e-01
>> >>       > >>>  disk 1: 2.44e-01 2.44e-01
>> >>       > >>>  disk 2: 2.44e-01 2.41e-01
>> >>       > >>>  disk 3: 2.44e-01 2.45e-01
>> >>       > >>>  disk 4: 2.44e-02 2.33e-02
>> >>       > >>>
>> >>       > >>>
>> >>       > >>> [1] crush hack
>> >> http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>> >>       > >>> [2] python-crush hack
>> >> http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>> >>       > >>>
>> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> >>       > >>>> Hi Pedro,
>> >>       > >>>>
>> >>       > >>>> It looks like trying to experiment with crush won't work
>> >>       as expected because crush does not distinguish the probability
>> >>       of selecting the first device from the probability of selecting
>> >>       the second or third device. Am I mistaken ?
>> >>       > >>>>
>> >>       > >>>> Cheers
>> >>       > >>>>
>> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> >>       > >>>>> Hi Pedro,
>> >>       > >>>>>
>> >>       > >>>>> I'm going to experiment with what you did at
>> >>       > >>>>>
>> >>       > >>>>>
>> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> >>       > >>>>>
>> >>       > >>>>> and the latest python-crush published today. A
>> >>       comparison function was added that will help measure the data
>> >>       movement. I'm hoping we can release an offline tool based on
>> >>       your solution. Please let me know if I should wait before diving
>> >>       into this, in case you have unpublished drafts or new ideas.
>> >>       > >>>>>
>> >>       > >>>>> Cheers
>> >>       > >>>>>
>> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> >>       > >>>>>> Great, thanks for the clarifications.
>> >>       > >>>>>> I also think that the most natural way is to keep just
>> >>       a set of
>> >>       > >>>>>> weights in the CRUSH map and update them inside the
>> >>       algorithm.
>> >>       > >>>>>>
>> >>       > >>>>>> I keep working on it.
>> >>       > >>>>>>
>> >>       > >>>>>>
>> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>> >>       <sage@newdream.net>:
>> >>       > >>>>>>> Hi Pedro,
>> >>       > >>>>>>>
>> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>> >>       problem and we
>> >>       > >>>>>>> haven't made much headway.
>> >>       > >>>>>>>
>> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> >>       > >>>>>>>> Hi,
>> >>       > >>>>>>>>
>> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>> >>       much but I have
>> >>       > >>>>>>>> been thinking about it. In order to adapt the
>> >>       previous algorithm in
>> >>       > >>>>>>>> the python notebook I need to substitute the
>> >>       iteration over all
>> >>       > >>>>>>>> possible devices permutations to iteration over all
>> >>       the possible
>> >>       > >>>>>>>> selections that crush would make. That is the main
>> >>       thing I need to
>> >>       > >>>>>>>> work on.
>> >>       > >>>>>>>>
>> >>       > >>>>>>>> The other thing is of course that weights change for
>> >>       each replica.
>> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
>> >>       map. So the
>> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>> >>       the map, need to be
>> >>       > >>>>>>>> changed. The weights in the crush map should reflect
>> >>       then, maybe, the
>> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>> >>       should have their own
>> >>       > >>>>>>>> crush map, but then the information about the
>> >>       previous selection
>> >>       > >>>>>>>> should be passed to the next replica placement run so
>> >>       it avoids
>> >>       > >>>>>>>> selecting the same one again.
>> >>       > >>>>>>>
>> >>       > >>>>>>> My suspicion is that the best solution here (whatever
>> >>       that means!)
>> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
>> >>       distribution, and
>> >>       > >>>>>>> then generates a set of derivative weights--probably
>> >>       one set for each
>> >>       > >>>>>>> round/replica/rank.
>> >>       > >>>>>>>
>> >>       > >>>>>>> One nice property of this is that once the support is
>> >>       added to encode
>> >>       > >>>>>>> multiple sets of weights, the algorithm used to
>> >>       generate them is free to
>> >>       > >>>>>>> change and evolve independently.  (In most cases any
>> >>       change in
>> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>> >>       because all
>> >>       > >>>>>>> parties participating in the cluster have to support
>> >>       any new behavior
>> >>       > >>>>>>> before it is enabled or used.)
>> >>       > >>>>>>>
>> >>       > >>>>>>>> I have a question also. Is there any significant
>> >>       difference between
>> >>       > >>>>>>>> the device selection algorithm description in the
>> >>       paper and its final
>> >>       > >>>>>>>> implementation?
>> >>       > >>>>>>>
>> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>> >>       found to be a bad
>> >>       > >>>>>>> idea; any collision or failed()/overload() case
>> >>       triggers the
>> >>       > >>>>>>> retry_descent.
>> >>       > >>>>>>>
>> >>       > >>>>>>> There are other changes, of course, but I don't think
>> >>       they'll impact any
>> >>       > >>>>>>> solution we come with here (or at least any solution
>> >>       can be suitably
>> >>       > >>>>>>> adapted)!
>> >>       > >>>>>>>
>> >>       > >>>>>>> sage
>> >>       > >>>>>> --
>> >>       > >>>>>> To unsubscribe from this list: send the line
>> >>       "unsubscribe ceph-devel" in
>> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>> >>       > >>>>>> More majordomo info at
>> >>       http://vger.kernel.org/majordomo-info.html
>> >>       > >>>>>>
>> >>       > >>>>>
>> >>       > >>>>
>> >>       > >>>
>> >>       > >>> --
>> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
>> >>       > >> --
>> >>       > >> To unsubscribe from this list: send the line "unsubscribe
>> >>       ceph-devel" in
>> >>       > >> the body of a message to majordomo@vger.kernel.org
>> >>       > >> More majordomo info at
>> >>       http://vger.kernel.org/majordomo-info.html
>> >>       > >>
>> >>       > >
>> >>       > > --
>> >>       > > Loïc Dachary, Artisan Logiciel Libre
>> >>       > --
>> >>       > To unsubscribe from this list: send the line "unsubscribe
>> >>       ceph-devel" in
>> >>       > the body of a message to majordomo@vger.kernel.org
>> >>       > More majordomo info at
>> >>       http://vger.kernel.org/majordomo-info.html
>> >>       >
>> >>       >
>> >>
>> >>
>> >>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-27  9:27                                                   ` Adam Kupczyk
@ 2017-03-27 10:29                                                     ` Loic Dachary
  2017-03-27 10:37                                                     ` Pedro López-Adeva
  2017-03-27 13:39                                                     ` Sage Weil
  2 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-03-27 10:29 UTC (permalink / raw)
  To: Adam Kupczyk; +Cc: Sage Weil, Ceph Development

Hi Adam,

On 03/27/2017 11:27 AM, Adam Kupczyk wrote:
> Hi,
> 
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines the
> amount of samples
>
> Therefore tweaked weights should rather be a property of the instantiated
> pool, not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.

This could be achieved by creating a bucket tree for each pool. There is a hack doing that at http://libcrush.org/main/python-crush/merge_requests/40/diffs and we can hopefully get something usable for the sysadmin (see http://libcrush.org/main/python-crush/issues/13). This is however different from fixing the crush multipick anomaly; it is primarily useful when there are not enough samples to get an even distribution.
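
As a purely illustrative sketch (the device names below are hypothetical and
the numbers are simply Adam's [10, 10, 10, 5, 5] / [10.2, 9.9, 9.6, 5.2, 4.9]
example quoted further down), "tweaked weights as a property of the pool"
amounts to keeping the canonical capacity weights plus an optional,
offline-computed weight set per pool, and falling back to the capacity
weights for pools that have not been optimized:

- BEGIN CODE -
# Hypothetical sketch only; this is not an existing Ceph or libcrush API.
canonical = {"osd.0": 10.0, "osd.1": 10.0, "osd.2": 10.0,
             "osd.3": 5.0, "osd.4": 5.0}

# pool_id -> weights derived offline for that pool's PG count and
# rjenkins(pool_id) seed; the example values come from this thread
tweaked_by_pool = {
    1: {"osd.0": 10.2, "osd.1": 9.9, "osd.2": 9.6,
        "osd.3": 5.2, "osd.4": 4.9},
}

def placement_weights(pool_id):
    # pools without an optimized set keep using the capacity weights
    return tweaked_by_pool.get(pool_id, canonical)
- END CODE -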

> Is it possible to provide clients with different weights depending on
> which pool they want to operate on?
> 
> Best regards,
> Adam
> 
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>> Hi,
>>
>> My understanding is that optimal tweaked weights will depend on:
>> 1) pool_id, because of rjenkins(pool_id) in crush
>> 2) number of placement groups and replication factor, as it determines the
>> amount of samples
>>
>> Therefore tweaked weights should rather be a property of the instantiated
>> pool, not of the crush placement definition.
>>
>> If tweaked weights are to be part of the crush definition, then for each
>> created pool we need to have a separate list of weights.
>> Is it possible to provide clients with different weights depending on
>> which pool they want to operate on?
>>
>> Best regards,
>> Adam
>>
>>
>> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>
>>>
>>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>>> On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>>>> Hello Sage, Loic, Pedro,
>>>>>
>>>>>
>>>>> I am certain that an almost perfect mapping can be achieved by
>>>>> substituting the weights from the crush map with slightly modified
>>>>> weights.
>>>>> By perfect mapping I mean we get on each OSD a number of PGs exactly
>>>>> proportional to the weights specified in the crush map.
>>>>>
>>>>> 1. Example
>>>>> Let's think of the PGs of a single object pool.
>>>>> We have OSDs with the following weights:
>>>>> [10, 10, 10, 5, 5]
>>>>>
>>>>> Ideally, we would like the following distribution of 200 PGs x 3 copies
>>>>> = 600 PG copies:
>>>>> [150, 150, 150, 75, 75]
>>>>>
>>>>> However, because crush simulates a random process we have:
>>>>> [143, 152, 158, 71, 76]
>>>>>
>>>>> We could have obtained perfect distribution had we used weights like
>>>>> this:
>>>>> [10.2, 9.9, 9.6, 5.2, 4.9]
>>>>>
>>>>>
>>>>> 2. Obtaining perfect mapping weights from OSD capacity weights
>>>>>
>>>>> When we apply crush for the first time, the distribution of PGs comes
>>>>> out as random.
>>>>> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>>>>
>>>>> But CRUSH is not a random process at all; it behaves in a numerically
>>>>> stable way.
>>>>> Specifically, if we increase the weight on one node, we will get more
>>>>> PGs on this node and fewer on every other node:
>>>>> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>>>>
>>>>> Now, finding ideal weights can be done by any numerical minimization
>>>>> method,
>>>>> for example NLMS.
>>>>>
>>>>>
>>>>> 3. The proposal
>>>>> For each pool, from the initial weights given in the crush map, perfect
>>>>> weights will be derived.
>>>>> These weights will be used to calculate the PG distribution. This of
>>>>> course will be close to perfect.
>>>>>
>>>>> 3a: Downside when OSD is out
>>>>> When an OSD is out, missing PG copies will be replicated elsewhere.
>>>>> Because now the weights deviate from the OSD capacities, some OSDs will
>>>>> statistically get more copies than they should.
>>>>> This unevenness in the distribution is proportional to how far the
>>>>> calculated weights deviate from the capacity weights.
>>>>>
>>>>> 3b: Upside
>>>>> This all can be achieved without changes to crush.
>>>>
>>>> Yes!
>>>>
>>>> And no.  You're totally right--we should use an offline optimization to
>>>> tweak the crush input weights to get a better balance.  It won't be
>>>> robust
>>>> to changes to the cluster, but we can incrementally optimize after that
>>>> happens to converge on something better.
>>>>
>>>> The problem with doing this with current versions of Ceph is that we
>>>> lose
>>>> the original "input" or "target" weights (i.e., the actual size of
>>>> the OSD) that we want to converge on.  This is one reason why we haven't
>>>> done something like this before.
>>>>
>>>> In luminous we *could* work around this by storing those canonical
>>>> weights outside of crush using something (probably?) ugly and
>>>> maintain backward compatibility with older clients using existing
>>>> CRUSH behavior.
>>>
>>> These canonical weights could be stored in crush by creating dedicated
>>> buckets. For instance the root-canonical bucket could be created to store
>>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> the difference and know to add a new device in the host01-canonical bucket
>>> instead of the host01 bucket. And to run an offline tool to keep the two
>>> buckets in sync and compute the weight to use for placement derived from the
>>> weights representing the device capacity.
>>>
>>> It is a little bit ugly ;-)
>>>
>>>> OR, (and this is my preferred route), if the multi-pick anomaly approach
>>>> that Pedro is working on works out, we'll want to extend the CRUSH map
>>>> to
>>>> include a set of derivative weights used for actual placement
>>>> calculations
>>>> instead of the canonical target weights, and we can do what you're
>>>> proposing *and* solve the multipick problem with one change in the crush
>>>> map and algorithm.  (Actually choosing those derivative weights will
>>>> be an offline process that can both improve the balance for the inputs
>>>> we
>>>> care about *and* adjust them based on the position to fix the skew issue
>>>> for replicas.)  This doesn't help pre-luminous clients, but I think the
>>>> end solution will be simpler and more elegant...
>>>>
>>>> What do you think?
>>>>
>>>> sage
>>>>
>>>>
>>>>> 4. Extra
>>>>> Some time ago I made such a change to perfectly balance a Thomson-Reuters
>>>>> cluster.
>>>>> It succeeded.
>>>>> The solution was not accepted, because the modification of OSD weights was
>>>>> higher than 50%, which was caused by the fact that different placement
>>>>> rules operated on different sets of OSDs, and those sets were not disjoint.
>>>>
>>>>
>>>>>
>>>>> Best regards,
>>>>> Adam
>>>>>
>>>>>
>>>>> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>>>>       Hi Pedro, Loic,
>>>>>
>>>>>       For what it's worth, my intuition here (which has had a mixed
>>>>>       record as
>>>>>       far as CRUSH goes) is that this is the most promising path
>>>>>       forward.
>>>>>
>>>>>       Thinking ahead a few steps, and confirming that I'm following
>>>>>       the
>>>>>       discussion so far, if you're able to get black (or white) box
>>>>>       gradient
>>>>>       descent to work, then this will give us a set of weights for
>>>>>       each item in
>>>>>       the tree for each selection round, derived from the tree
>>>>>       structure and
>>>>>       original (target) weights.  That would basically give us a map
>>>>>       of item id
>>>>>       (bucket id or leaf item id) to weight for each round.  i.e.,
>>>>>
>>>>>        map<int, map<int, float>> weight_by_position;  // position ->
>>>>>       item -> weight
>>>>>
>>>>>       where the 0 round would (I think?) match the target weights, and
>>>>>       each
>>>>>       round after that would skew low-weighted items lower to some
>>>>>       degree.
>>>>>       Right?
>>>>>
>>>>>       The next question I have is: does this generalize from the
>>>>>       single-bucket
>>>>>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>>>>>       like
>>>>>
>>>>>       3.1
>>>>>        |_____________
>>>>>        |   \    \    \
>>>>>       1.0  1.0  1.0  .1
>>>>>
>>>>>       it clearly works, but when we have a multi-level tree like
>>>>>
>>>>>
>>>>>       8.4
>>>>>        |____________________________________
>>>>>        |                 \                  \
>>>>>       3.1                3.1                2.2
>>>>>        |_____________     |_____________     |_____________
>>>>>        |   \    \    \    |   \    \    \    |   \    \    \
>>>>>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>>>>>
>>>>>       and the second round weights skew the small .1 leaves lower, can
>>>>>       we
>>>>>       continue to build the summed-weight hierarchy, such that the
>>>>>       adjusted
>>>>>       weights at the higher level are appropriately adjusted to give
>>>>>       us the
>>>>>       right probabilities of descending into those trees?  I'm not
>>>>>       sure if that
>>>>>       logically follows from the above or if my intuition is
>>>>>       oversimplifying
>>>>>       things.
>>>>>
>>>>>       If this *is* how we think this will shake out, then I'm
>>>>>       wondering if we
>>>>>       should go ahead and build this weight matrix into CRUSH sooner
>>>>>       rather
>>>>>       than later (i.e., for luminous).  As with the explicit
>>>>>       remappings, the
>>>>>       hard part is all done offline, and the adjustments to the CRUSH
>>>>>       mapping
>>>>>       calculation itself (storing and making use of the adjusted
>>>>>       weights for
>>>>>       each round of placement) are relatively straightforward.  And
>>>>>       the sooner
>>>>>       this is incorporated into a release the sooner real users will
>>>>>       be able to
>>>>>       roll out code to all clients and start making use of it.
>>>>>
>>>>>       Thanks again for looking at this problem!  I'm excited that we
>>>>>       may be
>>>>>       closing in on a real solution!
>>>>>
>>>>>       sage
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>>>>
>>>>>       > There are a lot of gradient-free methods. I will try first to
>>>>>       run the
>>>>>       > ones available using just scipy
>>>>>       >
>>>>>
>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>>       > Some of them don't require the gradient and some of them can
>>>>>       estimate
>>>>>       > it. The reason to go without the gradient is to run the CRUSH
>>>>>       > algorithm as a black box. In that case this would be the
>>>>>       pseudo-code:
>>>>>       >
>>>>>       > - BEGIN CODE -
>>>>>       > def build_target(desired_freqs):
>>>>>       >     def target(weights):
>>>>>       >         # run a simulation of CRUSH for a number of objects
>>>>>       >         sim_freqs = run_crush(weights)
>>>>>       >         # Kullback-Leibler divergence between desired
>>>>>       frequencies and
>>>>>       > current ones
>>>>>       >         return loss(sim_freqs, desired_freqs)
>>>>>       >    return target
>>>>>       >
>>>>>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>>>>       > - END CODE -
>>>>>       >
>>>>>       > The tricky thing here is that this procedure can be slow if
>>>>>       the
>>>>>       > simulation (run_crush) needs to place a lot of objects to get
>>>>>       accurate
>>>>>       > simulated frequencies. This is true especially if the minimize
>>>>>       method
>>>>>       > attempts to approximate the gradient using finite differences
>>>>>       since it
>>>>>       > will evaluate the target function a number of times
>>>>>       proportional to
>>>>>       > the number of weights. Apart from the ones in scipy I would
>>>>>       try also
>>>>>       > optimization methods that try to perform as few evaluations as
>>>>>       > possible like for example HyperOpt
>>>>>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>>>>       into
>>>>>       > account that the target function can be noisy.
>>>>>       >
>>>>>       > This black box approximation is simple to implement and makes
>>>>>       the
>>>>>       > computer do all the work instead of us.
>>>>>       > I think that this black box approximation is worth trying
>>>>>       even if
>>>>>       > it's not the final one because if this approximation works
>>>>>       then we
>>>>>       > know that a more elaborate one that computes the gradient of
>>>>>       the CRUSH
>>>>>       > algorithm will work for sure.
>>>>>       >
>>>>>       > I can try this black box approximation this weekend not on the
>>>>>       real
>>>>>       > CRUSH algorithm but with the simple implementation I did in
>>>>>       python. If
>>>>>       > it works it's just a matter of substituting one simulation
>>>>>       with
>>>>>       > another and see what happens.
>>>>>       >
>>>>>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>       > > Hi Pedro,
>>>>>       > >
>>>>>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>       > >> Hi Loic,
>>>>>       > >>
>>>>>       > >> From what I see everything seems OK.
>>>>>       > >
>>>>>       > > Cool. I'll keep going in this direction then !
>>>>>       > >
>>>>>       > >> The interesting thing would be to
>>>>>       > >> test on some complex mapping. The reason is that
>>>>>       "CrushPolicyFamily"
>>>>>       > >> is right now modeling just a single straw bucket not the
>>>>>       full CRUSH
>>>>>       > >> algorithm.
>>>>>       > >
>>>>>       > > A number of use cases use a single straw bucket, maybe the
>>>>>       majority of them. Even though it does not reflect the full range
>>>>>       of what crush can offer, it could be useful. To be more
>>>>>       specific, a crush map that states "place objects so that there
>>>>>       is at most one replica per host" or "one replica per rack" is
>>>>>       common. Such a crushmap can be reduced to a single straw bucket
>>>>>       that contains all the hosts and by using the CrushPolicyFamily,
>>>>>       we can change the weights of each host to fix the probabilities.
>>>>>       The hosts themselves contain disks with varying weights but I
>>>>>       think we can ignore that because crush will only recurse to
>>>>>       place one object within a given host.
>>>>>       > >
>>>>>       > >> That's the work that remains to be done. The only way that
>>>>>       > >> would avoid reimplementing the CRUSH algorithm and
>>>>>       computing the
>>>>>       > >> gradient would be treating CRUSH as a black box and
>>>>>       eliminating the
>>>>>       > >> necessity of computing the gradient either by using a
>>>>>       gradient-free
>>>>>       > >> optimization method or making an estimation of the
>>>>>       gradient.
>>>>>       > >
>>>>>       > > By gradient-free optimization you mean simulated annealing
>>>>>       or Monte Carlo ?
>>>>>       > >
>>>>>       > > Cheers
>>>>>       > >
>>>>>       > >>
>>>>>       > >>
>>>>>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>       > >>> Hi,
>>>>>       > >>>
>>>>>       > >>> I modified the crush library to accept two weights (one
>>>>>       for the first disk, the other for the remaining disks)[1]. This
>>>>>       really is a hack for experimentation purposes only ;-) I was
>>>>>       able to run a variation of your code[2] and got the following
>>>>>       results which are encouraging. Do you think what I did is
>>>>>       sensible ? Or is there a problem I don't see ?
>>>>>       > >>>
>>>>>       > >>> Thanks !
>>>>>       > >>>
>>>>>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
>>>>>       6]
>>>>>       > >>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>       > >>> Before: All replicas on each hard drive
>>>>>       > >>> Expected vs actual use (20000 samples)
>>>>>       > >>>  disk 0: 1.39e-01 1.12e-01
>>>>>       > >>>  disk 1: 1.11e-01 1.10e-01
>>>>>       > >>>  disk 2: 8.33e-02 1.13e-01
>>>>>       > >>>  disk 3: 1.39e-01 1.11e-01
>>>>>       > >>>  disk 4: 1.11e-01 1.11e-01
>>>>>       > >>>  disk 5: 8.33e-02 1.11e-01
>>>>>       > >>>  disk 6: 1.39e-01 1.12e-01
>>>>>       > >>>  disk 7: 1.11e-01 1.12e-01
>>>>>       > >>>  disk 8: 8.33e-02 1.10e-01
>>>>>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>       > >>> ...
>>>>>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>       > >>> Converged to desired accuracy :)
>>>>>       > >>> After: All replicas on each hard drive
>>>>>       > >>> Expected vs actual use (20000 samples)
>>>>>       > >>>  disk 0: 1.39e-01 1.42e-01
>>>>>       > >>>  disk 1: 1.11e-01 1.09e-01
>>>>>       > >>>  disk 2: 8.33e-02 8.37e-02
>>>>>       > >>>  disk 3: 1.39e-01 1.40e-01
>>>>>       > >>>  disk 4: 1.11e-01 1.13e-01
>>>>>       > >>>  disk 5: 8.33e-02 8.08e-02
>>>>>       > >>>  disk 6: 1.39e-01 1.38e-01
>>>>>       > >>>  disk 7: 1.11e-01 1.09e-01
>>>>>       > >>>  disk 8: 8.33e-02 8.48e-02
>>>>>       > >>>
>>>>>       > >>>
>>>>>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>       > >>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>>       > >>> Before: All replicas on each hard drive
>>>>>       > >>> Expected vs actual use (20000 samples)
>>>>>       > >>>  disk 0: 2.44e-01 2.36e-01
>>>>>       > >>>  disk 1: 2.44e-01 2.38e-01
>>>>>       > >>>  disk 2: 2.44e-01 2.34e-01
>>>>>       > >>>  disk 3: 2.44e-01 2.38e-01
>>>>>       > >>>  disk 4: 2.44e-02 5.37e-02
>>>>>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>       > >>> ...
>>>>>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>       > >>> Converged to desired accuracy :)
>>>>>       > >>> After: All replicas on each hard drive
>>>>>       > >>> Expected vs actual use (20000 samples)
>>>>>       > >>>  disk 0: 2.44e-01 2.46e-01
>>>>>       > >>>  disk 1: 2.44e-01 2.44e-01
>>>>>       > >>>  disk 2: 2.44e-01 2.41e-01
>>>>>       > >>>  disk 3: 2.44e-01 2.45e-01
>>>>>       > >>>  disk 4: 2.44e-02 2.33e-02
>>>>>       > >>>
>>>>>       > >>>
>>>>>       > >>> [1] crush hack
>>>>> http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>       > >>> [2] python-crush hack
>>>>> http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>       > >>>
>>>>>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>       > >>>> Hi Pedro,
>>>>>       > >>>>
>>>>>       > >>>> It looks like trying to experiment with crush won't work
>>>>>       as expected because crush does not distinguish the probability
>>>>>       of selecting the first device from the probability of selecting
>>>>>       the second or third device. Am I mistaken ?
>>>>>       > >>>>
>>>>>       > >>>> Cheers
>>>>>       > >>>>
>>>>>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>       > >>>>> Hi Pedro,
>>>>>       > >>>>>
>>>>>       > >>>>> I'm going to experiment with what you did at
>>>>>       > >>>>>
>>>>>       > >>>>>
>>>>>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>       > >>>>>
>>>>>       > >>>>> and the latest python-crush published today. A
>>>>>       comparison function was added that will help measure the data
>>>>>       movement. I'm hoping we can release an offline tool based on
>>>>>       your solution. Please let me know if I should wait before diving
>>>>>       into this, in case you have unpublished drafts or new ideas.
>>>>>       > >>>>>
>>>>>       > >>>>> Cheers
>>>>>       > >>>>>
>>>>>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>       > >>>>>> Great, thanks for the clarifications.
>>>>>       > >>>>>> I also think that the most natural way is to keep just
>>>>>       a set of
>>>>>       > >>>>>> weights in the CRUSH map and update them inside the
>>>>>       algorithm.
>>>>>       > >>>>>>
>>>>>       > >>>>>> I keep working on it.
>>>>>       > >>>>>>
>>>>>       > >>>>>>
>>>>>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>>>>       <sage@newdream.net>:
>>>>>       > >>>>>>> Hi Pedro,
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>>>>>       problem and we
>>>>>       > >>>>>>> haven't made much headway.
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>       > >>>>>>>> Hi,
>>>>>       > >>>>>>>>
>>>>>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>>>>>       much but I have
>>>>>       > >>>>>>>> been thinking about it. In order to adapt the
>>>>>       previous algorithm in
>>>>>       > >>>>>>>> the python notebook I need to substitute the
>>>>>       iteration over all
>>>>>       > >>>>>>>> possible devices permutations to iteration over all
>>>>>       the possible
>>>>>       > >>>>>>>> selections that crush would make. That is the main
>>>>>       thing I need to
>>>>>       > >>>>>>>> work on.
>>>>>       > >>>>>>>>
>>>>>       > >>>>>>>> The other thing is of course that weights change for
>>>>>       each replica.
>>>>>       > >>>>>>>> That is, they cannot be really fixed in the crush
>>>>>       map. So the
>>>>>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>>>>>       the map, need to be
>>>>>       > >>>>>>>> changed. The weights in the crush map should reflect
>>>>>       then, maybe, the
>>>>>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>>>>>       should have their own
>>>>>       > >>>>>>>> crush map, but then the information about the
>>>>>       previous selection
>>>>>       > >>>>>>>> should be passed to the next replica placement run so
>>>>>       it avoids
>>>>>       > >>>>>>>> selecting the same one again.
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> My suspicion is that the best solution here (whatever
>>>>>       that means!)
>>>>>       > >>>>>>> leaves the CRUSH weights intact with the desired
>>>>>       distribution, and
>>>>>       > >>>>>>> then generates a set of derivative weights--probably
>>>>>       one set for each
>>>>>       > >>>>>>> round/replica/rank.
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> One nice property of this is that once the support is
>>>>>       added to encode
>>>>>       > >>>>>>> multiple sets of weights, the algorithm used to
>>>>>       generate them is free to
>>>>>       > >>>>>>> change and evolve independently.  (In most cases any
>>>>>       change in
>>>>>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>>>>       because all
>>>>>       > >>>>>>> parties participating in the cluster have to support
>>>>>       any new behavior
>>>>>       > >>>>>>> before it is enabled or used.)
>>>>>       > >>>>>>>
>>>>>       > >>>>>>>> I have a question also. Is there any significant
>>>>>       difference between
>>>>>       > >>>>>>>> the device selection algorithm description in the
>>>>>       paper and its final
>>>>>       > >>>>>>>> implementation?
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>>>>>       found to be a bad
>>>>>       > >>>>>>> idea; any collision or failed()/overload() case
>>>>>       triggers the
>>>>>       > >>>>>>> retry_descent.
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> There are other changes, of course, but I don't think
>>>>>       they'll impact any
>>>>>       > >>>>>>> solution we come with here (or at least any solution
>>>>>       can be suitably
>>>>>       > >>>>>>> adapted)!
>>>>>       > >>>>>>>
>>>>>       > >>>>>>> sage
>>>>>       > >>>>>> --
>>>>>       > >>>>>> To unsubscribe from this list: send the line
>>>>>       "unsubscribe ceph-devel" in
>>>>>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>       > >>>>>> More majordomo info at
>>>>>       http://vger.kernel.org/majordomo-info.html
>>>>>       > >>>>>>
>>>>>       > >>>>>
>>>>>       > >>>>
>>>>>       > >>>
>>>>>       > >>> --
>>>>>       > >>> Loïc Dachary, Artisan Logiciel Libre
>>>>>       > >> --
>>>>>       > >> To unsubscribe from this list: send the line "unsubscribe
>>>>>       ceph-devel" in
>>>>>       > >> the body of a message to majordomo@vger.kernel.org
>>>>>       > >> More majordomo info at
>>>>>       http://vger.kernel.org/majordomo-info.html
>>>>>       > >>
>>>>>       > >
>>>>>       > > --
>>>>>       > > Loïc Dachary, Artisan Logiciel Libre
>>>>>       > --
>>>>>       > To unsubscribe from this list: send the line "unsubscribe
>>>>>       ceph-devel" in
>>>>>       > the body of a message to majordomo@vger.kernel.org
>>>>>       > More majordomo info at
>>>>>       http://vger.kernel.org/majordomo-info.html
>>>>>       >
>>>>>       >
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-27  9:27                                                   ` Adam Kupczyk
  2017-03-27 10:29                                                     ` Loic Dachary
@ 2017-03-27 10:37                                                     ` Pedro López-Adeva
  2017-03-27 13:39                                                     ` Sage Weil
  2 siblings, 0 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-03-27 10:37 UTC (permalink / raw)
  To: Adam Kupczyk; +Cc: Loic Dachary, Sage Weil, Ceph Development

I have performed some tests, as I said, using the black box method.
Remember that I have not used the real CRUSH algorithm.
The idea was to compare the results against the white box with
gradient information.

I have used two replicas and 9 disks with the following capacities:
10, 8, 6, 10, 8, 6, 10, 8, 6
Contrary to what I thought, scipy.optimize didn't give results, I think
because the target function is noisy. At the very least this made the method
I tried (SLSQP) return non-success, so I switched to HyperOpt.

The first result is that, of course, the method is much slower. This was
expected:

Time using jacobian:  0.36s
Time using simulation: 69.19s

But the results I think are OK. The following columns show the
parameters (weights for the second replica placement) estimated using
the jacobian (white box) vs the simulation method (black box).

jac  sim
--------
0.17 0.16
0.11 0.12
0.06 0.06
0.17 0.15
0.11 0.09
0.06 0.08
0.17 0.16
0.11 0.13
0.06 0.06

As you can see the agreement is reasonable.
You can see here the changes I made:
https://github.com/plafl/snippets/commit/ea701d2cffbf3884eab866ce6e2388879e040894
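
For anyone who wants to reproduce the shape of this experiment without the
real CRUSH code, here is a minimal, self-contained sketch of the same
black-box loop (simulate placements, score them with a KL divergence, hand
the score to a gradient-free optimizer). The toy run_crush below is an
illustrative stand-in for the real algorithm, and Nelder-Mead on a
fixed-seed simulation is just one way of coping with the noise; it is not
the HyperOpt code from the commit above.

- BEGIN CODE -
# Sketch only: a toy weighted-sampling stand-in for CRUSH, NOT the real thing.
import numpy as np
from scipy.optimize import minimize

capacities = np.array([10., 8., 6., 10., 8., 6., 10., 8., 6.])
desired = capacities / capacities.sum()        # target use per device

def run_crush(weights, n_objects=2000, replicas=2, seed=0):
    """For each object pick `replicas` distinct devices: the first pick uses
    the target weights, later picks use `weights`. The fixed seed makes the
    objective deterministic, which helps gradient-free methods."""
    rng = np.random.default_rng(seed)
    later = np.maximum(np.abs(weights), 1e-9)  # keep later-pick weights valid
    counts = np.zeros(len(weights))
    for _ in range(n_objects):
        chosen = []
        for r in range(replicas):
            w = (desired if r == 0 else later).copy()
            w[chosen] = 0.0                    # never pick the same device twice
            pick = rng.choice(len(w), p=w / w.sum())
            chosen.append(pick)
            counts[pick] += 1
    return counts / counts.sum()

def build_target(desired_freqs):
    def target(weights):
        sim_freqs = run_crush(weights) + 1e-12
        # Kullback-Leibler divergence between desired and simulated use
        return float(np.sum(desired_freqs * np.log(desired_freqs / sim_freqs)))
    return target

# Slow, as noted above: every evaluation re-runs the whole simulation.
res = minimize(build_target(desired), x0=desired.copy(),
               method='Nelder-Mead', options={'maxiter': 200})
second_replica_weights = np.abs(res.x) / np.abs(res.x).sum()
print(np.round(second_replica_weights, 3))
- END CODE -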

Where to go from here (I think):
1. Perform this same test using the real CRUSH algorithm
2. Improve the method to run faster


2017-03-27 11:27 GMT+02:00 Adam Kupczyk <akupczyk@mirantis.com>:
> Hi,
>
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) number of placement groups and replication factor, as it determines the
> amount of samples
>
> Therefore tweaked weights should rather be a property of the instantiated
> pool, not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate on?
>
> Best regards,
> Adam
>
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>> Hi,
>>
>> My understanding is that optimal tweaked weights will depend on:
>> 1) pool_id, because of rjenkins(pool_id) in crush
>> 2) number of placement groups and replication factor, as it determines the
>> amount of samples
>>
>> Therefore tweaked weights should rather be a property of the instantiated
>> pool, not of the crush placement definition.
>>
>> If tweaked weights are to be part of the crush definition, then for each
>> created pool we need to have a separate list of weights.
>> Is it possible to provide clients with different weights depending on
>> which pool they want to operate on?
>>
>> Best regards,
>> Adam
>>
>>
>> On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>
>>>
>>> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>> >> Hello Sage, Loic, Pedro,
>>> >>
>>> >>
>>> >> I am certain that an almost perfect mapping can be achieved by
>>> >> substituting the weights from the crush map with slightly modified
>>> >> weights.
>>> >> By perfect mapping I mean we get on each OSD a number of PGs exactly
>>> >> proportional to the weights specified in the crush map.
>>> >>
>>> >> 1. Example
>>> >> Let's think of the PGs of a single object pool.
>>> >> We have OSDs with the following weights:
>>> >> [10, 10, 10, 5, 5]
>>> >>
>>> >> Ideally, we would like the following distribution of 200 PGs x 3 copies
>>> >> = 600 PG copies:
>>> >> [150, 150, 150, 75, 75]
>>> >>
>>> >> However, because crush simulates a random process we have:
>>> >> [143, 152, 158, 71, 76]
>>> >>
>>> >> We could have obtained perfect distribution had we used weights like
>>> >> this:
>>> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>>> >>
>>> >>
>>> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>>> >>
>>> >> When we apply crush for the first time, the distribution of PGs comes
>>> >> out as random.
>>> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>> >>
>>> >> But CRUSH is not a random process at all; it behaves in a numerically
>>> >> stable way.
>>> >> Specifically, if we increase the weight on one node, we will get more
>>> >> PGs on this node and fewer on every other node:
>>> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>> >>
>>> >> Now, finding ideal weights can be done by any numerical minimization
>>> >> method,
>>> >> for example NLMS.
>>> >>
>>> >>
>>> >> 3. The proposal
>>> >> For each pool, from the initial weights given in the crush map, perfect
>>> >> weights will be derived.
>>> >> These weights will be used to calculate the PG distribution. This of
>>> >> course will be close to perfect.
>>> >>
>>> >> 3a: Downside when OSD is out
>>> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>>> >> Because now the weights deviate from the OSD capacities, some OSDs will
>>> >> statistically get more copies than they should.
>>> >> This unevenness in the distribution is proportional to how far the
>>> >> calculated weights deviate from the capacity weights.
>>> >>
>>> >> 3b: Upside
>>> >> This all can be achieved without changes to crush.
>>> >
>>> > Yes!
>>> >
>>> > And no.  You're totally right--we should use an offline optimization to
>>> > tweak the crush input weights to get a better balance.  It won't be
>>> > robust
>>> > to changes to the cluster, but we can incrementally optimize after that
>>> > happens to converge on something better.
>>> >
>>> > The problem with doing this with current versions of Ceph is that we
>>> > lose
>>> > the original "input" or "target" weights (i.e., the actual size of
>>> > the OSD) that we want to converge on.  This is one reason why we haven't
>>> > done something like this before.
>>> >
>>> > In luminous we *could* work around this by storing those canonical
>>> > weights outside of crush using something (probably?) ugly and
>>> > maintain backward compatibility with older clients using existing
>>> > CRUSH behavior.
>>>
>>> These canonical weights could be stored in crush by creating dedicated
>>> buckets. For instance the root-canonical bucket could be created to store
>>> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> the difference and know to add a new device in the host01-canonical bucket
>>> instead of the host01 bucket. And to run an offline tool to keep the two
>>> buckets in sync and compute the weight to use for placement derived from the
>>> weights representing the device capacity.
>>>
>>> It is a little bit ugly ;-)
>>>
>>> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>>> > that Pedro is working on works out, we'll want to extend the CRUSH map
>>> > to
>>> > include a set of derivative weights used for actual placement
>>> > calculations
>>> > instead of the canonical target weights, and we can do what you're
>>> > proposing *and* solve the multipick problem with one change in the crush
>>> > map and algorithm.  (Actually choosing those derivative weights will
>>> > be an offline process that can both improve the balance for the inputs
>>> > we
>>> > care about *and* adjust them based on the position to fix the skew issue
>>> > for replicas.)  This doesn't help pre-luminous clients, but I think the
>>> > end solution will be simpler and more elegant...
>>> >
>>> > What do you think?
>>> >
>>> > sage
>>> >
>>> >
>>> >> 4. Extra
>>> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
>>> >> cluster.
>>> >> It succeeded.
>>> >> The solution was not accepted, because the modification of OSD weights was
>>> >> higher than 50%, which was caused by the fact that different placement
>>> >> rules operated on different sets of OSDs, and those sets were not disjoint.
>>> >
>>> >
>>> >>
>>> >> Best regards,
>>> >> Adam
>>> >>
>>> >>
>>> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>> >>       Hi Pedro, Loic,
>>> >>
>>> >>       For what it's worth, my intuition here (which has had a mixed
>>> >>       record as
>>> >>       far as CRUSH goes) is that this is the most promising path
>>> >>       forward.
>>> >>
>>> >>       Thinking ahead a few steps, and confirming that I'm following
>>> >>       the
>>> >>       discussion so far, if you're able to get black (or white) box
>>> >>       gradient
>>> >>       descent to work, then this will give us a set of weights for
>>> >>       each item in
>>> >>       the tree for each selection round, derived from the tree
>>> >>       structure and
>>> >>       original (target) weights.  That would basically give us a map
>>> >>       of item id
>>> >>       (bucket id or leaf item id) to weight for each round.  i.e.,
>>> >>
>>> >>        map<int, map<int, float>> weight_by_position;  // position ->
>>> >>       item -> weight
>>> >>
>>> >>       where the 0 round would (I think?) match the target weights, and
>>> >>       each
>>> >>       round after that would skew low-weighted items lower to some
>>> >>       degree.
>>> >>       Right?
>>> >>
>>> >>       The next question I have is: does this generalize from the
>>> >>       single-bucket
>>> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>>> >>       like
>>> >>
>>> >>       3.1
>>> >>        |_____________
>>> >>        |   \    \    \
>>> >>       1.0  1.0  1.0  .1
>>> >>
>>> >>       it clearly works, but when we have a multi-level tree like
>>> >>
>>> >>
>>> >>       8.4
>>> >>        |____________________________________
>>> >>        |                 \                  \
>>> >>       3.1                3.1                2.2
>>> >>        |_____________     |_____________     |_____________
>>> >>        |   \    \    \    |   \    \    \    |   \    \    \
>>> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>>> >>
>>> >>       and the second round weights skew the small .1 leaves lower, can
>>> >>       we
>>> >>       continue to build the summed-weight hierarchy, such that the
>>> >>       adjusted
>>> >>       weights at the higher level are appropriately adjusted to give
>>> >>       us the
>>> >>       right probabilities of descending into those trees?  I'm not
>>> >>       sure if that
>>> >>       logically follows from the above or if my intuition is
>>> >>       oversimplifying
>>> >>       things.
>>> >>
>>> >>       If this *is* how we think this will shake out, then I'm
>>> >>       wondering if we
>>> >>       should go ahead and build this weight matrix into CRUSH sooner
>>> >>       rather
>>> >>       than later (i.e., for luminous).  As with the explicit
>>> >>       remappings, the
>>> >>       hard part is all done offline, and the adjustments to the CRUSH
>>> >>       mapping
>>> >>       calculation itself (storing and making use of the adjusted
>>> >>       weights for
>>> >>       each round of placement) are relatively straightforward.  And
>>> >>       the sooner
>>> >>       this is incorporated into a release the sooner real users will
>>> >>       be able to
>>> >>       roll out code to all clients and start making use of it.
>>> >>
>>> >>       Thanks again for looking at this problem!  I'm excited that we
>>> >>       may be
>>> >>       closing in on a real solution!
>>> >>
>>> >>       sage
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>> >>
>>> >>       > There are a lot of gradient-free methods. I will try first to
>>> >>       run the
>>> >>       > ones available using just scipy
>>> >>       >
>>> >>
>>> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> >>       > Some of them don't require the gradient and some of them can
>>> >>       estimate
>>> >>       > it. The reason to go without the gradient is to run the CRUSH
>>> >>       > algorithm as a black box. In that case this would be the
>>> >>       pseudo-code:
>>> >>       >
>>> >>       > - BEGIN CODE -
>>> >>       > def build_target(desired_freqs):
>>> >>       >     def target(weights):
>>> >>       >         # run a simulation of CRUSH for a number of objects
>>> >>       >         sim_freqs = run_crush(weights)
>>> >>       >         # Kullback-Leibler divergence between desired
>>> >>       frequencies and
>>> >>       > current ones
>>> >>       >         return loss(sim_freqs, desired_freqs)
>>> >>       >    return target
>>> >>       >
>>> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> >>       > - END CODE -
>>> >>       >
>>> >>       > The tricky thing here is that this procedure can be slow if
>>> >>       the
>>> >>       > simulation (run_crush) needs to place a lot of objects to get
>>> >>       accurate
>>> >>       > simulated frequencies. This is true especially if the minimize
>>> >>       method
>>> >>       > attempts to approximate the gradient using finite differences
>>> >>       since it
>>> >>       > will evaluate the target function a number of times
>>> >>       proportional to
>>> >>       > the number of weights. Apart from the ones in scipy I would
>>> >>       try also
>>> >>       > optimization methods that try to perform as few evaluations as
>>> >>       > possible like for example HyperOpt
>>> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>> >>       into
>>> >>       > account that the target function can be noisy.
>>> >>       >
>>> >>       > This black box approximation is simple to implement and makes
>>> >>       the
>>> >>       > computer do all the work instead of us.
>>> >>       > I think that this black box approximation is worth trying
>>> >>       even if
>>> >>       > it's not the final one because if this approximation works
>>> >>       then we
>>> >>       > know that a more elaborate one that computes the gradient of
>>> >>       the CRUSH
>>> >>       > algorithm will work for sure.
>>> >>       >
>>> >>       > I can try this black box approximation this weekend not on the
>>> >>       real
>>> >>       > CRUSH algorithm but with the simple implementation I did in
>>> >>       python. If
>>> >>       > it works it's just a matter of substituting one simulation
>>> >>       with
>>> >>       > another and see what happens.
>>> >>       >
>>> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >>       > > Hi Pedro,
>>> >>       > >
>>> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> >>       > >> Hi Loic,
>>> >>       > >>
>>> >>       > >> From what I see everything seems OK.
>>> >>       > >
>>> >>       > > Cool. I'll keep going in this direction then !
>>> >>       > >
>>> >>       > >> The interesting thing would be to
>>> >>       > >> test on some complex mapping. The reason is that
>>> >>       "CrushPolicyFamily"
>>> >>       > >> is right now modeling just a single straw bucket not the
>>> >>       full CRUSH
>>> >>       > >> algorithm.
>>> >>       > >
>>> >>       > > A number of use cases use a single straw bucket, maybe the
>>> >>       majority of them. Even though it does not reflect the full range
>>> >>       of what crush can offer, it could be useful. To be more
>>> >>       specific, a crush map that states "place objects so that there
>>> >>       is at most one replica per host" or "one replica per rack" is
>>> >>       common. Such a crushmap can be reduced to a single straw bucket
>>> >>       that contains all the hosts and by using the CrushPolicyFamily,
>>> >>       we can change the weights of each host to fix the probabilities.
>>> >>       The hosts themselves contain disks with varying weights but I
>>> >>       think we can ignore that because crush will only recurse to
>>> >>       place one object within a given host.
>>> >>       > >
>>> >>       > >> That's the work that remains to be done. The only way that
>>> >>       > >> would avoid reimplementing the CRUSH algorithm and
>>> >>       computing the
>>> >>       > >> gradient would be treating CRUSH as a black box and
>>> >>       eliminating the
>>> >>       > >> necessity of computing the gradient either by using a
>>> >>       gradient-free
>>> >>       > >> optimization method or making an estimation of the
>>> >>       gradient.
>>> >>       > >
>>> >>       > > By gradient-free optimization you mean simulated annealing
>>> >>       or Monte Carlo ?
>>> >>       > >
>>> >>       > > Cheers
>>> >>       > >
>>> >>       > >>
>>> >>       > >>
>>> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >>       > >>> Hi,
>>> >>       > >>>
>>> >>       > >>> I modified the crush library to accept two weights (one
>>> >>       for the first disk, the other for the remaining disks)[1]. This
>>> >>       really is a hack for experimentation purposes only ;-) I was
>>> >>       able to run a variation of your code[2] and got the following
>>> >>       results which are encouraging. Do you think what I did is
>>> >>       sensible ? Or is there a problem I don't see ?
>>> >>       > >>>
>>> >>       > >>> Thanks !
>>> >>       > >>>
>>> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
>>> >>       6]
>>> >>       > >>>
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >>       > >>> Before: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 1.39e-01 1.12e-01
>>> >>       > >>>  disk 1: 1.11e-01 1.10e-01
>>> >>       > >>>  disk 2: 8.33e-02 1.13e-01
>>> >>       > >>>  disk 3: 1.39e-01 1.11e-01
>>> >>       > >>>  disk 4: 1.11e-01 1.11e-01
>>> >>       > >>>  disk 5: 8.33e-02 1.11e-01
>>> >>       > >>>  disk 6: 1.39e-01 1.12e-01
>>> >>       > >>>  disk 7: 1.11e-01 1.12e-01
>>> >>       > >>>  disk 8: 8.33e-02 1.10e-01
>>> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>> >>       > >>> ...
>>> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>> >>       > >>> Converged to desired accuracy :)
>>> >>       > >>> After: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 1.39e-01 1.42e-01
>>> >>       > >>>  disk 1: 1.11e-01 1.09e-01
>>> >>       > >>>  disk 2: 8.33e-02 8.37e-02
>>> >>       > >>>  disk 3: 1.39e-01 1.40e-01
>>> >>       > >>>  disk 4: 1.11e-01 1.13e-01
>>> >>       > >>>  disk 5: 8.33e-02 8.08e-02
>>> >>       > >>>  disk 6: 1.39e-01 1.38e-01
>>> >>       > >>>  disk 7: 1.11e-01 1.09e-01
>>> >>       > >>>  disk 8: 8.33e-02 8.48e-02
>>> >>       > >>>
>>> >>       > >>>
>>> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>> >>       > >>>
>>> >>
>>> >> ------------------------------------------------------------------------
>>> >>       > >>> Before: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 2.44e-01 2.36e-01
>>> >>       > >>>  disk 1: 2.44e-01 2.38e-01
>>> >>       > >>>  disk 2: 2.44e-01 2.34e-01
>>> >>       > >>>  disk 3: 2.44e-01 2.38e-01
>>> >>       > >>>  disk 4: 2.44e-02 5.37e-02
>>> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>> >>       > >>> ...
>>> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>> >>       > >>> Converged to desired accuracy :)
>>> >>       > >>> After: All replicas on each hard drive
>>> >>       > >>> Expected vs actual use (20000 samples)
>>> >>       > >>>  disk 0: 2.44e-01 2.46e-01
>>> >>       > >>>  disk 1: 2.44e-01 2.44e-01
>>> >>       > >>>  disk 2: 2.44e-01 2.41e-01
>>> >>       > >>>  disk 3: 2.44e-01 2.45e-01
>>> >>       > >>>  disk 4: 2.44e-02 2.33e-02
>>> >>       > >>>
>>> >>       > >>>
>>> >>       > >>> [1] crush
>>> >> hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd
>>> >>       56fee8
>>> >>       > >>> [2] python-crush
>>> >> hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1
>>> >>       bd25f8f2c4b68
>>> >>       > >>>
>>> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> >>       > >>>> Hi Pedro,
>>> >>       > >>>>
>>> >>       > >>>> It looks like trying to experiment with crush won't work
>>> >>       as expected because crush does not distinguish the probability
>>> >>       of selecting the first device from the probability of selecting
>>> >>       the second or third device. Am I mistaken ?
>>> >>       > >>>>
>>> >>       > >>>> Cheers
>>> >>       > >>>>
>>> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> >>       > >>>>> Hi Pedro,
>>> >>       > >>>>>
>>> >>       > >>>>> I'm going to experiment with what you did at
>>> >>       > >>>>>
>>> >>       > >>>>>
>>> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> >>       > >>>>>
>>> >>       > >>>>> and the latest python-crush published today. A
>>> >>       comparison function was added that will help measure the data
>>> >>       movement. I'm hoping we can release an offline tool based on
>>> >>       your solution. Please let me know if I should wait before diving
>>> >>       into this, in case you have unpublished drafts or new ideas.
>>> >>       > >>>>>
>>> >>       > >>>>> Cheers
>>> >>       > >>>>>
>>> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> >>       > >>>>>> Great, thanks for the clarifications.
>>> >>       > >>>>>> I also think that the most natural way is to keep just
>>> >>       a set of
>>> >>       > >>>>>> weights in the CRUSH map and update them inside the
>>> >>       algorithm.
>>> >>       > >>>>>>
>>> >>       > >>>>>> I keep working on it.
>>> >>       > >>>>>>
>>> >>       > >>>>>>
>>> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>> >>       <sage@newdream.net>:
>>> >>       > >>>>>>> Hi Pedro,
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>>> >>       problem and we
>>> >>       > >>>>>>> haven't made much headway.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> >>       > >>>>>>>> Hi,
>>> >>       > >>>>>>>>
>>> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>>> >>       much but I have
>>> >>       > >>>>>>>> been thinking about it. In order to adapt the
>>> >>       previous algorithm in
>>> >>       > >>>>>>>> the python notebook I need to substitute the
>>> >>       iteration over all
>>> >>       > >>>>>>>> possible devices permutations to iteration over all
>>> >>       the possible
>>> >>       > >>>>>>>> selections that crush would make. That is the main
>>> >>       thing I need to
>>> >>       > >>>>>>>> work on.
>>> >>       > >>>>>>>>
>>> >>       > >>>>>>>> The other thing is of course that weights change for
>>> >>       each replica.
>>> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
>>> >>       map. So the
>>> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>>> >>       the map, need to be
>>> >>       > >>>>>>>> changed. The weights in the crush map should reflect
>>> >>       then, maybe, the
>>> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>>> >>       should have their own
>>> >>       > >>>>>>>> crush map, but then the information about the
>>> >>       previous selection
>>> >>       > >>>>>>>> should be passed to the next replica placement run so
>>> >>       it avoids
>>> >>       > >>>>>>>> selecting the same one again.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> My suspicion is that the best solution here (whatever
>>> >>       that means!)
>>> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
>>> >>       distribution, and
>>> >>       > >>>>>>> then generates a set of derivative weights--probably
>>> >>       one set for each
>>> >>       > >>>>>>> round/replica/rank.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> One nice property of this is that once the support is
>>> >>       added to encode
>>> >>       > >>>>>>> multiple sets of weights, the algorithm used to
>>> >>       generate them is free to
>>> >>       > >>>>>>> change and evolve independently.  (In most cases any
>>> >>       change in
>>> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>> >>       because all
>>> >>       > >>>>>>> parties participating in the cluster have to support
>>> >>       any new behavior
>>> >>       > >>>>>>> before it is enabled or used.)
>>> >>       > >>>>>>>
>>> >>       > >>>>>>>> I have a question also. Is there any significant
>>> >>       difference between
>>> >>       > >>>>>>>> the device selection algorithm description in the
>>> >>       paper and its final
>>> >>       > >>>>>>>> implementation?
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>>> >>       found to be a bad
>>> >>       > >>>>>>> idea; any collision or failed()/overload() case
>>> >>       triggers the
>>> >>       > >>>>>>> retry_descent.
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> There are other changes, of course, but I don't think
>>> >>       they'll impact any
>>> >>       > >>>>>>> solution we come with here (or at least any solution
>>> >>       can be suitably
>>> >>       > >>>>>>> adapted)!
>>> >>       > >>>>>>>
>>> >>       > >>>>>>> sage
>>> >>       > >>>>>> --
>>> >>       > >>>>>> To unsubscribe from this list: send the line
>>> >>       "unsubscribe ceph-devel" in
>>> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>>> >>       > >>>>>> More majordomo info at
>>> >>       http://vger.kernel.org/majordomo-info.html
>>> >>       > >>>>>>
>>> >>       > >>>>>
>>> >>       > >>>>
>>> >>       > >>>
>>> >>       > >>> --
>>> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
>>> >>       > >> --
>>> >>       > >> To unsubscribe from this list: send the line "unsubscribe
>>> >>       ceph-devel" in
>>> >>       > >> the body of a message to majordomo@vger.kernel.org
>>> >>       > >> More majordomo info at
>>> >>       http://vger.kernel.org/majordomo-info.html
>>> >>       > >>
>>> >>       > >
>>> >>       > > --
>>> >>       > > Loïc Dachary, Artisan Logiciel Libre
>>> >>       > --
>>> >>       > To unsubscribe from this list: send the line "unsubscribe
>>> >>       ceph-devel" in
>>> >>       > the body of a message to majordomo@vger.kernel.org
>>> >>       > More majordomo info at
>>> >>       http://vger.kernel.org/majordomo-info.html
>>> >>       >
>>> >>       >
>>> >>
>>> >>
>>> >>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-27  6:45                                               ` Loic Dachary
       [not found]                                                 ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
@ 2017-03-27 13:24                                                 ` Sage Weil
  1 sibling, 0 replies; 70+ messages in thread
From: Sage Weil @ 2017-03-27 13:24 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Adam Kupczyk, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 23328 bytes --]

On Mon, 27 Mar 2017, Loic Dachary wrote:
> On 03/27/2017 04:33 AM, Sage Weil wrote:
> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> >> Hello Sage, Loic, Pedro,
> >>
> >>
> >> I am certain that almost perfect mapping can be achieved by
> >> substituting weights from crush map with slightly modified weights.
> >> By perfect mapping I mean we get on each OSD number of PGs exactly
> >> proportional to weights specified in crush map.
> >>
> >> 1. Example
> >> Let's think of the PGs of a single object pool.
> >> We have OSDs with following weights:
> >> [10, 10, 10, 5, 5]
> >>
> >> Ideally, we would like the following distribution of 200 PGs x 3 copies = 600
> >> PG copies:
> >> [150, 150, 150, 75, 75]
> >>
> >> However, because crush simulates random process we have:
> >> [143, 152, 158, 71, 76]
> >>
> >> We could have obtained perfect distribution had we used weights like this:
> >> [10.2, 9.9, 9.6, 5.2, 4.9]
> >>
> >>
> >> 2. Obtaining perfect mapping weights from OSD capacity weights
> >>
> >> When we apply crush for the first time, the distribution of PGs comes out random.
> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> >>
> >> But CRUSH is not a random process at all; it behaves in a numerically stable way.
> >> Specifically, if we increase weight on one node, we will get more PGs on
> >> this node and less on every other node:
> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> >>
> >> Now, finding ideal weights can be done by any numerical minimization method,
> >> for example NLMS.
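
For illustration, here is a minimal self-contained sketch of the tuning loop
described above. It is not Adam's actual tool: simulate_placement() below is a
stand-in for a real CRUSH run (e.g. driving crushtool and reading back the
utilization), and the damped multiplicative update is just one of the many
minimization methods (NLMS included) that could be plugged in. All names are
made up.

- BEGIN CODE -
import random

def weighted_sample(weights, k, rng):
    # Pick k distinct indices with probability proportional to weight.
    remaining = list(range(len(weights)))
    picked = []
    for _ in range(k):
        total = sum(weights[i] for i in remaining)
        r = rng.uniform(0, total)
        acc, chosen = 0.0, remaining[-1]   # fallback guards float round-off
        for i in remaining:
            acc += weights[i]
            if r <= acc:
                chosen = i
                break
        picked.append(chosen)
        remaining.remove(chosen)
    return picked

def simulate_placement(weights, num_pgs=200, copies=3, seed=42):
    # Stand-in for CRUSH: deterministic for a given set of weights,
    # like CRUSH itself, so the optimizer sees a stable target function.
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(num_pgs):
        for i in weighted_sample(weights, copies, rng):
            counts[i] += 1
    return counts

def tune_weights(target_weights, rounds=50):
    # Iteratively nudge the placement weights until the simulated PG copy
    # counts are proportional to the target (capacity) weights.
    total = sum(target_weights)
    weights = list(target_weights)
    for _ in range(rounds):
        counts = simulate_placement(weights)
        placed = float(sum(counts))
        for i, target in enumerate(target_weights):
            expected = placed * target / total
            observed = max(counts[i], 1)
            # Damped multiplicative correction; any numerical minimizer
            # could replace this step.
            weights[i] *= (expected / observed) ** 0.5
    return weights

print(tune_weights([10, 10, 10, 5, 5]))
- END CODE -

Run against the real CRUSH instead of the toy simulator, a loop of this kind
would be expected to converge on derived weights of the same flavour as the
[10.2, 9.9, 9.6, 5.2, 4.9] example above.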
> >>
> >>
> >> 3. The proposal
> >> For each pool, perfect weights will be derived from the initial weights given
> >> in the crush map.
> >> These weights will be used to calculate the PG distribution. This of course
> >> will be close to perfect.
> >>
> >> 3a: Downside when OSD is out
> >> When an OSD is out, missing PG copies will be replicated elsewhere.
> >> Because now weights deviate from OSD capacity, some OSDs will statistically
> >> get more copies than they should.
> >> This unevenness in the distribution is proportional to the scale of deviation
> >> of the calculated weights from the capacity weights.
> >>
> >> 3b: Upside
> >> This all can be achieved without changes to crush.
> > 
> > Yes!
> > 
> > And no.  You're totally right--we should use an offline optimization to 
> > tweak the crush input weights to get a better balance.  It won't be robust 
> > to changes to the cluster, but we can incrementally optimize after that 
> > happens to converge on something better.
> > 
> > The problem with doing this with current versions of Ceph is that we lose 
> > the original "input" or "target" weights (i.e., the actual size of 
> > the OSD) that we want to converge on.  This is one reason why we haven't 
> > done something like this before.
> > 
> > In luminous we *could* work around this by storing those canonical 
> > weights outside of crush using something (probably?) ugly and 
> > maintain backward compatibility with older clients using existing 
> > CRUSH behavior.
> 
> These canonical weights could be stored in crush by creating dedicated buckets. For instance, a root-canonical bucket could be created to store the canonical weights of the root bucket. The sysadmin needs to be aware of the difference, know to add a new device to the host01-canonical bucket instead of the host01 bucket, and run an offline tool to keep the two buckets in sync and compute the placement weights derived from the weights representing the device capacity.

Oh, right!  I should have looked at the PR more closely.

> It is a little bit ugly ;-)

A bit, but it could be worse.  And we can kludge ceph to hide the 
derivative buckets in things like 'osd tree'.  I'd probably flip it around 
and keep the existing buckets as the 'canonical' ones, and create new 
~adjusted buckets, or some similar naming like we are doing with the 
device classes.

If there is an offline crush weight optimizer, it can either do the 
somewhat ugly parallel hierarchy, or if the crush encoding is luminous+ it 
can make use of the new (coming) weight matrix...

sage
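
To make the per-round weight idea a bit more concrete, here is a rough,
hypothetical sketch (not the actual Ceph/libcrush code) of how a
weight_by_position style lookup could be consulted during selection, using a
straw2-like draw; the adjusted values for rank 1 are made-up numbers.

- BEGIN CODE -
import hashlib
import math

def hash_unit(*keys):
    # Deterministic pseudo-random value in (0, 1] derived from the keys.
    h = hashlib.sha256("/".join(str(k) for k in keys).encode()).hexdigest()
    return (int(h[:16], 16) + 1) / float(1 << 64)

def choose(x, items, weight_by_position, num_rep):
    # Pick num_rep distinct items for input x, consulting a different set of
    # weights for each replica position; position 0 holds the target weights
    # and later positions hold the adjusted ones.
    result = []
    for pos in range(num_rep):
        weights = weight_by_position.get(pos, weight_by_position[0])
        best, best_straw = None, None
        for item in items:
            if item in result:
                continue                 # skip already-chosen items
            w = weights.get(item, 0.0)
            if w <= 0.0:
                continue
            # Straw2-style draw: argmax of ln(u)/w picks item i with
            # probability w_i / sum(w) among the remaining items.
            straw = math.log(hash_unit(x, pos, item)) / w
            if best_straw is None or straw > best_straw:
                best, best_straw = item, straw
        result.append(best)
    return result

weight_by_position = {
    0: {0: 10.0, 1: 10.0, 2: 10.0, 3: 10.0, 4: 1.0},   # target weights
    1: {0: 10.0, 1: 10.0, 2: 10.0, 3: 10.0, 4: 0.8},   # made-up adjustment
}
print(choose(1234, range(5), weight_by_position, num_rep=2))
- END CODE -

The open question then reduces to where the pos > 0 rows come from (the
offline optimizer) and how they are encoded and shipped to clients.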


> 
> > OR, (and this is my preferred route), if the multi-pick anomaly approach 
> > that Pedro is working on works out, we'll want to extend the CRUSH map to 
> > include a set of derivative weights used for actual placement calculations 
> > instead of the canonical target weights, and we can do what you're 
> > proposing *and* solve the multipick problem with one change in the crush 
> > map and algorithm.  (Actually choosing those derivative weights will 
> > be an offline process that can both improve the balance for the inputs we 
> > care about *and* adjust them based on the position to fix the skew issue 
> > for replicas.)  This doesn't help pre-luminous clients, but I think the 
> > end solution will be simpler and more elegant...
> > 
> > What do you think?
> > 
> > sage
> > 
> > 
> >> 4. Extra
> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
> >> cluster.
> >> It succeeded.
> >> The solution was not accepted, because the modification of OSD weights was
> >> higher than 50%, which was caused by the fact that different placement rules
> >> operated on different sets of OSDs, and those sets were not disjoint.
> > 
> > 
> >>
> >> Best regards,
> >> Adam
> >>
> >>
> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> >>       Hi Pedro, Loic,
> >>
> >>       For what it's worth, my intuition here (which has had a mixed
> >>       record as
> >>       far as CRUSH goes) is that this is the most promising path
> >>       forward.
> >>
> >>       Thinking ahead a few steps, and confirming that I'm following
> >>       the
> >>       discussion so far, if you're able to get black (or white) box
> >>       gradient
> >>       descent to work, then this will give us a set of weights for
> >>       each item in
> >>       the tree for each selection round, derived from the tree
> >>       structure and
> >>       original (target) weights.  That would basically give us a map
> >>       of item id
> >>       (bucket id or leaf item id) to weight for each round.  i.e.,
> >>
> >>        map<int, map<int, float>> weight_by_position;  // position ->
> >>       item -> weight
> >>
> >>       where the 0 round would (I think?) match the target weights, and
> >>       each
> >>       round after that would skew low-weighted items lower to some
> >>       degree.
> >>       Right?
> >>
> >>       The next question I have is: does this generalize from the
> >>       single-bucket
> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
> >>       like
> >>
> >>       3.1
> >>        |_____________
> >>        |   \    \    \
> >>       1.0  1.0  1.0  .1
> >>
> >>       it clearly works, but when we have a multi-level tree like
> >>
> >>
> >>       8.4
> >>        |____________________________________
> >>        |                 \                  \
> >>       3.1                3.1                2.2
> >>        |_____________     |_____________     |_____________
> >>        |   \    \    \    |   \    \    \    |   \    \    \
> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
> >>
> >>       and the second round weights skew the small .1 leaves lower, can
> >>       we
> >>       continue to build the summed-weight hierarchy, such that the
> >>       adjusted
> >>       weights at the higher level are appropriately adjusted to give
> >>       us the
> >>       right probabilities of descending into those trees?  I'm not
> >>       sure if that
> >>       logically follows from the above or if my intuition is
> >>       oversimplifying
> >>       things.
> >>
> >>       If this *is* how we think this will shake out, then I'm
> >>       wondering if we
> >>       should go ahead and build this weight matrix into CRUSH sooner
> >>       rather
> >>       than later (i.e., for luminous).  As with the explicit
> >>       remappings, the
> >>       hard part is all done offline, and the adjustments to the CRUSH
> >>       mapping
> >>       calculation itself (storing and making use of the adjusted
> >>       weights for
> >>       each round of placement) are relatively straightforward.  And
> >>       the sooner
> >>       this is incorporated into a release the sooner real users will
> >>       be able to
> >>       roll out code to all clients and start making use of it.
> >>
> >>       Thanks again for looking at this problem!  I'm excited that we
> >>       may be
> >>       closing in on a real solution!
> >>
> >>       sage
> >>
> >>
> >>
> >>
> >>
> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> >>
> >>       > There are a lot of gradient-free methods. I will first try to
> >>       run the
> >>       > ones available using just scipy
> >>       >
> >>       (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> >>       > Some of them don't require the gradient and some of them can
> >>       estimate
> >>       > it. The reason to go without the gradient is to run the CRUSH
> >>       > algorithm as a black box. In that case this would be the
> >>       pseudo-code:
> >>       >
> >>       > - BEGIN CODE -
> >>       > def build_target(desired_freqs):
> >>       >     def target(weights):
> >>       >         # run a simulation of CRUSH for a number of objects
> >>       >         sim_freqs = run_crush(weights)
> >>       >         # Kullback-Leibler divergence between desired
> >>       frequencies and
> >>       > current ones
> >>       >         return loss(sim_freqs, desired_freqs)
> >>       >    return target
> >>       >
> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
> >>       > - END CODE -
> >>       >
> >>       > The tricky thing here is that this procedure can be slow if
> >>       the
> >>       > simulation (run_crush) needs to place a lot of objects to get
> >>       accurate
> >>       > simulated frequencies. This is true especially if the minimize
> >>       method
> >>       > attempts to approximate the gradient using finite differences
> >>       since it
> >>       > will evaluate the target function a number of times
> >>       proportional to
> >>       > the number of weights. Apart from the ones in scipy I would
> >>       try also
> >>       > optimization methods that try to perform as few evaluations as
> >>       > possible like for example HyperOpt
> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
> >>       into
> >>       > account that the target function can be noisy.
> >>       >
> >>       > This black box approximation is simple to implement and makes
> >>       the
> >>       > computer do all the work instead of us.
> >>       > I think that this black box approximation is worth trying
> >>       even if
> >>       > it's not the final one because if this approximation works
> >>       then we
> >>       > know that a more elaborate one that computes the gradient of
> >>       the CRUSH
> >>       > algorithm will work for sure.
> >>       >
> >>       > I can try this black box approximation this weekend not on the
> >>       real
> >>       > CRUSH algorithm but with the simple implementation I did in
> >>       python. If
> >>       > it works it's just a matter of substituting one simulation
> >>       with
> >>       > another and see what happens.
> >>       >
> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >>       > > Hi Pedro,
> >>       > >
> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >>       > >> Hi Loic,
> >>       > >>
> >>       > >> From what I see everything seems OK.
> >>       > >
> >>       > > Cool. I'll keep going in this direction then !
> >>       > >
> >>       > >> The interesting thing would be to
> >>       > >> test on some complex mapping. The reason is that
> >>       "CrushPolicyFamily"
> >>       > >> is right now modeling just a single straw bucket not the
> >>       full CRUSH
> >>       > >> algorithm.
> >>       > >
> >>       > > A number of use cases use a single straw bucket, maybe the
> >>       majority of them. Even though it does not reflect the full range
> >>       of what crush can offer, it could be useful. To be more
> >>       specific, a crush map that states "place objects so that there
> >>       is at most one replica per host" or "one replica per rack" is
> >>       common. Such a crushmap can be reduced to a single straw bucket
> >>       that contains all the hosts and by using the CrushPolicyFamily,
> >>       we can change the weights of each host to fix the probabilities.
> >>       The hosts themselves contain disks with varying weights but I
> >>       think we can ignore that because crush will only recurse to
> >>       place one object within a given host.
> >>       > >
> >>       > >> That's the work that remains to be done. The only way that
> >>       > >> would avoid reimplementing the CRUSH algorithm and
> >>       computing the
> >>       > >> gradient would be treating CRUSH as a black box and
> >>       eliminating the
> >>       > >> necessity of computing the gradient either by using a
> >>       gradient-free
> >>       > >> optimization method or making an estimation of the
> >>       gradient.
> >>       > >
> >>       > > By gradient-free optimization you mean simulated annealing
> >>       or Monte Carlo ?
> >>       > >
> >>       > > Cheers
> >>       > >
> >>       > >>
> >>       > >>
> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >>       > >>> Hi,
> >>       > >>>
> >>       > >>> I modified the crush library to accept two weights (one
> >>       for the first disk, the other for the remaining disks)[1]. This
> >>       really is a hack for experimentation purposes only ;-) I was
> >>       able to run a variation of your code[2] and got the following
> >>       results which are encouraging. Do you think what I did is
> >>       sensible ? Or is there a problem I don't see ?
> >>       > >>>
> >>       > >>> Thanks !
> >>       > >>>
> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8 
> >>       6]
> >>       > >>>
> >>       ------------------------------------------------------------------------
> >>       > >>> Before: All replicas on each hard drive
> >>       > >>> Expected vs actual use (20000 samples)
> >>       > >>>  disk 0: 1.39e-01 1.12e-01
> >>       > >>>  disk 1: 1.11e-01 1.10e-01
> >>       > >>>  disk 2: 8.33e-02 1.13e-01
> >>       > >>>  disk 3: 1.39e-01 1.11e-01
> >>       > >>>  disk 4: 1.11e-01 1.11e-01
> >>       > >>>  disk 5: 8.33e-02 1.11e-01
> >>       > >>>  disk 6: 1.39e-01 1.12e-01
> >>       > >>>  disk 7: 1.11e-01 1.12e-01
> >>       > >>>  disk 8: 8.33e-02 1.10e-01
> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
> >>       > >>> ...
> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
> >>       > >>> Converged to desired accuracy :)
> >>       > >>> After: All replicas on each hard drive
> >>       > >>> Expected vs actual use (20000 samples)
> >>       > >>>  disk 0: 1.39e-01 1.42e-01
> >>       > >>>  disk 1: 1.11e-01 1.09e-01
> >>       > >>>  disk 2: 8.33e-02 8.37e-02
> >>       > >>>  disk 3: 1.39e-01 1.40e-01
> >>       > >>>  disk 4: 1.11e-01 1.13e-01
> >>       > >>>  disk 5: 8.33e-02 8.08e-02
> >>       > >>>  disk 6: 1.39e-01 1.38e-01
> >>       > >>>  disk 7: 1.11e-01 1.09e-01
> >>       > >>>  disk 8: 8.33e-02 8.48e-02
> >>       > >>>
> >>       > >>>
> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
> >>       > >>>
> >>       ------------------------------------------------------------------------
> >>       > >>> Before: All replicas on each hard drive
> >>       > >>> Expected vs actual use (20000 samples)
> >>       > >>>  disk 0: 2.44e-01 2.36e-01
> >>       > >>>  disk 1: 2.44e-01 2.38e-01
> >>       > >>>  disk 2: 2.44e-01 2.34e-01
> >>       > >>>  disk 3: 2.44e-01 2.38e-01
> >>       > >>>  disk 4: 2.44e-02 5.37e-02
> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
> >>       > >>> ...
> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
> >>       > >>> Converged to desired accuracy :)
> >>       > >>> After: All replicas on each hard drive
> >>       > >>> Expected vs actual use (20000 samples)
> >>       > >>>  disk 0: 2.44e-01 2.46e-01
> >>       > >>>  disk 1: 2.44e-01 2.44e-01
> >>       > >>>  disk 2: 2.44e-01 2.41e-01
> >>       > >>>  disk 3: 2.44e-01 2.45e-01
> >>       > >>>  disk 4: 2.44e-02 2.33e-02
> >>       > >>>
> >>       > >>>
> >>       > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd
> >>       56fee8
> >>       > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1
> >>       bd25f8f2c4b68
> >>       > >>>
> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >>       > >>>> Hi Pedro,
> >>       > >>>>
> >>       > >>>> It looks like trying to experiment with crush won't work
> >>       as expected because crush does not distinguish the probability
> >>       of selecting the first device from the probability of selecting
> >>       the second or third device. Am I mistaken ?
> >>       > >>>>
> >>       > >>>> Cheers
> >>       > >>>>
> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >>       > >>>>> Hi Pedro,
> >>       > >>>>>
> >>       > >>>>> I'm going to experiment with what you did at
> >>       > >>>>>
> >>       > >>>>>
> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >>       > >>>>>
> >>       > >>>>> and the latest python-crush published today. A
> >>       comparison function was added that will help measure the data
> >>       movement. I'm hoping we can release an offline tool based on
> >>       your solution. Please let me know if I should wait before diving
> >>       into this, in case you have unpublished drafts or new ideas.
> >>       > >>>>>
> >>       > >>>>> Cheers
> >>       > >>>>>
> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >>       > >>>>>> Great, thanks for the clarifications.
> >>       > >>>>>> I also think that the most natural way is to keep just
> >>       a set of
> >>       > >>>>>> weights in the CRUSH map and update them inside the
> >>       algorithm.
> >>       > >>>>>>
> >>       > >>>>>> I keep working on it.
> >>       > >>>>>>
> >>       > >>>>>>
> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> >>       <sage@newdream.net>:
> >>       > >>>>>>> Hi Pedro,
> >>       > >>>>>>>
> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
> >>       problem and we
> >>       > >>>>>>> haven't made much headway.
> >>       > >>>>>>>
> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >>       > >>>>>>>> Hi,
> >>       > >>>>>>>>
> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
> >>       much but I have
> >>       > >>>>>>>> been thinking about it. In order to adapt the
> >>       previous algorithm in
> >>       > >>>>>>>> the python notebook I need to substitute the
> >>       iteration over all
> >>       > >>>>>>>> possible devices permutations to iteration over all
> >>       the possible
> >>       > >>>>>>>> selections that crush would make. That is the main
> >>       thing I need to
> >>       > >>>>>>>> work on.
> >>       > >>>>>>>>
> >>       > >>>>>>>> The other thing is of course that weights change for
> >>       each replica.
> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
> >>       map. So the
> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
> >>       the map, need to be
> >>       > >>>>>>>> changed. The weights in the crush map should reflect
> >>       then, maybe, the
> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
> >>       should have their own
> >>       > >>>>>>>> crush map, but then the information about the
> >>       previous selection
> >>       > >>>>>>>> should be passed to the next replica placement run so
> >>       it avoids
> >>       > >>>>>>>> selecting the same one again.
> >>       > >>>>>>>
> >>       > >>>>>>> My suspicion is that the best solution here (whatever
> >>       that means!)
> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
> >>       distribution, and
> >>       > >>>>>>> then generates a set of derivative weights--probably
> >>       one set for each
> >>       > >>>>>>> round/replica/rank.
> >>       > >>>>>>>
> >>       > >>>>>>> One nice property of this is that once the support is
> >>       added to encode
> >>       > >>>>>>> multiple sets of weights, the algorithm used to
> >>       generate them is free to
> >>       > >>>>>>> change and evolve independently.  (In most cases any
> >>       change in
> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> >>       because all
> >>       > >>>>>>> parties participating in the cluster have to support
> >>       any new behavior
> >>       > >>>>>>> before it is enabled or used.)
> >>       > >>>>>>>
> >>       > >>>>>>>> I have a question also. Is there any significant
> >>       difference between
> >>       > >>>>>>>> the device selection algorithm description in the
> >>       paper and its final
> >>       > >>>>>>>> implementation?
> >>       > >>>>>>>
> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
> >>       found to be a bad
> >>       > >>>>>>> idea; any collision or failed()/overload() case
> >>       triggers the
> >>       > >>>>>>> retry_descent.
> >>       > >>>>>>>
> >>       > >>>>>>> There are other changes, of course, but I don't think
> >>       they'll impact any
> >>       > >>>>>>> solution we come with here (or at least any solution
> >>       can be suitably
> >>       > >>>>>>> adapted)!
> >>       > >>>>>>>
> >>       > >>>>>>> sage
> >>       > >>>>>> --
> >>       > >>>>>> To unsubscribe from this list: send the line
> >>       "unsubscribe ceph-devel" in
> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
> >>       > >>>>>> More majordomo info at 
> >>       http://vger.kernel.org/majordomo-info.html
> >>       > >>>>>>
> >>       > >>>>>
> >>       > >>>>
> >>       > >>>
> >>       > >>> --
> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
> >>       > >> --
> >>       > >> To unsubscribe from this list: send the line "unsubscribe
> >>       ceph-devel" in
> >>       > >> the body of a message to majordomo@vger.kernel.org
> >>       > >> More majordomo info at 
> >>       http://vger.kernel.org/majordomo-info.html
> >>       > >>
> >>       > >
> >>       > > --
> >>       > > Loïc Dachary, Artisan Logiciel Libre
> >>       > --
> >>       > To unsubscribe from this list: send the line "unsubscribe
> >>       ceph-devel" in
> >>       > the body of a message to majordomo@vger.kernel.org
> >>       > More majordomo info at 
> >>       http://vger.kernel.org/majordomo-info.html
> >>       >
> >>       >
> >>
> >>
> >>
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-27  9:27                                                   ` Adam Kupczyk
  2017-03-27 10:29                                                     ` Loic Dachary
  2017-03-27 10:37                                                     ` Pedro López-Adeva
@ 2017-03-27 13:39                                                     ` Sage Weil
  2017-03-28  6:52                                                       ` Adam Kupczyk
  2 siblings, 1 reply; 70+ messages in thread
From: Sage Weil @ 2017-03-27 13:39 UTC (permalink / raw)
  To: Adam Kupczyk; +Cc: Loic Dachary, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 26927 bytes --]

On Mon, 27 Mar 2017, Adam Kupczyk wrote:
> Hi,
> 
> My understanding is that optimal tweaked weights will depend on:
> 1) pool_id, because of rjenkins(pool_id) in crush
> 2) the number of placement groups and the replication factor, as they determine
> the number of samples
> 
> Therefore tweaked weights should rather be a property of an instantiated pool,
> not of the crush placement definition.
>
> If tweaked weights are to be part of the crush definition, then for each
> created pool we need to have a separate list of weights.
> Is it possible to provide clients with different weights depending on
> which pool they want to operate on?

As Loic suggested, you can create as many derivative hierarchies in the 
crush map as you like, potentially one per pool.  Or you could treat the 
sum total of all pgs as the interesting set, balance those, and get some 
OSDs doing a bit more of one pool than another.  The new post-CRUSH OSD 
remap capability can always clean this up (and turn a "good" crush 
distribution into a perfect distribution).

I guess the question is: when we add the explicit adjusted weight matrix 
to crush, should we have multiple sets of weights (perhaps one for each 
pool), or simply a single global set?  It might make sense to allow N 
sets of adjusted weights so that crush users can choose a particular 
set of them for different pools (or whatever it is they're calculating the 
mapping for).

sage
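
As a strawman for the "N sets of adjusted weights" option, a client-side
lookup could be as simple as the following sketch; the names, the per-pool
keying and the fallback rule are all hypothetical, not a proposed encoding.

- BEGIN CODE -
# Hypothetical layout: named weight sets, one optional set per pool plus a
# single global default.  Each set maps replica position -> item -> weight.
weight_sets = {
    "global": {
        0: {0: 10.0, 1: 10.0, 2: 10.0, 3: 5.0, 4: 5.0},
        1: {0: 10.1, 1: 9.9, 2: 9.8, 3: 5.1, 4: 4.9},
    },
    "pool-3": {
        0: {0: 10.2, 1: 9.9, 2: 9.6, 3: 5.2, 4: 4.9},
    },
}

def weights_for(pool_id, position):
    # Prefer a pool-specific set; otherwise fall back to the global one.
    chosen = weight_sets.get("pool-%d" % pool_id, weight_sets["global"])
    # Within a set, fall back to position 0 (the target weights) when a
    # position has no dedicated adjustment.
    return chosen.get(position, chosen[0])

print(weights_for(3, 1))   # pool 3 has no position-1 row -> its position-0 row
print(weights_for(7, 1))   # no pool-7 set -> global set, position 1
- END CODE -

A single global set keeps the map small and the client logic trivial; per-pool
sets capture the rjenkins(pool_id) dependence Adam points out, at the cost of
one optimization run (and one set of rows) per pool.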


> 
> Best regards,
> Adam
> 
> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> > Hi,
> >
> > My understanding is that optimal tweaked weights will depend on:
> > 1) pool_id, because of rjenkins(pool_id) in crush
> > 2) the number of placement groups and the replication factor, as they determine
> > the number of samples
> >
> > Therefore tweaked weights should rather be a property of an instantiated pool,
> > not of the crush placement definition.
> >
> > If tweaked weights are to be part of the crush definition, then for each
> > created pool we need to have a separate list of weights.
> > Is it possible to provide clients with different weights depending on
> > which pool they want to operate on?
> >
> > Best regards,
> > Adam
> >
> >
> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
> >>
> >>
> >>
> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> >> >> Hello Sage, Loic, Pedro,
> >> >>
> >> >>
> >> >> I am certain that almost perfect mapping can be achieved by
> >> >> substituting weights from crush map with slightly modified weights.
> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly
> >> >> proportional to weights specified in crush map.
> >> >>
> >> >> 1. Example
> >> >> Let's think of the PGs of a single object pool.
> >> >> We have OSDs with following weights:
> >> >> [10, 10, 10, 5, 5]
> >> >>
> >> >> Ideally, we would like the following distribution of 200 PGs x 3 copies = 600
> >> >> PG copies:
> >> >> [150, 150, 150, 75, 75]
> >> >>
> >> >> However, because crush simulates random process we have:
> >> >> [143, 152, 158, 71, 76]
> >> >>
> >> >> We could have obtained perfect distribution had we used weights like
> >> >> this:
> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
> >> >>
> >> >>
> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
> >> >>
> >> >> When we apply crush for the first time, the distribution of PGs comes out
> >> >> random.
> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> >> >>
> >> >> But CRUSH is not a random process at all; it behaves in a numerically stable
> >> >> way.
> >> >> Specifically, if we increase weight on one node, we will get more PGs
> >> >> on
> >> >> this node and less on every other node:
> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> >> >>
> >> >> Now, finding ideal weights can be done by any numerical minimization
> >> >> method,
> >> >> for example NLMS.
> >> >>
> >> >>
> >> >> 3. The proposal
> >> >> For each pool, perfect weights will be derived from the initial weights
> >> >> given in the crush map.
> >> >> These weights will be used to calculate the PG distribution. This of course
> >> >> will be close to perfect.
> >> >>
> >> >> 3a: Downside when OSD is out
> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
> >> >> Because now weights deviate from OSD capacity, some OSDs will
> >> >> statistically
> >> >> get more copies than they should.
> >> >> This unevenness in the distribution is proportional to the scale of
> >> >> deviation of the calculated weights from the capacity weights.
> >> >>
> >> >> 3b: Upside
> >> >> This all can be achieved without changes to crush.
> >> >
> >> > Yes!
> >> >
> >> > And no.  You're totally right--we should use an offline optimization to
> >> > tweak the crush input weights to get a better balance.  It won't be
> >> > robust
> >> > to changes to the cluster, but we can incrementally optimize after that
> >> > happens to converge on something better.
> >> >
> >> > The problem with doing this with current versions of Ceph is that we
> >> > lose
> >> > the original "input" or "target" weights (i.e., the actual size of
> >> > the OSD) that we want to converge on.  This is one reason why we haven't
> >> > done something like this before.
> >> >
> >> > In luminous we *could* work around this by storing those canonical
> >> > weights outside of crush using something (probably?) ugly and
> >> > maintain backward compatibility with older clients using existing
> >> > CRUSH behavior.
> >>
> >> These canonical weights could be stored in crush by creating dedicated
> >> buckets. For instance the root-canonical bucket could be created to store
> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
> >> the difference and know to add a new device in the host01-canonical bucket
> >> instead of the host01 bucket. And to run an offline tool to keep the two
> >> buckets in sync and compute the weight to use for placement derived from the
> >> weights representing the device capacity.
> >>
> >> It is a little bit ugly ;-)
> >>
> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
> >> > to
> >> > include a set of derivative weights used for actual placement
> >> > calculations
> >> > instead of the canonical target weights, and we can do what you're
> >> > proposing *and* solve the multipick problem with one change in the crush
> >> > map and algorithm.  (Actually choosing those derivative weights will
> >> > be an offline process that can both improve the balance for the inputs
> >> > we
> >> > care about *and* adjust them based on the position to fix the skew issue
> >> > for replicas.)  This doesn't help pre-luminous clients, but I think the
> >> > end solution will be simpler and more elegant...
> >> >
> >> > What do you think?
> >> >
> >> > sage
> >> >
> >> >
> >> >> 4. Extra
> >> >> Some time ago I made such a change to perfectly balance a Thomson-Reuters
> >> >> cluster.
> >> >> It succeeded.
> >> >> The solution was not accepted, because the modification of OSD weights was
> >> >> higher than 50%, which was caused by the fact that different placement
> >> >> rules operated on different sets of OSDs, and those sets were not disjoint.
> >> >
> >> >
> >> >>
> >> >> Best regards,
> >> >> Adam
> >> >>
> >> >>
> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> >> >>       Hi Pedro, Loic,
> >> >>
> >> >>       For what it's worth, my intuition here (which has had a mixed
> >> >>       record as
> >> >>       far as CRUSH goes) is that this is the most promising path
> >> >>       forward.
> >> >>
> >> >>       Thinking ahead a few steps, and confirming that I'm following
> >> >>       the
> >> >>       discussion so far, if you're able to get black (or white) box
> >> >>       gradient
> >> >>       descent to work, then this will give us a set of weights for
> >> >>       each item in
> >> >>       the tree for each selection round, derived from the tree
> >> >>       structure and
> >> >>       original (target) weights.  That would basically give us a map
> >> >>       of item id
> >> >>       (bucket id or leaf item id) to weight for each round.  i.e.,
> >> >>
> >> >>        map<int, map<int, float>> weight_by_position;  // position ->
> >> >>       item -> weight
> >> >>
> >> >>       where the 0 round would (I think?) match the target weights, and
> >> >>       each
> >> >>       round after that would skew low-weighted items lower to some
> >> >>       degree.
> >> >>       Right?
> >> >>
> >> >>       The next question I have is: does this generalize from the
> >> >>       single-bucket
> >> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
> >> >>       like
> >> >>
> >> >>       3.1
> >> >>        |_____________
> >> >>        |   \    \    \
> >> >>       1.0  1.0  1.0  .1
> >> >>
> >> >>       it clearly works, but when we have a multi-level tree like
> >> >>
> >> >>
> >> >>       8.4
> >> >>        |____________________________________
> >> >>        |                 \                  \
> >> >>       3.1                3.1                2.2
> >> >>        |_____________     |_____________     |_____________
> >> >>        |   \    \    \    |   \    \    \    |   \    \    \
> >> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
> >> >>
> >> >>       and the second round weights skew the small .1 leaves lower, can
> >> >>       we
> >> >>       continue to build the summed-weight hierarchy, such that the
> >> >>       adjusted
> >> >>       weights at the higher level are appropriately adjusted to give
> >> >>       us the
> >> >>       right probabilities of descending into those trees?  I'm not
> >> >>       sure if that
> >> >>       logically follows from the above or if my intuition is
> >> >>       oversimplifying
> >> >>       things.
> >> >>
> >> >>       If this *is* how we think this will shake out, then I'm
> >> >>       wondering if we
> >> >>       should go ahead and build this weight matrix into CRUSH sooner
> >> >>       rather
> >> >>       than later (i.e., for luminous).  As with the explicit
> >> >>       remappings, the
> >> >>       hard part is all done offline, and the adjustments to the CRUSH
> >> >>       mapping
> >> >>       calculation itself (storing and making use of the adjusted
> >> >>       weights for
> >> >>       each round of placement) are relatively straightforward.  And
> >> >>       the sooner
> >> >>       this is incorporated into a release the sooner real users will
> >> >>       be able to
> >> >>       roll out code to all clients and start making use of it.
> >> >>
> >> >>       Thanks again for looking at this problem!  I'm excited that we
> >> >>       may be
> >> >>       closing in on a real solution!
> >> >>
> >> >>       sage
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> >> >>
> >> >>       > There are a lot of gradient-free methods. I will first try to
> >> >>       run the
> >> >>       > ones available using just scipy
> >> >>       >
> >> >>
> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> >> >>       > Some of them don't require the gradient and some of them can
> >> >>       estimate
> >> >>       > it. The reason to go without the gradient is to run the CRUSH
> >> >>       > algorithm as a black box. In that case this would be the
> >> >>       pseudo-code:
> >> >>       >
> >> >>       > - BEGIN CODE -
> >> >>       > def build_target(desired_freqs):
> >> >>       >     def target(weights):
> >> >>       >         # run a simulation of CRUSH for a number of objects
> >> >>       >         sim_freqs = run_crush(weights)
> >> >>       >         # Kullback-Leibler divergence between desired
> >> >>       frequencies and
> >> >>       > current ones
> >> >>       >         return loss(sim_freqs, desired_freqs)
> >> >>       >    return target
> >> >>       >
> >> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
> >> >>       > - END CODE -
> >> >>       >
> >> >>       > The tricky thing here is that this procedure can be slow if
> >> >>       the
> >> >>       > simulation (run_crush) needs to place a lot of objects to get
> >> >>       accurate
> >> >>       > simulated frequencies. This is true especially if the minimize
> >> >>       method
> >> >>       > attempts to approximate the gradient using finite differences
> >> >>       since it
> >> >>       > will evaluate the target function a number of times
> >> >>       proportional to
> >> >>       > the number of weights. Apart from the ones in scipy I would
> >> >>       try also
> >> >>       > optimization methods that try to perform as few evaluations as
> >> >>       > possible like for example HyperOpt
> >> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
> >> >>       into
> >> >>       > account that the target function can be noisy.
> >> >>       >
> >> >>       > This black box approximation is simple to implement and makes
> >> >>       the
> >> >>       > computer do all the work instead of us.
> >> >>       > I think that this black box approximation is worth trying
> >> >>       even if
> >> >>       > it's not the final one because if this approximation works
> >> >>       then we
> >> >>       > know that a more elaborate one that computes the gradient of
> >> >>       the CRUSH
> >> >>       > algorithm will work for sure.
> >> >>       >
> >> >>       > I can try this black box approximation this weekend not on the
> >> >>       real
> >> >>       > CRUSH algorithm but with the simple implementation I did in
> >> >>       python. If
> >> >>       > it works it's just a matter of substituting one simulation
> >> >>       with
> >> >>       > another and see what happens.
> >> >>       >
> >> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >>       > > Hi Pedro,
> >> >>       > >
> >> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> >>       > >> Hi Loic,
> >> >>       > >>
> >> >>       > >> From what I see everything seems OK.
> >> >>       > >
> >> >>       > > Cool. I'll keep going in this direction then !
> >> >>       > >
> >> >>       > >> The interesting thing would be to
> >> >>       > >> test on some complex mapping. The reason is that
> >> >>       "CrushPolicyFamily"
> >> >>       > >> is right now modeling just a single straw bucket not the
> >> >>       full CRUSH
> >> >>       > >> algorithm.
> >> >>       > >
> >> >>       > > A number of use cases use a single straw bucket, maybe the
> >> >>       majority of them. Even though it does not reflect the full range
> >> >>       of what crush can offer, it could be useful. To be more
> >> >>       specific, a crush map that states "place objects so that there
> >> >>       is at most one replica per host" or "one replica per rack" is
> >> >>       common. Such a crushmap can be reduced to a single straw bucket
> >> >>       that contains all the hosts and by using the CrushPolicyFamily,
> >> >>       we can change the weights of each host to fix the probabilities.
> >> >>       The hosts themselves contain disks with varying weights but I
> >> >>       think we can ignore that because crush will only recurse to
> >> >>       place one object within a given host.
> >> >>       > >
> >> >>       > >> That's the work that remains to be done. The only way that
> >> >>       > >> would avoid reimplementing the CRUSH algorithm and
> >> >>       computing the
> >> >>       > >> gradient would be treating CRUSH as a black box and
> >> >>       eliminating the
> >> >>       > >> necessity of computing the gradient either by using a
> >> >>       gradient-free
> >> >>       > >> optimization method or making an estimation of the
> >> >>       gradient.
> >> >>       > >
> >> >>       > > By gradient-free optimization you mean simulated annealing
> >> >>       or Monte Carlo ?
> >> >>       > >
> >> >>       > > Cheers
> >> >>       > >
> >> >>       > >>
> >> >>       > >>
> >> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >>       > >>> Hi,
> >> >>       > >>>
> >> >>       > >>> I modified the crush library to accept two weights (one
> >> >>       for the first disk, the other for the remaining disks)[1]. This
> >> >>       really is a hack for experimentation purposes only ;-) I was
> >> >>       able to run a variation of your code[2] and got the following
> >> >>       results which are encouraging. Do you think what I did is
> >> >>       sensible ? Or is there a problem I don't see ?
> >> >>       > >>>
> >> >>       > >>> Thanks !
> >> >>       > >>>
> >> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
> >> >>       6]
> >> >>       > >>>
> >> >>
> >> >> ------------------------------------------------------------------------
> >> >>       > >>> Before: All replicas on each hard drive
> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >>       > >>>  disk 0: 1.39e-01 1.12e-01
> >> >>       > >>>  disk 1: 1.11e-01 1.10e-01
> >> >>       > >>>  disk 2: 8.33e-02 1.13e-01
> >> >>       > >>>  disk 3: 1.39e-01 1.11e-01
> >> >>       > >>>  disk 4: 1.11e-01 1.11e-01
> >> >>       > >>>  disk 5: 8.33e-02 1.11e-01
> >> >>       > >>>  disk 6: 1.39e-01 1.12e-01
> >> >>       > >>>  disk 7: 1.11e-01 1.12e-01
> >> >>       > >>>  disk 8: 8.33e-02 1.10e-01
> >> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
> >> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
> >> >>       > >>> ...
> >> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
> >> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
> >> >>       > >>> Converged to desired accuracy :)
> >> >>       > >>> After: All replicas on each hard drive
> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >>       > >>>  disk 0: 1.39e-01 1.42e-01
> >> >>       > >>>  disk 1: 1.11e-01 1.09e-01
> >> >>       > >>>  disk 2: 8.33e-02 8.37e-02
> >> >>       > >>>  disk 3: 1.39e-01 1.40e-01
> >> >>       > >>>  disk 4: 1.11e-01 1.13e-01
> >> >>       > >>>  disk 5: 8.33e-02 8.08e-02
> >> >>       > >>>  disk 6: 1.39e-01 1.38e-01
> >> >>       > >>>  disk 7: 1.11e-01 1.09e-01
> >> >>       > >>>  disk 8: 8.33e-02 8.48e-02
> >> >>       > >>>
> >> >>       > >>>
> >> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
> >> >>       > >>>
> >> >>
> >> >> ------------------------------------------------------------------------
> >> >>       > >>> Before: All replicas on each hard drive
> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >>       > >>>  disk 0: 2.44e-01 2.36e-01
> >> >>       > >>>  disk 1: 2.44e-01 2.38e-01
> >> >>       > >>>  disk 2: 2.44e-01 2.34e-01
> >> >>       > >>>  disk 3: 2.44e-01 2.38e-01
> >> >>       > >>>  disk 4: 2.44e-02 5.37e-02
> >> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
> >> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
> >> >>       > >>> ...
> >> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
> >> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
> >> >>       > >>> Converged to desired accuracy :)
> >> >>       > >>> After: All replicas on each hard drive
> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >>       > >>>  disk 0: 2.44e-01 2.46e-01
> >> >>       > >>>  disk 1: 2.44e-01 2.44e-01
> >> >>       > >>>  disk 2: 2.44e-01 2.41e-01
> >> >>       > >>>  disk 3: 2.44e-01 2.45e-01
> >> >>       > >>>  disk 4: 2.44e-02 2.33e-02
> >> >>       > >>>
> >> >>       > >>>
> >> >>       > >>> [1] crush hack
> >> >>       > >>> http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >> >>       > >>> [2] python-crush hack
> >> >>       > >>> http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >> >>       > >>>
> >> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >> >>       > >>>> Hi Pedro,
> >> >>       > >>>>
> >> >>       > >>>> It looks like trying to experiment with crush won't work
> >> >>       as expected because crush does not distinguish the probability
> >> >>       of selecting the first device from the probability of selecting
> >> >>       the second or third device. Am I mistaken ?
> >> >>       > >>>>
> >> >>       > >>>> Cheers
> >> >>       > >>>>
> >> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >> >>       > >>>>> Hi Pedro,
> >> >>       > >>>>>
> >> >>       > >>>>> I'm going to experiment with what you did at
> >> >>       > >>>>>
> >> >>       > >>>>>
> >> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >> >>       > >>>>>
> >> >>       > >>>>> and the latest python-crush published today. A
> >> >>       comparison function was added that will help measure the data
> >> >>       movement. I'm hoping we can release an offline tool based on
> >> >>       your solution. Please let me know if I should wait before diving
> >> >>       into this, in case you have unpublished drafts or new ideas.
> >> >>       > >>>>>
> >> >>       > >>>>> Cheers
> >> >>       > >>>>>
> >> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >> >>       > >>>>>> Great, thanks for the clarifications.
> >> >>       > >>>>>> I also think that the most natural way is to keep just
> >> >>       a set of
> >> >>       > >>>>>> weights in the CRUSH map and update them inside the
> >> >>       algorithm.
> >> >>       > >>>>>>
> >> >>       > >>>>>> I keep working on it.
> >> >>       > >>>>>>
> >> >>       > >>>>>>
> >> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> >> >>       <sage@newdream.net>:
> >> >>       > >>>>>>> Hi Pedro,
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
> >> >>       problem and we
> >> >>       > >>>>>>> haven't made much headway.
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >> >>       > >>>>>>>> Hi,
> >> >>       > >>>>>>>>
> >> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
> >> >>       much but I have
> >> >>       > >>>>>>>> been thinking about it. In order to adapt the
> >> >>       previous algorithm in
> >> >>       > >>>>>>>> the python notebook I need to substitute the
> >> >>       iteration over all
> >> >>       > >>>>>>>> possible devices permutations to iteration over all
> >> >>       the possible
> >> >>       > >>>>>>>> selections that crush would make. That is the main
> >> >>       thing I need to
> >> >>       > >>>>>>>> work on.
> >> >>       > >>>>>>>>
> >> >>       > >>>>>>>> The other thing is of course that weights change for
> >> >>       each replica.
> >> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
> >> >>       map. So the
> >> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
> >> >>       the map, need to be
> >> >>       > >>>>>>>> changed. The weights in the crush map should reflect
> >> >>       then, maybe, the
> >> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
> >> >>       should have their own
> >> >>       > >>>>>>>> crush map, but then the information about the
> >> >>       previous selection
> >> >>       > >>>>>>>> should be passed to the next replica placement run so
> >> >>       it avoids
> >> >>       > >>>>>>>> selecting the same one again.
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> My suspicion is that the best solution here (whatever
> >> >>       that means!)
> >> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
> >> >>       distribution, and
> >> >>       > >>>>>>> then generates a set of derivative weights--probably
> >> >>       one set for each
> >> >>       > >>>>>>> round/replica/rank.
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> One nice property of this is that once the support is
> >> >>       added to encode
> >> >>       > >>>>>>> multiple sets of weights, the algorithm used to
> >> >>       generate them is free to
> >> >>       > >>>>>>> change and evolve independently.  (In most cases any
> >> >>       change is
> >> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> >> >>       because all
> >> >>       > >>>>>>> parties participating in the cluster have to support
> >> >>       any new behavior
> >> >>       > >>>>>>> before it is enabled or used.)
> >> >>       > >>>>>>>
> >> >>       > >>>>>>>> I have a question also. Is there any significant
> >> >>       difference between
> >> >>       > >>>>>>>> the device selection algorithm description in the
> >> >>       paper and its final
> >> >>       > >>>>>>>> implementation?
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
> >> >>       found to be a bad
> >> >>       > >>>>>>> idea; any collision or failed()/overload() case
> >> >>       triggers the
> >> >>       > >>>>>>> retry_descent.
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> There are other changes, of course, but I don't think
> >> >>       they'll impact any
> >> >>       > >>>>>>> solution we come with here (or at least any solution
> >> >>       can be suitably
> >> >>       > >>>>>>> adapted)!
> >> >>       > >>>>>>>
> >> >>       > >>>>>>> sage
> >> >>       > >>>>>> --
> >> >>       > >>>>>> To unsubscribe from this list: send the line
> >> >>       "unsubscribe ceph-devel" in
> >> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
> >> >>       > >>>>>> More majordomo info at
> >> >>       http://vger.kernel.org/majordomo-info.html
> >> >>       > >>>>>>
> >> >>       > >>>>>
> >> >>       > >>>>
> >> >>       > >>>
> >> >>       > >>> --
> >> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
> >> >>       > >> --
> >> >>       > >> To unsubscribe from this list: send the line "unsubscribe
> >> >>       ceph-devel" in
> >> >>       > >> the body of a message to majordomo@vger.kernel.org
> >> >>       > >> More majordomo info at
> >> >>       http://vger.kernel.org/majordomo-info.html
> >> >>       > >>
> >> >>       > >
> >> >>       > > --
> >> >>       > > Loïc Dachary, Artisan Logiciel Libre
> >> >>       > --
> >> >>       > To unsubscribe from this list: send the line "unsubscribe
> >> >>       ceph-devel" in
> >> >>       > the body of a message to majordomo@vger.kernel.org
> >> >>       > More majordomo info at
> >> >>       http://vger.kernel.org/majordomo-info.html
> >> >>       >
> >> >>       >
> >> >>
> >> >>
> >> >>
> >>
> >> --
> >> Loïc Dachary, Artisan Logiciel Libre
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-27 13:39                                                     ` Sage Weil
@ 2017-03-28  6:52                                                       ` Adam Kupczyk
  2017-03-28  9:49                                                         ` Spandan Kumar Sahu
  2017-03-28 13:35                                                         ` Sage Weil
  0 siblings, 2 replies; 70+ messages in thread
From: Adam Kupczyk @ 2017-03-28  6:52 UTC (permalink / raw)
  To: Sage Weil, Pedro López-Adeva; +Cc: Loic Dachary, Ceph Development

"... or simply have a single global set"

No. Proof by example:

I once attempted to perfectly balance cluster X by modifying crush weights.
Pool A spanned 352 OSDs (set A).
Pool B spanned 176 OSDs (set B, half of A).
The result (a simulated perfect balance) was that the obtained weights had
- small variance within B (5%),
- small variance within A-B (5%),
- huge variance across A (800%).
This was of course because crush had to be strongly discouraged from
picking from B when performing placement for A.
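
To make this concrete, here is a rough toy model (not cluster X above:
a made-up 10-OSD layout with uniform capacities, and expected loads
computed analytically from weight-proportional picks instead of
running CRUSH) of how a single global weight vector gets pushed far
from the capacity weights as soon as pools overlap:

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

N = 10                        # set A = OSDs 0..9 (uniform capacity, made up)
B = np.arange(5)              # set B = OSDs 0..4 (half of A)
PGS_A, PGS_B = 1000.0, 600.0  # PG copies per pool (made up)

def expected_load(w):
    # weight-proportional expected placement, a stand-in for running CRUSH
    w = np.asarray(w, dtype=float)
    load = PGS_A * w / w.sum()            # pool A draws from all of set A
    load[B] += PGS_B * w[B] / w[B].sum()  # pool B draws only from set B
    return load

def imbalance(w):
    load = expected_load(w)
    return np.var(load / load.mean())

print(expected_load(np.ones(N)))  # capacity weights: 220 on B, 100 on A-B

res = minimize(imbalance, np.ones(N), bounds=[(0.01, None)] * N)
w = res.x / res.x[B].mean()
print(np.round(w, 2))             # roughly [1 1 1 1 1 4 4 4 4 4]
print("spread inside set A:", w.max() / w.min())
- END CODE -

The single vector that equalizes total load ends up roughly 4x apart
between set B and set A-B even though every device has the same
capacity, while two per-pool vectors (all weights equal within each
pool) would balance both pools with no distortion at all.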

"...crush users can choose..."
For each pool there is only one vector of weights that will provide
perfect balance. (Math note: there are actually multiple such vectors,
but they differ only by a scale factor.)
I cannot at the moment imagine any practical metric other than
balancing. But maybe that is just a failure of imagination.
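
Regarding the scale note: a straw2-style draw only compares ln(u)/w
across the candidate items, so multiplying every weight by the same
positive constant changes nothing. A quick sanity check (a simplified
stand-in, not the real libcrush hash):

- BEGIN CODE -
import numpy as np

rng = np.random.default_rng(42)

def pick(weights, u):
    # straw2-like: straw = ln(u)/w, the largest straw wins
    return int(np.argmax(np.log(u) / np.asarray(weights, dtype=float)))

weights = np.array([10.0, 10.0, 10.0, 5.0, 5.0])
draws = 1.0 - rng.random((20000, len(weights)))   # uniform in (0, 1]

picks1 = [pick(weights, u) for u in draws]
picks2 = [pick(weights * 7.3, u) for u in draws]  # same vector, rescaled
print(picks1 == picks2)                   # True: identical mapping
print(np.bincount(picks1) / len(picks1))  # ~[.25 .25 .25 .125 .125]
- END CODE -

So "one vector per pool" really means one direction per pool; any
convenient normalization can be used when it is stored.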

On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
> On Mon, 27 Mar 2017, Adam Kupczyk wrote:
>> Hi,
>>
>> My understanding is that optimal tweaked weights will depend on:
>> 1) pool_id, because of rjenkins(pool_id) in crush
>> 2) number of placement groups and replication factor, as it determines
>> amount of samples
>>
>> Therefore tweaked weights should rather be property of instantialized pool,
>> not crush placement definition.
>>
>> If tweaked weights are to be part of crush definition, than for each
>> created pool we need to have separate list of weights.
>> Is it possible to provide clients with different weights depending on on
>> which pool they want to operate?
>
> As Loic suggested, you can create as many derivative hierarchies in the
> crush map as you like, potentially one per pool.  Or you could treat the
> sum total of all pgs as the interesting set, balance those, and get some
> OSDs doing a bit more of one pool than another.  The new post-CRUSH OSD
> remap capability can always clean this up (and turn a "good" crush
> distribution into a perfect distribution).
>
> I guess the question is: when we add the explicit adjusted weight matrix
> to crush should we have multiple sets of weights (perhaps one for each
> pool), or simply have a single global set.  It might make sense to allow N
> sets of adjusted weights so that the crush users can choose a particular
> set of them for different pools (or whatever it is they're calculating the
> mapping for)..
>
> sage
>
>
>>
>> Best regards,
>> Adam
>>
>> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>> > Hi,
>> >
>> > My understanding is that optimal tweaked weights will depend on:
>> > 1) pool_id, because of rjenkins(pool_id) in crush
>> > 2) number of placement groups and replication factor, as it determines
>> > amount of samples
>> >
>> > Therefore tweaked weights should rather be property of instantialized pool,
>> > not crush placement definition.
>> >
>> > If tweaked weights are to be part of crush definition, than for each created
>> > pool we need to have separate list of weights.
>> > Is it possible to provide clients with different weights depending on on
>> > which pool they want to operate?
>> >
>> > Best regards,
>> > Adam
>> >
>> >
>> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>> >>
>> >>
>> >>
>> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
>> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>> >> >> Hello Sage, Loic, Pedro,
>> >> >>
>> >> >>
>> >> >> I am certain that almost perfect mapping can be achieved by
>> >> >> substituting weights from crush map with slightly modified weights.
>> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly
>> >> >> proportional to weights specified in crush map.
>> >> >>
>> >> >> 1. Example
>> >> >> Lets think of PGs of single object pool.
>> >> >> We have OSDs with following weights:
>> >> >> [10, 10, 10, 5, 5]
>> >> >>
>> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
>> >> >> PGcopies :
>> >> >> [150, 150, 150, 75, 75]
>> >> >>
>> >> >> However, because crush simulates random process we have:
>> >> >> [143, 152, 158, 71, 76]
>> >> >>
>> >> >> We could have obtained perfect distribution had we used weights like
>> >> >> this:
>> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>> >> >>
>> >> >>
>> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>> >> >>
>> >> >> When we apply crush for the first time, distribution of PGs comes as
>> >> >> random.
>> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>> >> >>
>> >> >> But CRUSH is not random proces at all, it behaves in numerically stable
>> >> >> way.
>> >> >> Specifically, if we increase weight on one node, we will get more PGs
>> >> >> on
>> >> >> this node and less on every other node:
>> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>> >> >>
>> >> >> Now, finding ideal weights can be done by any numerical minimization
>> >> >> method,
>> >> >> for example NLMS.
>> >> >>
>> >> >>
>> >> >> 3. The proposal
>> >> >> For each pool, from initial weights given in crush map perfect weights
>> >> >> will
>> >> >> be derived.
>> >> >> This weights will be used to calculate PG distribution. This of course
>> >> >> will
>> >> >> be close to perfect.
>> >> >>
>> >> >> 3a: Downside when OSD is out
>> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>> >> >> Because now weights deviate from OSD capacity, some OSDs will
>> >> >> statistically
>> >> >> get more copies then they should.
>> >> >> This unevenness in distribution is proportional to scale of deviation
>> >> >> of
>> >> >> calculated weights to capacity weights.
>> >> >>
>> >> >> 3b: Upside
>> >> >> This all can be achieved without changes to crush.
>> >> >
>> >> > Yes!
>> >> >
>> >> > And no.  You're totally right--we should use an offline optimization to
>> >> > tweak the crush input weights to get a better balance.  It won't be
>> >> > robust
>> >> > to changes to the cluster, but we can incrementally optimize after that
>> >> > happens to converge on something better.
>> >> >
>> >> > The problem with doing this with current versions of Ceph is that we
>> >> > lose
>> >> > the original "input" or "target" weights (i.e., the actual size of
>> >> > the OSD) that we want to converge on.  This is one reason why we haven't
>> >> > done something like this before.
>> >> >
>> >> > In luminous we *could* work around this by storing those canonical
>> >> > weights outside of crush using something (probably?) ugly and
>> >> > maintain backward compatibility with older clients using existing
>> >> > CRUSH behavior.
>> >>
>> >> These canonical weights could be stored in crush by creating dedicated
>> >> buckets. For instance the root-canonical bucket could be created to store
>> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
>> >> the difference and know to add a new device in the host01-canonical bucket
>> >> instead of the host01 bucket. And to run an offline tool to keep the two
>> >> buckets in sync and compute the weight to use for placement derived from the
>> >> weights representing the device capacity.
>> >>
>> >> It is a little bit ugly ;-)
>> >>
>> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
>> >> > to
>> >> > include a set of derivative weights used for actual placement
>> >> > calculations
>> >> > instead of the canonical target weights, and we can do what you're
>> >> > proposing *and* solve the multipick problem with one change in the crush
>> >> > map and algorithm.  (Actually choosing those derivative weights will
>> >> > be an offline process that can both improve the balance for the inputs
>> >> > we
>> >> > care about *and* adjust them based on the position to fix the skew issue
>> >> > for replicas.)  This doesn't help pre-luminous clients, but I think the
>> >> > end solution will be simpler and more elegant...
>> >> >
>> >> > What do you think?
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >> 4. Extra
>> >> >> Some time ago I made such change to perfectly balance Thomson-Reuters
>> >> >> cluster.
>> >> >> It succeeded.
>> >> >> A solution was not accepted, because modification of OSD weights were
>> >> >> higher
>> >> >> then 50%, which was caused by fact that different placement rules
>> >> >> operated
>> >> >> on different sets of OSDs, and those sets were not disjointed.
>> >> >
>> >> >
>> >> >>
>> >> >> Best regards,
>> >> >> Adam
>> >> >>
>> >> >>
>> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>> >> >>       Hi Pedro, Loic,
>> >> >>
>> >> >>       For what it's worth, my intuition here (which has had a mixed
>> >> >>       record as
>> >> >>       far as CRUSH goes) is that this is the most promising path
>> >> >>       forward.
>> >> >>
>> >> >>       Thinking ahead a few steps, and confirming that I'm following
>> >> >>       the
>> >> >>       discussion so far, if you're able to do get black (or white) box
>> >> >>       gradient
>> >> >>       descent to work, then this will give us a set of weights for
>> >> >>       each item in
>> >> >>       the tree for each selection round, derived from the tree
>> >> >>       structure and
>> >> >>       original (target) weights.  That would basically give us a map
>> >> >>       of item id
>> >> >>       (bucket id or leave item id) to weight for each round.  i.e.,
>> >> >>
>> >> >>        map<int, map<int, float>> weight_by_position;  // position ->
>> >> >>       item -> weight
>> >> >>
>> >> >>       where the 0 round would (I think?) match the target weights, and
>> >> >>       each
>> >> >>       round after that would skew low-weighted items lower to some
>> >> >>       degree.
>> >> >>       Right?
>> >> >>
>> >> >>       The next question I have is: does this generalize from the
>> >> >>       single-bucket
>> >> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>> >> >>       like
>> >> >>
>> >> >>       3.1
>> >> >>        |_____________
>> >> >>        |   \    \    \
>> >> >>       1.0  1.0  1.0  .1
>> >> >>
>> >> >>       it clearly works, but when we have a multi-level tree like
>> >> >>
>> >> >>
>> >> >>       8.4
>> >> >>        |____________________________________
>> >> >>        |                 \                  \
>> >> >>       3.1                3.1                2.2
>> >> >>        |_____________     |_____________     |_____________
>> >> >>        |   \    \    \    |   \    \    \    |   \    \    \
>> >> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>> >> >>
>> >> >>       and the second round weights skew the small .1 leaves lower, can
>> >> >>       we
>> >> >>       continue to build the summed-weight hierarchy, such that the
>> >> >>       adjusted
>> >> >>       weights at the higher level are appropriately adjusted to give
>> >> >>       us the
>> >> >>       right probabilities of descending into those trees?  I'm not
>> >> >>       sure if that
>> >> >>       logically follows from the above or if my intuition is
>> >> >>       oversimplifying
>> >> >>       things.
>> >> >>
>> >> >>       If this *is* how we think this will shake out, then I'm
>> >> >>       wondering if we
>> >> >>       should go ahead and build this weigh matrix into CRUSH sooner
>> >> >>       rather
>> >> >>       than later (i.e., for luminous).  As with the explicit
>> >> >>       remappings, the
>> >> >>       hard part is all done offline, and the adjustments to the CRUSH
>> >> >>       mapping
>> >> >>       calculation itself (storing and making use of the adjusted
>> >> >>       weights for
>> >> >>       each round of placement) are relatively straightforward.  And
>> >> >>       the sooner
>> >> >>       this is incorporated into a release the sooner real users will
>> >> >>       be able to
>> >> >>       roll out code to all clients and start making us of it.
>> >> >>
>> >> >>       Thanks again for looking at this problem!  I'm excited that we
>> >> >>       may be
>> >> >>       closing in on a real solution!
>> >> >>
>> >> >>       sage
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>> >> >>
>> >> >>       > There are lot of gradient-free methods. I will try first to
>> >> >>       run the
>> >> >>       > ones available using just scipy
>> >> >>       >
>> >> >>
>> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>> >> >>       > Some of them don't require the gradient and some of them can
>> >> >>       estimate
>> >> >>       > it. The reason to go without the gradient is to run the CRUSH
>> >> >>       > algorithm as a black box. In that case this would be the
>> >> >>       pseudo-code:
>> >> >>       >
>> >> >>       > - BEGIN CODE -
>> >> >>       > def build_target(desired_freqs):
>> >> >>       >     def target(weights):
>> >> >>       >         # run a simulation of CRUSH for a number of objects
>> >> >>       >         sim_freqs = run_crush(weights)
>> >> >>       >         # Kullback-Leibler divergence between desired
>> >> >>       frequencies and
>> >> >>       > current ones
>> >> >>       >         return loss(sim_freqs, desired_freqs)
>> >> >>       >    return target
>> >> >>       >
>> >> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>> >> >>       > - END CODE -
>> >> >>       >
>> >> >>       > The tricky thing here is that this procedure can be slow if
>> >> >>       the
>> >> >>       > simulation (run_crush) needs to place a lot of objects to get
>> >> >>       accurate
>> >> >>       > simulated frequencies. This is true specially if the minimize
>> >> >>       method
>> >> >>       > attempts to approximate the gradient using finite differences
>> >> >>       since it
>> >> >>       > will evaluate the target function a number of times
>> >> >>       proportional to
>> >> >>       > the number of weights). Apart from the ones in scipy I would
>> >> >>       try also
>> >> >>       > optimization methods that try to perform as few evaluations as
>> >> >>       > possible like for example HyperOpt
>> >> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>> >> >>       into
>> >> >>       > account that the target function can be noisy.
>> >> >>       >
>> >> >>       > This black box approximation is simple to implement and makes
>> >> >>       the
>> >> >>       > computer do all the work instead of us.
>> >> >>       > I think that this black box approximation is worthy to try
>> >> >>       even if
>> >> >>       > it's not the final one because if this approximation works
>> >> >>       then we
>> >> >>       > know that a more elaborate one that computes the gradient of
>> >> >>       the CRUSH
>> >> >>       > algorithm will work for sure.
>> >> >>       >
>> >> >>       > I can try this black box approximation this weekend not on the
>> >> >>       real
>> >> >>       > CRUSH algorithm but with the simple implementation I did in
>> >> >>       python. If
>> >> >>       > it works it's just a matter of substituting one simulation
>> >> >>       with
>> >> >>       > another and see what happens.
>> >> >>       >
>> >> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >> >>       > > Hi Pedro,
>> >> >>       > >
>> >> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>> >> >>       > >> Hi Loic,
>> >> >>       > >>
>> >> >>       > >> From what I see everything seems OK.
>> >> >>       > >
>> >> >>       > > Cool. I'll keep going in this direction then !
>> >> >>       > >
>> >> >>       > >> The interesting thing would be to
>> >> >>       > >> test on some complex mapping. The reason is that
>> >> >>       "CrushPolicyFamily"
>> >> >>       > >> is right now modeling just a single straw bucket not the
>> >> >>       full CRUSH
>> >> >>       > >> algorithm.
>> >> >>       > >
>> >> >>       > > A number of use cases use a single straw bucket, maybe the
>> >> >>       majority of them. Even though it does not reflect the full range
>> >> >>       of what crush can offer, it could be useful. To be more
>> >> >>       specific, a crush map that states "place objects so that there
>> >> >>       is at most one replica per host" or "one replica per rack" is
>> >> >>       common. Such a crushmap can be reduced to a single straw bucket
>> >> >>       that contains all the hosts and by using the CrushPolicyFamily,
>> >> >>       we can change the weights of each host to fix the probabilities.
>> >> >>       The hosts themselves contain disks with varying weights but I
>> >> >>       think we can ignore that because crush will only recurse to
>> >> >>       place one object within a given host.
>> >> >>       > >
>> >> >>       > >> That's the work that remains to be done. The only way that
>> >> >>       > >> would avoid reimplementing the CRUSH algorithm and
>> >> >>       computing the
>> >> >>       > >> gradient would be treating CRUSH as a black box and
>> >> >>       eliminating the
>> >> >>       > >> necessity of computing the gradient either by using a
>> >> >>       gradient-free
>> >> >>       > >> optimization method or making an estimation of the
>> >> >>       gradient.
>> >> >>       > >
>> >> >>       > > By gradient-free optimization you mean simulated annealing
>> >> >>       or Monte Carlo ?
>> >> >>       > >
>> >> >>       > > Cheers
>> >> >>       > >
>> >> >>       > >>
>> >> >>       > >>
>> >> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> >> >>       > >>> Hi,
>> >> >>       > >>>
>> >> >>       > >>> I modified the crush library to accept two weights (one
>> >> >>       for the first disk, the other for the remaining disks)[1]. This
>> >> >>       really is a hack for experimentation purposes only ;-) I was
>> >> >>       able to run a variation of your code[2] and got the following
>> >> >>       results which are encouraging. Do you think what I did is
>> >> >>       sensible ? Or is there a problem I don't see ?
>> >> >>       > >>>
>> >> >>       > >>> Thanks !
>> >> >>       > >>>
>> >> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
>> >> >>       6]
>> >> >>       > >>>
>> >> >>
>> >> >> ------------------------------------------------------------------------
>> >> >>       > >>> Before: All replicas on each hard drive
>> >> >>       > >>> Expected vs actual use (20000 samples)
>> >> >>       > >>>  disk 0: 1.39e-01 1.12e-01
>> >> >>       > >>>  disk 1: 1.11e-01 1.10e-01
>> >> >>       > >>>  disk 2: 8.33e-02 1.13e-01
>> >> >>       > >>>  disk 3: 1.39e-01 1.11e-01
>> >> >>       > >>>  disk 4: 1.11e-01 1.11e-01
>> >> >>       > >>>  disk 5: 8.33e-02 1.11e-01
>> >> >>       > >>>  disk 6: 1.39e-01 1.12e-01
>> >> >>       > >>>  disk 7: 1.11e-01 1.12e-01
>> >> >>       > >>>  disk 8: 8.33e-02 1.10e-01
>> >> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>> >> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>> >> >>       > >>> ...
>> >> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>> >> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>> >> >>       > >>> Converged to desired accuracy :)
>> >> >>       > >>> After: All replicas on each hard drive
>> >> >>       > >>> Expected vs actual use (20000 samples)
>> >> >>       > >>>  disk 0: 1.39e-01 1.42e-01
>> >> >>       > >>>  disk 1: 1.11e-01 1.09e-01
>> >> >>       > >>>  disk 2: 8.33e-02 8.37e-02
>> >> >>       > >>>  disk 3: 1.39e-01 1.40e-01
>> >> >>       > >>>  disk 4: 1.11e-01 1.13e-01
>> >> >>       > >>>  disk 5: 8.33e-02 8.08e-02
>> >> >>       > >>>  disk 6: 1.39e-01 1.38e-01
>> >> >>       > >>>  disk 7: 1.11e-01 1.09e-01
>> >> >>       > >>>  disk 8: 8.33e-02 8.48e-02
>> >> >>       > >>>
>> >> >>       > >>>
>> >> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>> >> >>       > >>>
>> >> >>
>> >> >> ------------------------------------------------------------------------
>> >> >>       > >>> Before: All replicas on each hard drive
>> >> >>       > >>> Expected vs actual use (20000 samples)
>> >> >>       > >>>  disk 0: 2.44e-01 2.36e-01
>> >> >>       > >>>  disk 1: 2.44e-01 2.38e-01
>> >> >>       > >>>  disk 2: 2.44e-01 2.34e-01
>> >> >>       > >>>  disk 3: 2.44e-01 2.38e-01
>> >> >>       > >>>  disk 4: 2.44e-02 5.37e-02
>> >> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>> >> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>> >> >>       > >>> ...
>> >> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>> >> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>> >> >>       > >>> Converged to desired accuracy :)
>> >> >>       > >>> After: All replicas on each hard drive
>> >> >>       > >>> Expected vs actual use (20000 samples)
>> >> >>       > >>>  disk 0: 2.44e-01 2.46e-01
>> >> >>       > >>>  disk 1: 2.44e-01 2.44e-01
>> >> >>       > >>>  disk 2: 2.44e-01 2.41e-01
>> >> >>       > >>>  disk 3: 2.44e-01 2.45e-01
>> >> >>       > >>>  disk 4: 2.44e-02 2.33e-02
>> >> >>       > >>>
>> >> >>       > >>>
>> >> >>       > >>> [1] crush hack
>> >> >>       > >>> http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>> >> >>       > >>> [2] python-crush hack
>> >> >>       > >>> http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>> >> >>       > >>>
>> >> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>> >> >>       > >>>> Hi Pedro,
>> >> >>       > >>>>
>> >> >>       > >>>> It looks like trying to experiment with crush won't work
>> >> >>       as expected because crush does not distinguish the probability
>> >> >>       of selecting the first device from the probability of selecting
>> >> >>       the second or third device. Am I mistaken ?
>> >> >>       > >>>>
>> >> >>       > >>>> Cheers
>> >> >>       > >>>>
>> >> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>> >> >>       > >>>>> Hi Pedro,
>> >> >>       > >>>>>
>> >> >>       > >>>>> I'm going to experiment with what you did at
>> >> >>       > >>>>>
>> >> >>       > >>>>>
>> >> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>> >> >>       > >>>>>
>> >> >>       > >>>>> and the latest python-crush published today. A
>> >> >>       comparison function was added that will help measure the data
>> >> >>       movement. I'm hoping we can release an offline tool based on
>> >> >>       your solution. Please let me know if I should wait before diving
>> >> >>       into this, in case you have unpublished drafts or new ideas.
>> >> >>       > >>>>>
>> >> >>       > >>>>> Cheers
>> >> >>       > >>>>>
>> >> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>> >> >>       > >>>>>> Great, thanks for the clarifications.
>> >> >>       > >>>>>> I also think that the most natural way is to keep just
>> >> >>       a set of
>> >> >>       > >>>>>> weights in the CRUSH map and update them inside the
>> >> >>       algorithm.
>> >> >>       > >>>>>>
>> >> >>       > >>>>>> I keep working on it.
>> >> >>       > >>>>>>
>> >> >>       > >>>>>>
>> >> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>> >> >>       <sage@newdream.net>:
>> >> >>       > >>>>>>> Hi Pedro,
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>> >> >>       problem and we
>> >> >>       > >>>>>>> haven't made much headway.
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>> >> >>       > >>>>>>>> Hi,
>> >> >>       > >>>>>>>>
>> >> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>> >> >>       much but I have
>> >> >>       > >>>>>>>> been thinking about it. In order to adapt the
>> >> >>       previous algorithm in
>> >> >>       > >>>>>>>> the python notebook I need to substitute the
>> >> >>       iteration over all
>> >> >>       > >>>>>>>> possible devices permutations to iteration over all
>> >> >>       the possible
>> >> >>       > >>>>>>>> selections that crush would make. That is the main
>> >> >>       thing I need to
>> >> >>       > >>>>>>>> work on.
>> >> >>       > >>>>>>>>
>> >> >>       > >>>>>>>> The other thing is of course that weights change for
>> >> >>       each replica.
>> >> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
>> >> >>       map. So the
>> >> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>> >> >>       the map, need to be
>> >> >>       > >>>>>>>> changed. The weights in the crush map should reflect
>> >> >>       then, maybe, the
>> >> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>> >> >>       should have their own
>> >> >>       > >>>>>>>> crush map, but then the information about the
>> >> >>       previous selection
>> >> >>       > >>>>>>>> should be passed to the next replica placement run so
>> >> >>       it avoids
>> >> >>       > >>>>>>>> selecting the same one again.
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> My suspicion is that the best solution here (whatever
>> >> >>       that means!)
>> >> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
>> >> >>       distribution, and
>> >> >>       > >>>>>>> then generates a set of derivative weights--probably
>> >> >>       one set for each
>> >> >>       > >>>>>>> round/replica/rank.
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> One nice property of this is that once the support is
>> >> >>       added to encode
>> >> >>       > >>>>>>> multiple sets of weights, the algorithm used to
>> >> >>       generate them is free to
>> >> >>       > >>>>>>> change and evolve independently.  (In most cases any
>> >> >>       change is
>> >> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>> >> >>       because all
>> >> >>       > >>>>>>> parties participating in the cluster have to support
>> >> >>       any new behavior
>> >> >>       > >>>>>>> before it is enabled or used.)
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>>> I have a question also. Is there any significant
>> >> >>       difference between
>> >> >>       > >>>>>>>> the device selection algorithm description in the
>> >> >>       paper and its final
>> >> >>       > >>>>>>>> implementation?
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>> >> >>       found to be a bad
>> >> >>       > >>>>>>> idea; any collision or failed()/overload() case
>> >> >>       triggers the
>> >> >>       > >>>>>>> retry_descent.
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> There are other changes, of course, but I don't think
>> >> >>       they'll impact any
>> >> >>       > >>>>>>> solution we come with here (or at least any solution
>> >> >>       can be suitably
>> >> >>       > >>>>>>> adapted)!
>> >> >>       > >>>>>>>
>> >> >>       > >>>>>>> sage
>> >> >>       > >>>>>> --
>> >> >>       > >>>>>> To unsubscribe from this list: send the line
>> >> >>       "unsubscribe ceph-devel" in
>> >> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>> >> >>       > >>>>>> More majordomo info at
>> >> >>       http://vger.kernel.org/majordomo-info.html
>> >> >>       > >>>>>>
>> >> >>       > >>>>>
>> >> >>       > >>>>
>> >> >>       > >>>
>> >> >>       > >>> --
>> >> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
>> >> >>       > >> --
>> >> >>       > >> To unsubscribe from this list: send the line "unsubscribe
>> >> >>       ceph-devel" in
>> >> >>       > >> the body of a message to majordomo@vger.kernel.org
>> >> >>       > >> More majordomo info at
>> >> >>       http://vger.kernel.org/majordomo-info.html
>> >> >>       > >>
>> >> >>       > >
>> >> >>       > > --
>> >> >>       > > Loïc Dachary, Artisan Logiciel Libre
>> >> >>       > --
>> >> >>       > To unsubscribe from this list: send the line "unsubscribe
>> >> >>       ceph-devel" in
>> >> >>       > the body of a message to majordomo@vger.kernel.org
>> >> >>       > More majordomo info at
>> >> >>       http://vger.kernel.org/majordomo-info.html
>> >> >>       >
>> >> >>       >
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >> --
>> >> Loïc Dachary, Artisan Logiciel Libre
>> >
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-28  6:52                                                       ` Adam Kupczyk
@ 2017-03-28  9:49                                                         ` Spandan Kumar Sahu
  2017-03-28 13:35                                                         ` Sage Weil
  1 sibling, 0 replies; 70+ messages in thread
From: Spandan Kumar Sahu @ 2017-03-28  9:49 UTC (permalink / raw)
  To: Adam Kupczyk
  Cc: Sage Weil, Pedro López-Adeva, Loic Dachary, Ceph Development

Hi

I have a somewhat different reweighting algorithm for the multi-pick
problem. It was a bit long, so I decided not to post it on the mail
thread; instead I have uploaded it to a GitHub repo [1].

I have explained it with examples, along with the reasons why I think
it can solve the problem. I would really appreciate it if someone could
go through it and suggest whether or not it is viable.

[1] : https://github.com/SpandanKumarSahu/Ceph_Proposal

On Tue, Mar 28, 2017 at 12:22 PM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> "... or simply have a single global set"
>
> No. Proof by example:
>
> I once attempted to perfectly balance cluster X by modifying crush weights.
> Pool A spanned over 352 OSDs (set A)
> Pool B spanned over 176 OSDs (set B, half of A)
> The result (simulated perfect balance) was that obtained weights had
> - small variance for B (5%),
> - small variance for A-B (5%).
> - huge variance for A (800%)
> This was of course because crush had to be strongly discouraged to
> pick from B, when performing placement for A.
>
> "...crush users can choose..."
> For each pool there is only one vector of weights that will provide
> perfect balance. (math note: actually multiple of them, but different
> by scale)
> I cannot at the moment imagine any other practical metrics other then
> balancing. But maybe it is just failure of imagination.
>
> On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
>> On Mon, 27 Mar 2017, Adam Kupczyk wrote:
>>> Hi,
>>>
>>> My understanding is that optimal tweaked weights will depend on:
>>> 1) pool_id, because of rjenkins(pool_id) in crush
>>> 2) number of placement groups and replication factor, as it determines
>>> amount of samples
>>>
>>> Therefore tweaked weights should rather be property of instantialized pool,
>>> not crush placement definition.
>>>
>>> If tweaked weights are to be part of crush definition, than for each
>>> created pool we need to have separate list of weights.
>>> Is it possible to provide clients with different weights depending on on
>>> which pool they want to operate?
>>
>> As Loic suggested, you can create as many derivative hierarchies in the
>> crush map as you like, potentially one per pool.  Or you could treat the
>> sum total of all pgs as the interesting set, balance those, and get some
>> OSDs doing a bit more of one pool than another.  The new post-CRUSH OSD
>> remap capability can always clean this up (and turn a "good" crush
>> distribution into a perfect distribution).
>>
>> I guess the question is: when we add the explicit adjusted weight matrix
>> to crush should we have multiple sets of weights (perhaps one for each
>> pool), or simply have a single global set.  It might make sense to allow N
>> sets of adjusted weights so that the crush users can choose a particular
>> set of them for different pools (or whatever it is they're calculating the
>> mapping for)..
>>
>> sage
>>
>>
>>>
>>> Best regards,
>>> Adam
>>>
>>> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
>>> > Hi,
>>> >
>>> > My understanding is that optimal tweaked weights will depend on:
>>> > 1) pool_id, because of rjenkins(pool_id) in crush
>>> > 2) number of placement groups and replication factor, as it determines
>>> > amount of samples
>>> >
>>> > Therefore tweaked weights should rather be property of instantialized pool,
>>> > not crush placement definition.
>>> >
>>> > If tweaked weights are to be part of crush definition, than for each created
>>> > pool we need to have separate list of weights.
>>> > Is it possible to provide clients with different weights depending on on
>>> > which pool they want to operate?
>>> >
>>> > Best regards,
>>> > Adam
>>> >
>>> >
>>> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
>>> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
>>> >> >> Hello Sage, Loic, Pedro,
>>> >> >>
>>> >> >>
>>> >> >> I am certain that almost perfect mapping can be achieved by
>>> >> >> substituting weights from crush map with slightly modified weights.
>>> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly
>>> >> >> proportional to weights specified in crush map.
>>> >> >>
>>> >> >> 1. Example
>>> >> >> Lets think of PGs of single object pool.
>>> >> >> We have OSDs with following weights:
>>> >> >> [10, 10, 10, 5, 5]
>>> >> >>
>>> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
>>> >> >> PGcopies :
>>> >> >> [150, 150, 150, 75, 75]
>>> >> >>
>>> >> >> However, because crush simulates random process we have:
>>> >> >> [143, 152, 158, 71, 76]
>>> >> >>
>>> >> >> We could have obtained perfect distribution had we used weights like
>>> >> >> this:
>>> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
>>> >> >>
>>> >> >>
>>> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
>>> >> >>
>>> >> >> When we apply crush for the first time, distribution of PGs comes as
>>> >> >> random.
>>> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
>>> >> >>
>>> >> >> But CRUSH is not random proces at all, it behaves in numerically stable
>>> >> >> way.
>>> >> >> Specifically, if we increase weight on one node, we will get more PGs
>>> >> >> on
>>> >> >> this node and less on every other node:
>>> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
>>> >> >>
>>> >> >> Now, finding ideal weights can be done by any numerical minimization
>>> >> >> method,
>>> >> >> for example NLMS.
>>> >> >>
>>> >> >>
>>> >> >> 3. The proposal
>>> >> >> For each pool, from initial weights given in crush map perfect weights
>>> >> >> will
>>> >> >> be derived.
>>> >> >> This weights will be used to calculate PG distribution. This of course
>>> >> >> will
>>> >> >> be close to perfect.
>>> >> >>
>>> >> >> 3a: Downside when OSD is out
>>> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
>>> >> >> Because now weights deviate from OSD capacity, some OSDs will
>>> >> >> statistically
>>> >> >> get more copies then they should.
>>> >> >> This unevenness in distribution is proportional to scale of deviation
>>> >> >> of
>>> >> >> calculated weights to capacity weights.
>>> >> >>
>>> >> >> 3b: Upside
>>> >> >> This all can be achieved without changes to crush.
>>> >> >
>>> >> > Yes!
>>> >> >
>>> >> > And no.  You're totally right--we should use an offline optimization to
>>> >> > tweak the crush input weights to get a better balance.  It won't be
>>> >> > robust
>>> >> > to changes to the cluster, but we can incrementally optimize after that
>>> >> > happens to converge on something better.
>>> >> >
>>> >> > The problem with doing this with current versions of Ceph is that we
>>> >> > lose
>>> >> > the original "input" or "target" weights (i.e., the actual size of
>>> >> > the OSD) that we want to converge on.  This is one reason why we haven't
>>> >> > done something like this before.
>>> >> >
>>> >> > In luminous we *could* work around this by storing those canonical
>>> >> > weights outside of crush using something (probably?) ugly and
>>> >> > maintain backward compatibility with older clients using existing
>>> >> > CRUSH behavior.
>>> >>
>>> >> These canonical weights could be stored in crush by creating dedicated
>>> >> buckets. For instance the root-canonical bucket could be created to store
>>> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
>>> >> the difference and know to add a new device in the host01-canonical bucket
>>> >> instead of the host01 bucket. And to run an offline tool to keep the two
>>> >> buckets in sync and compute the weight to use for placement derived from the
>>> >> weights representing the device capacity.
>>> >>
>>> >> It is a little bit ugly ;-)
>>> >>
>>> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
>>> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
>>> >> > to
>>> >> > include a set of derivative weights used for actual placement
>>> >> > calculations
>>> >> > instead of the canonical target weights, and we can do what you're
>>> >> > proposing *and* solve the multipick problem with one change in the crush
>>> >> > map and algorithm.  (Actually choosing those derivative weights will
>>> >> > be an offline process that can both improve the balance for the inputs
>>> >> > we
>>> >> > care about *and* adjust them based on the position to fix the skew issue
>>> >> > for replicas.)  This doesn't help pre-luminous clients, but I think the
>>> >> > end solution will be simpler and more elegant...
>>> >> >
>>> >> > What do you think?
>>> >> >
>>> >> > sage
>>> >> >
>>> >> >
>>> >> >> 4. Extra
>>> >> >> Some time ago I made such change to perfectly balance Thomson-Reuters
>>> >> >> cluster.
>>> >> >> It succeeded.
>>> >> >> A solution was not accepted, because modification of OSD weights were
>>> >> >> higher
>>> >> >> then 50%, which was caused by fact that different placement rules
>>> >> >> operated
>>> >> >> on different sets of OSDs, and those sets were not disjointed.
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> Best regards,
>>> >> >> Adam
>>> >> >>
>>> >> >>
>>> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
>>> >> >>       Hi Pedro, Loic,
>>> >> >>
>>> >> >>       For what it's worth, my intuition here (which has had a mixed
>>> >> >>       record as
>>> >> >>       far as CRUSH goes) is that this is the most promising path
>>> >> >>       forward.
>>> >> >>
>>> >> >>       Thinking ahead a few steps, and confirming that I'm following
>>> >> >>       the
>>> >> >>       discussion so far, if you're able to do get black (or white) box
>>> >> >>       gradient
>>> >> >>       descent to work, then this will give us a set of weights for
>>> >> >>       each item in
>>> >> >>       the tree for each selection round, derived from the tree
>>> >> >>       structure and
>>> >> >>       original (target) weights.  That would basically give us a map
>>> >> >>       of item id
>>> >> >>       (bucket id or leave item id) to weight for each round.  i.e.,
>>> >> >>
>>> >> >>        map<int, map<int, float>> weight_by_position;  // position ->
>>> >> >>       item -> weight
>>> >> >>
>>> >> >>       where the 0 round would (I think?) match the target weights, and
>>> >> >>       each
>>> >> >>       round after that would skew low-weighted items lower to some
>>> >> >>       degree.
>>> >> >>       Right?
>>> >> >>
>>> >> >>       The next question I have is: does this generalize from the
>>> >> >>       single-bucket
>>> >> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
>>> >> >>       like
>>> >> >>
>>> >> >>       3.1
>>> >> >>        |_____________
>>> >> >>        |   \    \    \
>>> >> >>       1.0  1.0  1.0  .1
>>> >> >>
>>> >> >>       it clearly works, but when we have a multi-level tree like
>>> >> >>
>>> >> >>
>>> >> >>       8.4
>>> >> >>        |____________________________________
>>> >> >>        |                 \                  \
>>> >> >>       3.1                3.1                2.2
>>> >> >>        |_____________     |_____________     |_____________
>>> >> >>        |   \    \    \    |   \    \    \    |   \    \    \
>>> >> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
>>> >> >>
>>> >> >>       and the second round weights skew the small .1 leaves lower, can
>>> >> >>       we
>>> >> >>       continue to build the summed-weight hierarchy, such that the
>>> >> >>       adjusted
>>> >> >>       weights at the higher level are appropriately adjusted to give
>>> >> >>       us the
>>> >> >>       right probabilities of descending into those trees?  I'm not
>>> >> >>       sure if that
>>> >> >>       logically follows from the above or if my intuition is
>>> >> >>       oversimplifying
>>> >> >>       things.
>>> >> >>
>>> >> >>       If this *is* how we think this will shake out, then I'm
>>> >> >>       wondering if we
>>> >> >>       should go ahead and build this weigh matrix into CRUSH sooner
>>> >> >>       rather
>>> >> >>       than later (i.e., for luminous).  As with the explicit
>>> >> >>       remappings, the
>>> >> >>       hard part is all done offline, and the adjustments to the CRUSH
>>> >> >>       mapping
>>> >> >>       calculation itself (storing and making use of the adjusted
>>> >> >>       weights for
>>> >> >>       each round of placement) are relatively straightforward.  And
>>> >> >>       the sooner
>>> >> >>       this is incorporated into a release the sooner real users will
>>> >> >>       be able to
>>> >> >>       roll out code to all clients and start making us of it.
>>> >> >>
>>> >> >>       Thanks again for looking at this problem!  I'm excited that we
>>> >> >>       may be
>>> >> >>       closing in on a real solution!
>>> >> >>
>>> >> >>       sage
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
>>> >> >>
>>> >> >>       > There are lot of gradient-free methods. I will try first to
>>> >> >>       run the
>>> >> >>       > ones available using just scipy
>>> >> >>       >
>>> >> >>
>>> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> >> >>       > Some of them don't require the gradient and some of them can
>>> >> >>       estimate
>>> >> >>       > it. The reason to go without the gradient is to run the CRUSH
>>> >> >>       > algorithm as a black box. In that case this would be the
>>> >> >>       pseudo-code:
>>> >> >>       >
>>> >> >>       > - BEGIN CODE -
>>> >> >>       > def build_target(desired_freqs):
>>> >> >>       >     def target(weights):
>>> >> >>       >         # run a simulation of CRUSH for a number of objects
>>> >> >>       >         sim_freqs = run_crush(weights)
>>> >> >>       >         # Kullback-Leibler divergence between desired
>>> >> >>       frequencies and
>>> >> >>       > current ones
>>> >> >>       >         return loss(sim_freqs, desired_freqs)
>>> >> >>       >    return target
>>> >> >>       >
>>> >> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> >> >>       > - END CODE -
>>> >> >>       >
>>> >> >>       > The tricky thing here is that this procedure can be slow if
>>> >> >>       the
>>> >> >>       > simulation (run_crush) needs to place a lot of objects to get
>>> >> >>       accurate
>>> >> >>       > simulated frequencies. This is true specially if the minimize
>>> >> >>       method
>>> >> >>       > attempts to approximate the gradient using finite differences
>>> >> >>       since it
>>> >> >>       > will evaluate the target function a number of times
>>> >> >>       proportional to
>>> >> >>       > the number of weights). Apart from the ones in scipy I would
>>> >> >>       try also
>>> >> >>       > optimization methods that try to perform as few evaluations as
>>> >> >>       > possible like for example HyperOpt
>>> >> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
>>> >> >>       into
>>> >> >>       > account that the target function can be noisy.
>>> >> >>       >
>>> >> >>       > This black box approximation is simple to implement and makes
>>> >> >>       the
>>> >> >>       > computer do all the work instead of us.
>>> >> >>       > I think that this black box approximation is worthy to try
>>> >> >>       even if
>>> >> >>       > it's not the final one because if this approximation works
>>> >> >>       then we
>>> >> >>       > know that a more elaborate one that computes the gradient of
>>> >> >>       the CRUSH
>>> >> >>       > algorithm will work for sure.
>>> >> >>       >
>>> >> >>       > I can try this black box approximation this weekend not on the
>>> >> >>       real
>>> >> >>       > CRUSH algorithm but with the simple implementation I did in
>>> >> >>       python. If
>>> >> >>       > it works it's just a matter of substituting one simulation
>>> >> >>       with
>>> >> >>       > another and see what happens.
>>> >> >>       >
>>> >> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >> >>       > > Hi Pedro,
>>> >> >>       > >
>>> >> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> >> >>       > >> Hi Loic,
>>> >> >>       > >>
>>> >> >>       > >> From what I see everything seems OK.
>>> >> >>       > >
>>> >> >>       > > Cool. I'll keep going in this direction then !
>>> >> >>       > >
>>> >> >>       > >> The interesting thing would be to
>>> >> >>       > >> test on some complex mapping. The reason is that
>>> >> >>       "CrushPolicyFamily"
>>> >> >>       > >> is right now modeling just a single straw bucket not the
>>> >> >>       full CRUSH
>>> >> >>       > >> algorithm.
>>> >> >>       > >
>>> >> >>       > > A number of use cases use a single straw bucket, maybe the
>>> >> >>       majority of them. Even though it does not reflect the full range
>>> >> >>       of what crush can offer, it could be useful. To be more
>>> >> >>       specific, a crush map that states "place objects so that there
>>> >> >>       is at most one replica per host" or "one replica per rack" is
>>> >> >>       common. Such a crushmap can be reduced to a single straw bucket
>>> >> >>       that contains all the hosts and by using the CrushPolicyFamily,
>>> >> >>       we can change the weights of each host to fix the probabilities.
>>> >> >>       The hosts themselves contain disks with varying weights but I
>>> >> >>       think we can ignore that because crush will only recurse to
>>> >> >>       place one object within a given host.
>>> >> >>       > >
>>> >> >>       > >> That's the work that remains to be done. The only way that
>>> >> >>       > >> would avoid reimplementing the CRUSH algorithm and
>>> >> >>       computing the
>>> >> >>       > >> gradient would be treating CRUSH as a black box and
>>> >> >>       eliminating the
>>> >> >>       > >> necessity of computing the gradient either by using a
>>> >> >>       gradient-free
>>> >> >>       > >> optimization method or making an estimation of the
>>> >> >>       gradient.
>>> >> >>       > >
>>> >> >>       > > By gradient-free optimization you mean simulated annealing
>>> >> >>       or Monte Carlo ?
>>> >> >>       > >
>>> >> >>       > > Cheers
>>> >> >>       > >
>>> >> >>       > >>
>>> >> >>       > >>
>>> >> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> >> >>       > >>> Hi,
>>> >> >>       > >>>
>>> >> >>       > >>> I modified the crush library to accept two weights (one
>>> >> >>       for the first disk, the other for the remaining disks)[1]. This
>>> >> >>       really is a hack for experimentation purposes only ;-) I was
>>> >> >>       able to run a variation of your code[2] and got the following
>>> >> >>       results which are encouraging. Do you think what I did is
>>> >> >>       sensible ? Or is there a problem I don't see ?
>>> >> >>       > >>>
>>> >> >>       > >>> Thanks !
>>> >> >>       > >>>
>>> >> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
>>> >> >>       6]
>>> >> >>       > >>>
>>> >> >>
>>> >> >> ------------------------------------------------------------------------
>>> >> >>       > >>> Before: All replicas on each hard drive
>>> >> >>       > >>> Expected vs actual use (20000 samples)
>>> >> >>       > >>>  disk 0: 1.39e-01 1.12e-01
>>> >> >>       > >>>  disk 1: 1.11e-01 1.10e-01
>>> >> >>       > >>>  disk 2: 8.33e-02 1.13e-01
>>> >> >>       > >>>  disk 3: 1.39e-01 1.11e-01
>>> >> >>       > >>>  disk 4: 1.11e-01 1.11e-01
>>> >> >>       > >>>  disk 5: 8.33e-02 1.11e-01
>>> >> >>       > >>>  disk 6: 1.39e-01 1.12e-01
>>> >> >>       > >>>  disk 7: 1.11e-01 1.12e-01
>>> >> >>       > >>>  disk 8: 8.33e-02 1.10e-01
>>> >> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>> >> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>> >> >>       > >>> ...
>>> >> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>> >> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>> >> >>       > >>> Converged to desired accuracy :)
>>> >> >>       > >>> After: All replicas on each hard drive
>>> >> >>       > >>> Expected vs actual use (20000 samples)
>>> >> >>       > >>>  disk 0: 1.39e-01 1.42e-01
>>> >> >>       > >>>  disk 1: 1.11e-01 1.09e-01
>>> >> >>       > >>>  disk 2: 8.33e-02 8.37e-02
>>> >> >>       > >>>  disk 3: 1.39e-01 1.40e-01
>>> >> >>       > >>>  disk 4: 1.11e-01 1.13e-01
>>> >> >>       > >>>  disk 5: 8.33e-02 8.08e-02
>>> >> >>       > >>>  disk 6: 1.39e-01 1.38e-01
>>> >> >>       > >>>  disk 7: 1.11e-01 1.09e-01
>>> >> >>       > >>>  disk 8: 8.33e-02 8.48e-02
>>> >> >>       > >>>
>>> >> >>       > >>>
>>> >> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>> >> >>       > >>>
>>> >> >>
>>> >> >> ------------------------------------------------------------------------
>>> >> >>       > >>> Before: All replicas on each hard drive
>>> >> >>       > >>> Expected vs actual use (20000 samples)
>>> >> >>       > >>>  disk 0: 2.44e-01 2.36e-01
>>> >> >>       > >>>  disk 1: 2.44e-01 2.38e-01
>>> >> >>       > >>>  disk 2: 2.44e-01 2.34e-01
>>> >> >>       > >>>  disk 3: 2.44e-01 2.38e-01
>>> >> >>       > >>>  disk 4: 2.44e-02 5.37e-02
>>> >> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>> >> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>> >> >>       > >>> ...
>>> >> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>> >> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>> >> >>       > >>> Converged to desired accuracy :)
>>> >> >>       > >>> After: All replicas on each hard drive
>>> >> >>       > >>> Expected vs actual use (20000 samples)
>>> >> >>       > >>>  disk 0: 2.44e-01 2.46e-01
>>> >> >>       > >>>  disk 1: 2.44e-01 2.44e-01
>>> >> >>       > >>>  disk 2: 2.44e-01 2.41e-01
>>> >> >>       > >>>  disk 3: 2.44e-01 2.45e-01
>>> >> >>       > >>>  disk 4: 2.44e-02 2.33e-02
>>> >> >>       > >>>
>>> >> >>       > >>>
>>> >> >>       > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>> >> >>       > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>> >> >>       > >>>
>>> >> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>> >> >>       > >>>> Hi Pedro,
>>> >> >>       > >>>>
>>> >> >>       > >>>> It looks like trying to experiment with crush won't work
>>> >> >>       as expected because crush does not distinguish the probability
>>> >> >>       of selecting the first device from the probability of selecting
>>> >> >>       the second or third device. Am I mistaken ?
>>> >> >>       > >>>>
>>> >> >>       > >>>> Cheers
>>> >> >>       > >>>>
>>> >> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>> >> >>       > >>>>> Hi Pedro,
>>> >> >>       > >>>>>
>>> >> >>       > >>>>> I'm going to experiment with what you did at
>>> >> >>       > >>>>>
>>> >> >>       > >>>>>
>>> >> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>> >> >>       > >>>>>
>>> >> >>       > >>>>> and the latest python-crush published today. A
>>> >> >>       comparison function was added that will help measure the data
>>> >> >>       movement. I'm hoping we can release an offline tool based on
>>> >> >>       your solution. Please let me know if I should wait before diving
>>> >> >>       into this, in case you have unpublished drafts or new ideas.
>>> >> >>       > >>>>>
>>> >> >>       > >>>>> Cheers
>>> >> >>       > >>>>>
>>> >> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>> >> >>       > >>>>>> Great, thanks for the clarifications.
>>> >> >>       > >>>>>> I also think that the most natural way is to keep just
>>> >> >>       a set of
>>> >> >>       > >>>>>> weights in the CRUSH map and update them inside the
>>> >> >>       algorithm.
>>> >> >>       > >>>>>>
>>> >> >>       > >>>>>> I keep working on it.
>>> >> >>       > >>>>>>
>>> >> >>       > >>>>>>
>>> >> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
>>> >> >>       <sage@newdream.net>:
>>> >> >>       > >>>>>>> Hi Pedro,
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
>>> >> >>       problem and we
>>> >> >>       > >>>>>>> haven't made much headway.
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>> >> >>       > >>>>>>>> Hi,
>>> >> >>       > >>>>>>>>
>>> >> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
>>> >> >>       much but I have
>>> >> >>       > >>>>>>>> been thinking about it. In order to adapt the
>>> >> >>       previous algorithm in
>>> >> >>       > >>>>>>>> the python notebook I need to substitute the
>>> >> >>       iteration over all
>>> >> >>       > >>>>>>>> possible devices permutations to iteration over all
>>> >> >>       the possible
>>> >> >>       > >>>>>>>> selections that crush would make. That is the main
>>> >> >>       thing I need to
>>> >> >>       > >>>>>>>> work on.
>>> >> >>       > >>>>>>>>
>>> >> >>       > >>>>>>>> The other thing is of course that weights change for
>>> >> >>       each replica.
>>> >> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
>>> >> >>       map. So the
>>> >> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
>>> >> >>       the map, need to be
>>> >> >>       > >>>>>>>> changed. The weights in the crush map should reflect
>>> >> >>       then, maybe, the
>>> >> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
>>> >> >>       should have their own
>>> >> >>       > >>>>>>>> crush map, but then the information about the
>>> >> >>       previous selection
>>> >> >>       > >>>>>>>> should be passed to the next replica placement run so
>>> >> >>       it avoids
>>> >> >>       > >>>>>>>> selecting the same one again.
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> My suspicion is that the best solution here (whatever
>>> >> >>       that means!)
>>> >> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
>>> >> >>       distribution, and
>>> >> >>       > >>>>>>> then generates a set of derivative weights--probably
>>> >> >>       one set for each
>>> >> >>       > >>>>>>> round/replica/rank.
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> One nice property of this is that once the support is
>>> >> >>       added to encode
>>> >> >>       > >>>>>>> multiple sets of weights, the algorithm used to
>>> >> >>       generate them is free to
>>> >> >>       > >>>>>>> change and evolve independently.  (In most cases any
>>> >> >>       change in
>>> >> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
>>> >> >>       because all
>>> >> >>       > >>>>>>> parties participating in the cluster have to support
>>> >> >>       any new behavior
>>> >> >>       > >>>>>>> before it is enabled or used.)
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>>> I have a question also. Is there any significant
>>> >> >>       difference between
>>> >> >>       > >>>>>>>> the device selection algorithm description in the
>>> >> >>       paper and its final
>>> >> >>       > >>>>>>>> implementation?
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
>>> >> >>       found to be a bad
>>> >> >>       > >>>>>>> idea; any collision or failed()/overload() case
>>> >> >>       triggers the
>>> >> >>       > >>>>>>> retry_descent.
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> There are other changes, of course, but I don't think
>>> >> >>       they'll impact any
>>> >> >>       > >>>>>>> solution we come with here (or at least any solution
>>> >> >>       can be suitably
>>> >> >>       > >>>>>>> adapted)!
>>> >> >>       > >>>>>>>
>>> >> >>       > >>>>>>> sage
>>> >> >>       > >>>>>> --
>>> >> >>       > >>>>>> To unsubscribe from this list: send the line
>>> >> >>       "unsubscribe ceph-devel" in
>>> >> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
>>> >> >>       > >>>>>> More majordomo info at
>>> >> >>       http://vger.kernel.org/majordomo-info.html
>>> >> >>       > >>>>>>
>>> >> >>       > >>>>>
>>> >> >>       > >>>>
>>> >> >>       > >>>
>>> >> >>       > >>> --
>>> >> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
>>> >> >>       > >> --
>>> >> >>       > >> To unsubscribe from this list: send the line "unsubscribe
>>> >> >>       ceph-devel" in
>>> >> >>       > >> the body of a message to majordomo@vger.kernel.org
>>> >> >>       > >> More majordomo info at
>>> >> >>       http://vger.kernel.org/majordomo-info.html
>>> >> >>       > >>
>>> >> >>       > >
>>> >> >>       > > --
>>> >> >>       > > Loïc Dachary, Artisan Logiciel Libre
>>> >> >>       > --
>>> >> >>       > To unsubscribe from this list: send the line "unsubscribe
>>> >> >>       ceph-devel" in
>>> >> >>       > the body of a message to majordomo@vger.kernel.org
>>> >> >>       > More majordomo info at
>>> >> >>       http://vger.kernel.org/majordomo-info.html
>>> >> >>       >
>>> >> >>       >
>>> >> >>
>>> >> >>
>>> >> >>
>>> >>
>>> >> --
>>> >> Loïc Dachary, Artisan Logiciel Libre
>>> >
>>> >
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Spandan Kumar Sahu
IIT Kharagpur

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-28  6:52                                                       ` Adam Kupczyk
  2017-03-28  9:49                                                         ` Spandan Kumar Sahu
@ 2017-03-28 13:35                                                         ` Sage Weil
  1 sibling, 0 replies; 70+ messages in thread
From: Sage Weil @ 2017-03-28 13:35 UTC (permalink / raw)
  To: Adam Kupczyk; +Cc: Pedro López-Adeva, Loic Dachary, Ceph Development

[-- Attachment #1: Type: TEXT/PLAIN, Size: 31263 bytes --]

On Tue, 28 Mar 2017, Adam Kupczyk wrote:
> "... or simply have a single global set"
> 
> No. Proof by example:
> 
> I once attempted to perfectly balance cluster X by modifying crush weights.
> Pool A spanned over 352 OSDs (set A)
> Pool B spanned over 176 OSDs (set B, half of A)
> The result (simulated perfect balance) was that the obtained weights had
> - small variance for B (5%),
> - small variance for A-B (5%),
> - huge variance for A (800%).
> This was of course because crush had to be strongly discouraged from
> picking from B when performing placement for A.

FWIW in this situation I think we should aim to have the B OSDs more 
utilized than the A-B OSDs by exactly the amount of data in pool B divided 
by 176.  We should not try to make the A and A-B sets have equal 
utilization because the rules do not suggest that we should.  Does that 
make sense?  I.e., if we treat each pool's placement in isolation by 
*only* considering the PGs from pool A, then we should aim for 
perfect balance across A, and when we look only at B PGs we should see 
perfect balance across B, and the result will be that the B OSDs end up 
with more PGs than the A-B OSDs.
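
A quick back-of-the-envelope sketch of that expectation (illustration only;
the data sizes below are made up, not taken from cluster X):

- BEGIN CODE -
# Expected per-OSD load when pool B spans a subset (set B, 176 OSDs) of
# the 352 OSDs used by pool A.  data_a/data_b are hypothetical amounts.
osds_a, osds_b = 352, 176
data_a, data_b = 352.0, 100.0      # arbitrary units, for illustration

per_osd_from_a = data_a / osds_a   # every OSD in A gets this from pool A
per_osd_from_b = data_b / osds_b   # only the B OSDs get this extra share

load_a_minus_b = per_osd_from_a                 # OSDs in A but not in B
load_b = per_osd_from_a + per_osd_from_b        # OSDs in both pools

# The B OSDs should carry exactly data_b / 176 more than the A-B OSDs;
# forcing A and A-B to equal utilization would contradict the rules.
print(load_a_minus_b, load_b)
- END CODE -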

> "...crush users can choose..."
> For each pool there is only one vector of weights that will provide
> perfect balance. (math note: actually multiple of them, but different
> by scale)

Yeah, although again CRUSH doesn't need to be perfect here, just better; 
the new OSDMap remap can always fix up the loose ends to take the final 
step to perfect.

> I cannot at the moment imagine any practical metrics other than
> balancing. But maybe it is just a failure of imagination.

I can see us looking at other dimensions (e.g., trying to maximize the 
number of replica sets that span disk models) where there is no 
correlation to the hierarchy, but I'm not sure that fiddling with weights 
will really get us anywhere.

Also, the new device class hierarchies Loic just added could be expressed 
as alternative sets of bucket weights instead of the shadow hierarchy.  
Once the admin makes the leap to luminous compatibility as the baseline 
the map could compile to that instead of generating the hidden buckets it 
does now.

sage


> 
> On Mon, Mar 27, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
> > On Mon, 27 Mar 2017, Adam Kupczyk wrote:
> >> Hi,
> >>
> >> My understanding is that optimal tweaked weights will depend on:
> >> 1) pool_id, because of rjenkins(pool_id) in crush
> >> 2) number of placement groups and replication factor, as it determines
> >> amount of samples
> >>
> >> Therefore tweaked weights should rather be a property of the instantiated pool,
> >> not of the crush placement definition.
> >>
> >> If tweaked weights are to be part of the crush definition, then for each
> >> created pool we need to have a separate list of weights.
> >> Is it possible to provide clients with different weights depending on
> >> which pool they want to operate?
> >
> > As Loic suggested, you can create as many derivative hierarchies in the
> > crush map as you like, potentially one per pool.  Or you could treat the
> > sum total of all pgs as the interesting set, balance those, and get some
> > OSDs doing a bit more of one pool than another.  The new post-CRUSH OSD
> > remap capability can always clean this up (and turn a "good" crush
> > distribution into a perfect distribution).
> >
> > I guess the question is: when we add the explicit adjusted weight matrix
> > to crush should we have multiple sets of weights (perhaps one for each
> > pool), or simply have a single global set.  It might make sense to allow N
> > sets of adjusted weights so that the crush users can choose a particular
> > set of them for different pools (or whatever it is they're calculating the
> > mapping for)..
> >
> > sage
> >
> >
> >>
> >> Best regards,
> >> Adam
> >>
> >> On Mon, Mar 27, 2017 at 10:45 AM, Adam Kupczyk <akupczyk@mirantis.com> wrote:
> >> > Hi,
> >> >
> >> > My understanding is that optimal tweaked weights will depend on:
> >> > 1) pool_id, because of rjenkins(pool_id) in crush
> >> > 2) number of placement groups and replication factor, as it determines
> >> > amount of samples
> >> >
> >> > Therefore tweaked weights should rather be a property of the instantiated pool,
> >> > not of the crush placement definition.
> >> >
> >> > If tweaked weights are to be part of the crush definition, then for each created
> >> > pool we need to have a separate list of weights.
> >> > Is it possible to provide clients with different weights depending on
> >> > which pool they want to operate?
> >> >
> >> > Best regards,
> >> > Adam
> >> >
> >> >
> >> > On Mon, Mar 27, 2017 at 8:45 AM, Loic Dachary <loic@dachary.org> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On 03/27/2017 04:33 AM, Sage Weil wrote:
> >> >> > On Sun, 26 Mar 2017, Adam Kupczyk wrote:
> >> >> >> Hello Sage, Loic, Pedro,
> >> >> >>
> >> >> >>
> >> >> >> I am certain that almost perfect mapping can be achieved by
> >> >> >> substituting weights from crush map with slightly modified weights.
> >> >> >> By perfect mapping I mean we get on each OSD number of PGs exactly
> >> >> >> proportional to weights specified in crush map.
> >> >> >>
> >> >> >> 1. Example
> >> >> >> Lets think of PGs of single object pool.
> >> >> >> We have OSDs with following weights:
> >> >> >> [10, 10, 10, 5, 5]
> >> >> >>
> >> >> >> Ideally, we would like following distribution of 200PG x 3 copies = 600
> >> >> >> PGcopies :
> >> >> >> [150, 150, 150, 75, 75]
> >> >> >>
> >> >> >> However, because crush simulates random process we have:
> >> >> >> [143, 152, 158, 71, 76]
> >> >> >>
> >> >> >> We could have obtained perfect distribution had we used weights like
> >> >> >> this:
> >> >> >> [10.2, 9.9, 9.6, 5.2, 4.9]
> >> >> >>
> >> >> >>
> >> >> >> 2. Obtaining perfect mapping weights from OSD capacity weights
> >> >> >>
> >> >> >> When we apply crush for the first time, distribution of PGs comes as
> >> >> >> random.
> >> >> >> CRUSH([10, 10, 10, 5, 5]) -> [143, 152, 158, 71, 76]
> >> >> >>
> >> >> >> But CRUSH is not a random process at all, it behaves in a numerically stable
> >> >> >> way.
> >> >> >> Specifically, if we increase weight on one node, we will get more PGs
> >> >> >> on
> >> >> >> this node and less on every other node:
> >> >> >> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]
> >> >> >>
> >> >> >> Now, finding ideal weights can be done by any numerical minimization
> >> >> >> method,
> >> >> >> for example NLMS.
> >> >> >>
> >> >> >>
> >> >> >> 3. The proposal
> >> >> >> For each pool, from initial weights given in crush map perfect weights
> >> >> >> will
> >> >> >> be derived.
> >> >> >> This weights will be used to calculate PG distribution. This of course
> >> >> >> will
> >> >> >> be close to perfect.
> >> >> >>
> >> >> >> 3a: Downside when OSD is out
> >> >> >> When an OSD is out, missing PG copies will be replicated elsewhere.
> >> >> >> Because now weights deviate from OSD capacity, some OSDs will
> >> >> >> statistically
> >> >> >> get more copies than they should.
> >> >> >> This unevenness in distribution is proportional to scale of deviation
> >> >> >> of
> >> >> >> calculated weights to capacity weights.
> >> >> >>
> >> >> >> 3b: Upside
> >> >> >> This all can be achieved without changes to crush.
> >> >> >
> >> >> > Yes!
> >> >> >
> >> >> > And no.  You're totally right--we should use an offline optimization to
> >> >> > tweak the crush input weights to get a better balance.  It won't be
> >> >> > robust
> >> >> > to changes to the cluster, but we can incrementally optimize after that
> >> >> > happens to converge on something better.
> >> >> >
> >> >> > The problem with doing this with current versions of Ceph is that we
> >> >> > lose
> >> >> > the original "input" or "target" weights (i.e., the actual size of
> >> >> > the OSD) that we want to converge on.  This is one reason why we haven't
> >> >> > done something like this before.
> >> >> >
> >> >> > In luminous we *could* work around this by storing those canonical
> >> >> > weights outside of crush using something (probably?) ugly and
> >> >> > maintain backward compatibility with older clients using existing
> >> >> > CRUSH behavior.
> >> >>
> >> >> These canonical weights could be stored in crush by creating dedicated
> >> >> buckets. For instance the root-canonical bucket could be created to store
> >> >> the canonical weights of the root bucket. The sysadmin needs to be aware of
> >> >> the difference and know to add a new device in the host01-canonical bucket
> >> >> instead of the host01 bucket. And to run an offline tool to keep the two
> >> >> buckets in sync and compute the weight to use for placement derived from the
> >> >> weights representing the device capacity.
> >> >>
> >> >> It is a little bit ugly ;-)
> >> >>
> >> >> > OR, (and this is my preferred route), if the multi-pick anomaly approach
> >> >> > that Pedro is working on works out, we'll want to extend the CRUSH map
> >> >> > to
> >> >> > include a set of derivative weights used for actual placement
> >> >> > calculations
> >> >> > instead of the canonical target weights, and we can do what you're
> >> >> > proposing *and* solve the multipick problem with one change in the crush
> >> >> > map and algorithm.  (Actually choosing those derivative weights will
> >> >> > be an offline process that can both improve the balance for the inputs
> >> >> > we
> >> >> > care about *and* adjust them based on the position to fix the skew issue
> >> >> > for replicas.)  This doesn't help pre-luminous clients, but I think the
> >> >> > end solution will be simpler and more elegant...
> >> >> >
> >> >> > What do you think?
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >
> >> >> >> 4. Extra
> >> >> >> Some time ago I made such change to perfectly balance Thomson-Reuters
> >> >> >> cluster.
> >> >> >> It succeeded.
> >> >> >> A solution was not accepted, because modifications of OSD weights were
> >> >> >> higher
> >> >> >> than 50%, which was caused by the fact that different placement rules
> >> >> >> operated
> >> >> >> on different sets of OSDs, and those sets were not disjointed.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> Best regards,
> >> >> >> Adam
> >> >> >>
> >> >> >>
> >> >> >> On Sat, Mar 25, 2017 at 7:42 PM, Sage Weil <sage@newdream.net> wrote:
> >> >> >>       Hi Pedro, Loic,
> >> >> >>
> >> >> >>       For what it's worth, my intuition here (which has had a mixed
> >> >> >>       record as
> >> >> >>       far as CRUSH goes) is that this is the most promising path
> >> >> >>       forward.
> >> >> >>
> >> >> >>       Thinking ahead a few steps, and confirming that I'm following
> >> >> >>       the
> >> >> >>       discussion so far, if you're able to do get black (or white) box
> >> >> >>       gradient
> >> >> >>       descent to work, then this will give us a set of weights for
> >> >> >>       each item in
> >> >> >>       the tree for each selection round, derived from the tree
> >> >> >>       structure and
> >> >> >>       original (target) weights.  That would basically give us a map
> >> >> >>       of item id
> >> >> >>       (bucket id or leaf item id) to weight for each round.  i.e.,
> >> >> >>
> >> >> >>        map<int, map<int, float>> weight_by_position;  // position ->
> >> >> >>       item -> weight
> >> >> >>
> >> >> >>       where the 0 round would (I think?) match the target weights, and
> >> >> >>       each
> >> >> >>       round after that would skew low-weighted items lower to some
> >> >> >>       degree.
> >> >> >>       Right?
> >> >> >>
> >> >> >>       The next question I have is: does this generalize from the
> >> >> >>       single-bucket
> >> >> >>       case to the hierarchy?  I.e., if I have a "tree" (single bucket)
> >> >> >>       like
> >> >> >>
> >> >> >>       3.1
> >> >> >>        |_____________
> >> >> >>        |   \    \    \
> >> >> >>       1.0  1.0  1.0  .1
> >> >> >>
> >> >> >>       it clearly works, but when we have a multi-level tree like
> >> >> >>
> >> >> >>
> >> >> >>       8.4
> >> >> >>        |____________________________________
> >> >> >>        |                 \                  \
> >> >> >>       3.1                3.1                2.2
> >> >> >>        |_____________     |_____________     |_____________
> >> >> >>        |   \    \    \    |   \    \    \    |   \    \    \
> >> >> >>       1.0  1.0  1.0  .1   1.0  1.0  1.0  .1  1.0  1.0 .1   .1
> >> >> >>
> >> >> >>       and the second round weights skew the small .1 leaves lower, can
> >> >> >>       we
> >> >> >>       continue to build the summed-weight hierarchy, such that the
> >> >> >>       adjusted
> >> >> >>       weights at the higher level are appropriately adjusted to give
> >> >> >>       us the
> >> >> >>       right probabilities of descending into those trees?  I'm not
> >> >> >>       sure if that
> >> >> >>       logically follows from the above or if my intuition is
> >> >> >>       oversimplifying
> >> >> >>       things.
> >> >> >>
> >> >> >>       If this *is* how we think this will shake out, then I'm
> >> >> >>       wondering if we
> >> >> >>       should go ahead and build this weight matrix into CRUSH sooner
> >> >> >>       rather
> >> >> >>       than later (i.e., for luminous).  As with the explicit
> >> >> >>       remappings, the
> >> >> >>       hard part is all done offline, and the adjustments to the CRUSH
> >> >> >>       mapping
> >> >> >>       calculation itself (storing and making use of the adjusted
> >> >> >>       weights for
> >> >> >>       each round of placement) are relatively straightforward.  And
> >> >> >>       the sooner
> >> >> >>       this is incorporated into a release the sooner real users will
> >> >> >>       be able to
> >> >> >>       roll out code to all clients and start making use of it.
> >> >> >>
> >> >> >>       Thanks again for looking at this problem!  I'm excited that we
> >> >> >>       may be
> >> >> >>       closing in on a real solution!
> >> >> >>
> >> >> >>       sage
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>       On Thu, 23 Mar 2017, Pedro López-Adeva wrote:
> >> >> >>
> >> >> >>       > There are lot of gradient-free methods. I will try first to
> >> >> >>       run the
> >> >> >>       > ones available using just scipy
> >> >> >>       >
> >> >> >>
> >> >> >> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> >> >> >>       > Some of them don't require the gradient and some of them can
> >> >> >>       estimate
> >> >> >>       > it. The reason to go without the gradient is to run the CRUSH
> >> >> >>       > algorithm as a black box. In that case this would be the
> >> >> >>       pseudo-code:
> >> >> >>       >
> >> >> >>       > - BEGIN CODE -
> >> >> >>       > def build_target(desired_freqs):
> >> >> >>       >     def target(weights):
> >> >> >>       >         # run a simulation of CRUSH for a number of objects
> >> >> >>       >         sim_freqs = run_crush(weights)
> >> >> >>       >         # Kullback-Leibler divergence between desired
> >> >> >>       frequencies and
> >> >> >>       > current ones
> >> >> >>       >         return loss(sim_freqs, desired_freqs)
> >> >> >>       >    return target
> >> >> >>       >
> >> >> >>       > weights = scipy.optimize.minimize(build_target(desired_freqs))
> >> >> >>       > - END CODE -
> >> >> >>       >
> >> >> >>       > The tricky thing here is that this procedure can be slow if
> >> >> >>       the
> >> >> >>       > simulation (run_crush) needs to place a lot of objects to get
> >> >> >>       accurate
> >> >> >>       > simulated frequencies. This is true specially if the minimize
> >> >> >>       method
> >> >> >>       > attempts to approximate the gradient using finite differences
> >> >> >>       since it
> >> >> >>       > will evaluate the target function a number of times
> >> >> >>       proportional to
> >> >> >>       > the number of weights). Apart from the ones in scipy I would
> >> >> >>       try also
> >> >> >>       > optimization methods that try to perform as few evaluations as
> >> >> >>       > possible like for example HyperOpt
> >> >> >>       > (http://hyperopt.github.io/hyperopt/), which by the way takes
> >> >> >>       into
> >> >> >>       > account that the target function can be noisy.
> >> >> >>       >
> >> >> >>       > This black box approximation is simple to implement and makes
> >> >> >>       the
> >> >> >>       > computer do all the work instead of us.
> >> >> >>       > I think that this black box approximation is worthy to try
> >> >> >>       even if
> >> >> >>       > it's not the final one because if this approximation works
> >> >> >>       then we
> >> >> >>       > know that a more elaborate one that computes the gradient of
> >> >> >>       the CRUSH
> >> >> >>       > algorithm will work for sure.
> >> >> >>       >
> >> >> >>       > I can try this black box approximation this weekend not on the
> >> >> >>       real
> >> >> >>       > CRUSH algorithm but with the simple implementation I did in
> >> >> >>       python. If
> >> >> >>       > it works it's just a matter of substituting one simulation
> >> >> >>       with
> >> >> >>       > another and see what happens.
> >> >> >>       >
> >> >> >>       > 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >> >>       > > Hi Pedro,
> >> >> >>       > >
> >> >> >>       > > On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
> >> >> >>       > >> Hi Loic,
> >> >> >>       > >>
> >> >> >>       > >>>From what I see everything seems OK.
> >> >> >>       > >
> >> >> >>       > > Cool. I'll keep going in this direction then !
> >> >> >>       > >
> >> >> >>       > >> The interesting thing would be to
> >> >> >>       > >> test on some complex mapping. The reason is that
> >> >> >>       "CrushPolicyFamily"
> >> >> >>       > >> is right now modeling just a single straw bucket not the
> >> >> >>       full CRUSH
> >> >> >>       > >> algorithm.
> >> >> >>       > >
> >> >> >>       > > A number of use cases use a single straw bucket, maybe the
> >> >> >>       majority of them. Even though it does not reflect the full range
> >> >> >>       of what crush can offer, it could be useful. To be more
> >> >> >>       specific, a crush map that states "place objects so that there
> >> >> >>       is at most one replica per host" or "one replica per rack" is
> >> >> >>       common. Such a crushmap can be reduced to a single straw bucket
> >> >> >>       that contains all the hosts and by using the CrushPolicyFamily,
> >> >> >>       we can change the weights of each host to fix the probabilities.
> >> >> >>       The hosts themselves contain disks with varying weights but I
> >> >> >>       think we can ignore that because crush will only recurse to
> >> >> >>       place one object within a given host.
> >> >> >>       > >
> >> >> >>       > >> That's the work that remains to be done. The only way that
> >> >> >>       > >> would avoid reimplementing the CRUSH algorithm and
> >> >> >>       computing the
> >> >> >>       > >> gradient would be treating CRUSH as a black box and
> >> >> >>       eliminating the
> >> >> >>       > >> necessity of computing the gradient either by using a
> >> >> >>       gradient-free
> >> >> >>       > >> optimization method or making an estimation of the
> >> >> >>       gradient.
> >> >> >>       > >
> >> >> >>       > > By gradient-free optimization you mean simulated annealing
> >> >> >>       or Monte Carlo ?
> >> >> >>       > >
> >> >> >>       > > Cheers
> >> >> >>       > >
> >> >> >>       > >>
> >> >> >>       > >>
> >> >> >>       > >> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
> >> >> >>       > >>> Hi,
> >> >> >>       > >>>
> >> >> >>       > >>> I modified the crush library to accept two weights (one
> >> >> >>       for the first disk, the other for the remaining disks)[1]. This
> >> >> >>       really is a hack for experimentation purposes only ;-) I was
> >> >> >>       able to run a variation of your code[2] and got the following
> >> >> >>       results which are encouraging. Do you think what I did is
> >> >> >>       sensible ? Or is there a problem I don't see ?
> >> >> >>       > >>>
> >> >> >>       > >>> Thanks !
> >> >> >>       > >>>
> >> >> >>       > >>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8
> >> >> >>       6]
> >> >> >>       > >>>
> >> >> >>
> >> >> >> ------------------------------------------------------------------------
> >> >> >>       > >>> Before: All replicas on each hard drive
> >> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >> >>       > >>>  disk 0: 1.39e-01 1.12e-01
> >> >> >>       > >>>  disk 1: 1.11e-01 1.10e-01
> >> >> >>       > >>>  disk 2: 8.33e-02 1.13e-01
> >> >> >>       > >>>  disk 3: 1.39e-01 1.11e-01
> >> >> >>       > >>>  disk 4: 1.11e-01 1.11e-01
> >> >> >>       > >>>  disk 5: 8.33e-02 1.11e-01
> >> >> >>       > >>>  disk 6: 1.39e-01 1.12e-01
> >> >> >>       > >>>  disk 7: 1.11e-01 1.12e-01
> >> >> >>       > >>>  disk 8: 8.33e-02 1.10e-01
> >> >> >>       > >>> it=    1 jac norm=1.59e-01 loss=5.27e-03
> >> >> >>       > >>> it=    2 jac norm=1.55e-01 loss=5.03e-03
> >> >> >>       > >>> ...
> >> >> >>       > >>> it=  212 jac norm=1.02e-03 loss=2.41e-07
> >> >> >>       > >>> it=  213 jac norm=1.00e-03 loss=2.31e-07
> >> >> >>       > >>> Converged to desired accuracy :)
> >> >> >>       > >>> After: All replicas on each hard drive
> >> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >> >>       > >>>  disk 0: 1.39e-01 1.42e-01
> >> >> >>       > >>>  disk 1: 1.11e-01 1.09e-01
> >> >> >>       > >>>  disk 2: 8.33e-02 8.37e-02
> >> >> >>       > >>>  disk 3: 1.39e-01 1.40e-01
> >> >> >>       > >>>  disk 4: 1.11e-01 1.13e-01
> >> >> >>       > >>>  disk 5: 8.33e-02 8.08e-02
> >> >> >>       > >>>  disk 6: 1.39e-01 1.38e-01
> >> >> >>       > >>>  disk 7: 1.11e-01 1.09e-01
> >> >> >>       > >>>  disk 8: 8.33e-02 8.48e-02
> >> >> >>       > >>>
> >> >> >>       > >>>
> >> >> >>       > >>> Simulation: R=2 devices capacity [10 10 10 10  1]
> >> >> >>       > >>>
> >> >> >>
> >> >> >> ------------------------------------------------------------------------
> >> >> >>       > >>> Before: All replicas on each hard drive
> >> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >> >>       > >>>  disk 0: 2.44e-01 2.36e-01
> >> >> >>       > >>>  disk 1: 2.44e-01 2.38e-01
> >> >> >>       > >>>  disk 2: 2.44e-01 2.34e-01
> >> >> >>       > >>>  disk 3: 2.44e-01 2.38e-01
> >> >> >>       > >>>  disk 4: 2.44e-02 5.37e-02
> >> >> >>       > >>> it=    1 jac norm=2.43e-01 loss=2.98e-03
> >> >> >>       > >>> it=    2 jac norm=2.28e-01 loss=2.47e-03
> >> >> >>       > >>> ...
> >> >> >>       > >>> it=   37 jac norm=1.28e-03 loss=3.48e-08
> >> >> >>       > >>> it=   38 jac norm=1.07e-03 loss=2.42e-08
> >> >> >>       > >>> Converged to desired accuracy :)
> >> >> >>       > >>> After: All replicas on each hard drive
> >> >> >>       > >>> Expected vs actual use (20000 samples)
> >> >> >>       > >>>  disk 0: 2.44e-01 2.46e-01
> >> >> >>       > >>>  disk 1: 2.44e-01 2.44e-01
> >> >> >>       > >>>  disk 2: 2.44e-01 2.41e-01
> >> >> >>       > >>>  disk 3: 2.44e-01 2.45e-01
> >> >> >>       > >>>  disk 4: 2.44e-02 2.33e-02
> >> >> >>       > >>>
> >> >> >>       > >>>
> >> >> >>       > >>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
> >> >> >>       > >>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
> >> >> >>       > >>>
> >> >> >>       > >>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
> >> >> >>       > >>>> Hi Pedro,
> >> >> >>       > >>>>
> >> >> >>       > >>>> It looks like trying to experiment with crush won't work
> >> >> >>       as expected because crush does not distinguish the probability
> >> >> >>       of selecting the first device from the probability of selecting
> >> >> >>       the second or third device. Am I mistaken ?
> >> >> >>       > >>>>
> >> >> >>       > >>>> Cheers
> >> >> >>       > >>>>
> >> >> >>       > >>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
> >> >> >>       > >>>>> Hi Pedro,
> >> >> >>       > >>>>>
> >> >> >>       > >>>>> I'm going to experiment with what you did at
> >> >> >>       > >>>>>
> >> >> >>       > >>>>>
> >> >> >>       https://github.com/plafl/notebooks/blob/master/replication.ipynb
> >> >> >>       > >>>>>
> >> >> >>       > >>>>> and the latest python-crush published today. A
> >> >> >>       comparison function was added that will help measure the data
> >> >> >>       movement. I'm hoping we can release an offline tool based on
> >> >> >>       your solution. Please let me know if I should wait before diving
> >> >> >>       into this, in case you have unpublished drafts or new ideas.
> >> >> >>       > >>>>>
> >> >> >>       > >>>>> Cheers
> >> >> >>       > >>>>>
> >> >> >>       > >>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
> >> >> >>       > >>>>>> Great, thanks for the clarifications.
> >> >> >>       > >>>>>> I also think that the most natural way is to keep just
> >> >> >>       a set of
> >> >> >>       > >>>>>> weights in the CRUSH map and update them inside the
> >> >> >>       algorithm.
> >> >> >>       > >>>>>>
> >> >> >>       > >>>>>> I keep working on it.
> >> >> >>       > >>>>>>
> >> >> >>       > >>>>>>
> >> >> >>       > >>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil
> >> >> >>       <sage@newdream.net>:
> >> >> >>       > >>>>>>> Hi Pedro,
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> Thanks for taking a look at this!  It's a frustrating
> >> >> >>       problem and we
> >> >> >>       > >>>>>>> haven't made much headway.
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
> >> >> >>       > >>>>>>>> Hi,
> >> >> >>       > >>>>>>>>
> >> >> >>       > >>>>>>>> I will have a look. BTW, I have not progressed that
> >> >> >>       much but I have
> >> >> >>       > >>>>>>>> been thinking about it. In order to adapt the
> >> >> >>       previous algorithm in
> >> >> >>       > >>>>>>>> the python notebook I need to substitute the
> >> >> >>       iteration over all
> >> >> >>       > >>>>>>>> possible devices permutations to iteration over all
> >> >> >>       the possible
> >> >> >>       > >>>>>>>> selections that crush would make. That is the main
> >> >> >>       thing I need to
> >> >> >>       > >>>>>>>> work on.
> >> >> >>       > >>>>>>>>
> >> >> >>       > >>>>>>>> The other thing is of course that weights change for
> >> >> >>       each replica.
> >> >> >>       > >>>>>>>> That is, they cannot be really fixed in the crush
> >> >> >>       map. So the
> >> >> >>       > >>>>>>>> algorithm inside libcrush, not only the weights in
> >> >> >>       the map, need to be
> >> >> >>       > >>>>>>>> changed. The weights in the crush map should reflect
> >> >> >>       then, maybe, the
> >> >> >>       > >>>>>>>> desired usage frequencies. Or maybe each replica
> >> >> >>       should have their own
> >> >> >>       > >>>>>>>> crush map, but then the information about the
> >> >> >>       previous selection
> >> >> >>       > >>>>>>>> should be passed to the next replica placement run so
> >> >> >>       it avoids
> >> >> >>       > >>>>>>>> selecting the same one again.
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> My suspicion is that the best solution here (whatever
> >> >> >>       that means!)
> >> >> >>       > >>>>>>> leaves the CRUSH weights intact with the desired
> >> >> >>       distribution, and
> >> >> >>       > >>>>>>> then generates a set of derivative weights--probably
> >> >> >>       one set for each
> >> >> >>       > >>>>>>> round/replica/rank.
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> One nice property of this is that once the support is
> >> >> >>       added to encode
> >> >> >>       > >>>>>>> multiple sets of weights, the algorithm used to
> >> >> >>       generate them is free to
> >> >> >>       > >>>>>>> change and evolve independently.  (In most cases any
> >> >> >>       change in
> >> >> >>       > >>>>>>> CRUSH's mapping behavior is difficult to roll out
> >> >> >>       because all
> >> >> >>       > >>>>>>> parties participating in the cluster have to support
> >> >> >>       any new behavior
> >> >> >>       > >>>>>>> before it is enabled or used.)
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>>> I have a question also. Is there any significant
> >> >> >>       difference between
> >> >> >>       > >>>>>>>> the device selection algorithm description in the
> >> >> >>       paper and its final
> >> >> >>       > >>>>>>>> implementation?
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> The main difference is the "retry_bucket" behavior was
> >> >> >>       found to be a bad
> >> >> >>       > >>>>>>> idea; any collision or failed()/overload() case
> >> >> >>       triggers the
> >> >> >>       > >>>>>>> retry_descent.
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> There are other changes, of course, but I don't think
> >> >> >>       they'll impact any
> >> >> >>       > >>>>>>> solution we come with here (or at least any solution
> >> >> >>       can be suitably
> >> >> >>       > >>>>>>> adapted)!
> >> >> >>       > >>>>>>>
> >> >> >>       > >>>>>>> sage
> >> >> >>       > >>>>>> --
> >> >> >>       > >>>>>> To unsubscribe from this list: send the line
> >> >> >>       "unsubscribe ceph-devel" in
> >> >> >>       > >>>>>> the body of a message to majordomo@vger.kernel.org
> >> >> >>       > >>>>>> More majordomo info at
> >> >> >>       http://vger.kernel.org/majordomo-info.html
> >> >> >>       > >>>>>>
> >> >> >>       > >>>>>
> >> >> >>       > >>>>
> >> >> >>       > >>>
> >> >> >>       > >>> --
> >> >> >>       > >>> Loïc Dachary, Artisan Logiciel Libre
> >> >> >>       > >> --
> >> >> >>       > >> To unsubscribe from this list: send the line "unsubscribe
> >> >> >>       ceph-devel" in
> >> >> >>       > >> the body of a message to majordomo@vger.kernel.org
> >> >> >>       > >> More majordomo info at
> >> >> >>       http://vger.kernel.org/majordomo-info.html
> >> >> >>       > >>
> >> >> >>       > >
> >> >> >>       > > --
> >> >> >>       > > Loïc Dachary, Artisan Logiciel Libre
> >> >> >>       > --
> >> >> >>       > To unsubscribe from this list: send the line "unsubscribe
> >> >> >>       ceph-devel" in
> >> >> >>       > the body of a message to majordomo@vger.kernel.org
> >> >> >>       > More majordomo info at
> >> >> >>       http://vger.kernel.org/majordomo-info.html
> >> >> >>       >
> >> >> >>       >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >>
> >> >> --
> >> >> Loïc Dachary, Artisan Logiciel Libre
> >> >
> >> >
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-23 15:32                                       ` Pedro López-Adeva
  2017-03-23 16:18                                         ` Loic Dachary
  2017-03-25 18:42                                         ` Sage Weil
@ 2017-04-11 15:22                                         ` Loic Dachary
  2017-04-22 16:51                                         ` Loic Dachary
  3 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-04-11 15:22 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

A short update to let you know that the change to crush allowing multiple weights per item is well under way[1]. It should be merged next week and will make it possible to effectively use your optimization. A new version of Ceph is going to be published in the next few weeks and will also contain these modifications.
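
For anyone following along, here is a rough illustration (in the spirit of the
earlier pseudo-code, not the actual libcrush API or data structures) of how a
per-position weight set like the weight_by_position map discussed earlier in
this thread could be consumed at placement time:

- BEGIN CODE -
# Toy illustration only, not libcrush.  weight_by_position is assumed to
# be a list of {item: weight} dicts, one per replica position, as
# proposed earlier in the thread.
import random

def pick(weights):
    # weighted draw over item ids, proportional to the given weights
    items = list(weights)
    return random.choices(items, weights=[weights[i] for i in items])[0]

def place(weight_by_position, num_rep):
    chosen = []
    for pos in range(num_rep):
        weights = dict(weight_by_position[pos])
        for item in chosen:
            weights.pop(item, None)   # never pick the same item twice
        chosen.append(pick(weights))
    return chosen

# e.g. place([{0: 1.0, 1: 1.0, 2: 0.1}, {0: 1.0, 1: 1.0, 2: 0.05}], 2)
- END CODE -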

Cheers

[1] http://libcrush.org/main/libcrush/commit/49b6043d6b85197a49e70cbbcfe411d92983f501

On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
> There are lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
> 
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies and
> current ones
>         return loss(sim_freqs, desired_freqs)
>    return target
> 
> weights = scipy.optimize.minimize(build_target(desired_freqs))
> - END CODE -
> 
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is true specially if the minimize method
> attempts to approximate the gradient using finite differences since it
> will evaluate the target function a number of times proportional to
> the number of weights). Apart from the ones in scipy I would try also
> optimization methods that try to perform as few evaluations as
> possible like for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
> 
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worthy to try even if
> it's not the final one because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
> 
> I can try this black box approximation this weekend not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation with
> another and see what happens.
> 
> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> Hi Loic,
>>>
>>> From what I see everything seems OK.
>>
>> Cool. I'll keep going in this direction then !
>>
>>> The interesting thing would be to
>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>> is right now modeling just a single straw bucket not the full CRUSH
>>> algorithm.
>>
>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>
>>> That's the work that remains to be done. The only way that
>>> would avoid reimplementing the CRUSH algorithm and computing the
>>> gradient would be treating CRUSH as a black box and eliminating the
>>> necessity of computing the gradient either by using a gradient-free
>>> optimization method or making an estimation of the gradient.
>>
>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>
>> Cheers
>>
>>>
>>>
>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi,
>>>>
>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>
>>>> Thanks !
>>>>
>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 1.39e-01 1.12e-01
>>>>  disk 1: 1.11e-01 1.10e-01
>>>>  disk 2: 8.33e-02 1.13e-01
>>>>  disk 3: 1.39e-01 1.11e-01
>>>>  disk 4: 1.11e-01 1.11e-01
>>>>  disk 5: 8.33e-02 1.11e-01
>>>>  disk 6: 1.39e-01 1.12e-01
>>>>  disk 7: 1.11e-01 1.12e-01
>>>>  disk 8: 8.33e-02 1.10e-01
>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>> ...
>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 1.39e-01 1.42e-01
>>>>  disk 1: 1.11e-01 1.09e-01
>>>>  disk 2: 8.33e-02 8.37e-02
>>>>  disk 3: 1.39e-01 1.40e-01
>>>>  disk 4: 1.11e-01 1.13e-01
>>>>  disk 5: 8.33e-02 8.08e-02
>>>>  disk 6: 1.39e-01 1.38e-01
>>>>  disk 7: 1.11e-01 1.09e-01
>>>>  disk 8: 8.33e-02 8.48e-02
>>>>
>>>>
>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 2.44e-01 2.36e-01
>>>>  disk 1: 2.44e-01 2.38e-01
>>>>  disk 2: 2.44e-01 2.34e-01
>>>>  disk 3: 2.44e-01 2.38e-01
>>>>  disk 4: 2.44e-02 5.37e-02
>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>> ...
>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 2.44e-01 2.46e-01
>>>>  disk 1: 2.44e-01 2.44e-01
>>>>  disk 2: 2.44e-01 2.41e-01
>>>>  disk 3: 2.44e-01 2.45e-01
>>>>  disk 4: 2.44e-02 2.33e-02
>>>>
>>>>
>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>
>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>> Hi Pedro,
>>>>>
>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I'm going to experiment with what you did at
>>>>>>
>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>
>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>> Great, thanks for the clarifications.
>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>
>>>>>>> I keep working on it.
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>> haven't made much headway.
>>>>>>>>
>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>> work on.
>>>>>>>>>
>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>> selecting the same one again.
>>>>>>>>
>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>> round/replica/rank.
>>>>>>>>
>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>> change and evolve independently.  (In most cases any change in
>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>> before it is enabled or used.)
>>>>>>>>
>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>> implementation?
>>>>>>>>
>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>> retry_descent.
>>>>>>>>
>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>> solution we come with here (or at least any solution can be suitably
>>>>>>>> adapted)!
>>>>>>>>
>>>>>>>> sage
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-03-23 15:32                                       ` Pedro López-Adeva
                                                           ` (2 preceding siblings ...)
  2017-04-11 15:22                                         ` Loic Dachary
@ 2017-04-22 16:51                                         ` Loic Dachary
  2017-04-25 15:04                                           ` Pedro López-Adeva
  3 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-22 16:51 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: ceph-devel

Hi Pedro,

I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell, I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the absolute differences seems simpler and would probably give similar results.

I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...

Before optimization the situation is:

         ~expected~  ~objects~  ~delta~   ~delta%~
~name~                                            
dc1            1024       1024        0   0.000000
host0           256        294       38  14.843750
device0         128        153       25  19.531250
device1         128        141       13  10.156250
host1           256        301       45  17.578125
device2         128        157       29  22.656250
device3         128        144       16  12.500000
host2           512        429      -83 -16.210938
device4         128         96      -32 -25.000000
device5         128        117      -11  -8.593750
device6         256        216      -40 -15.625000

and after optimization we have the following:

         ~expected~  ~objects~  ~delta~  ~delta%~
~name~                                           
dc1            1024       1024        0  0.000000
host0           256        259        3  1.171875
device0         128        129        1  0.781250
device1         128        130        2  1.562500
host1           256        258        2  0.781250
device2         128        129        1  0.781250
device3         128        129        1  0.781250
host2           512        507       -5 -0.976562
device4         128        126       -2 -1.562500
device5         128        127       -1 -0.781250
device6         256        254       -2 -0.781250

Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.

Cheers

[1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
[2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
[3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
[4] https://github.com/ceph/ceph/pull/14486

On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
> There are a lot of gradient-free methods. I will try first to run the
> ones available using just scipy
> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
> Some of them don't require the gradient and some of them can estimate
> it. The reason to go without the gradient is to run the CRUSH
> algorithm as a black box. In that case this would be the pseudo-code:
> 
> - BEGIN CODE -
> def build_target(desired_freqs):
>     def target(weights):
>         # run a simulation of CRUSH for a number of objects
>         sim_freqs = run_crush(weights)
>         # Kullback-Leibler divergence between desired frequencies and
> current ones
>         return loss(sim_freqs, desired_freqs)
>    return target
> 
> weights = scipy.optimize.minimize(build_target(desired_freqs))
> - END CODE -
> 
> The tricky thing here is that this procedure can be slow if the
> simulation (run_crush) needs to place a lot of objects to get accurate
> simulated frequencies. This is true especially if the minimize method
> attempts to approximate the gradient using finite differences, since it
> will evaluate the target function a number of times proportional to
> the number of weights. Apart from the ones in scipy I would also try
> optimization methods that try to perform as few evaluations as
> possible like for example HyperOpt
> (http://hyperopt.github.io/hyperopt/), which by the way takes into
> account that the target function can be noisy.
> 
> This black box approximation is simple to implement and makes the
> computer do all the work instead of us.
> I think that this black box approximation is worthy to try even if
> it's not the final one because if this approximation works then we
> know that a more elaborate one that computes the gradient of the CRUSH
> algorithm will work for sure.
> 
> I can try this black box approximation this weekend not on the real
> CRUSH algorithm but with the simple implementation I did in python. If
> it works it's just a matter of substituting one simulation with
> another and see what happens.
> 
> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>> Hi Loic,
>>>
>>> From what I see everything seems OK.
>>
>> Cool. I'll keep going in this direction then !
>>
>>> The interesting thing would be to
>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>> is right now modeling just a single straw bucket not the full CRUSH
>>> algorithm.
>>
>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>
>>> That's the work that remains to be done. The only way that
>>> would avoid reimplementing the CRUSH algorithm and computing the
>>> gradient would be treating CRUSH as a black box and eliminating the
>>> necessity of computing the gradient either by using a gradient-free
>>> optimization method or making an estimation of the gradient.
>>
>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>
>> Cheers
>>
>>>
>>>
>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi,
>>>>
>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>
>>>> Thanks !
>>>>
>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 1.39e-01 1.12e-01
>>>>  disk 1: 1.11e-01 1.10e-01
>>>>  disk 2: 8.33e-02 1.13e-01
>>>>  disk 3: 1.39e-01 1.11e-01
>>>>  disk 4: 1.11e-01 1.11e-01
>>>>  disk 5: 8.33e-02 1.11e-01
>>>>  disk 6: 1.39e-01 1.12e-01
>>>>  disk 7: 1.11e-01 1.12e-01
>>>>  disk 8: 8.33e-02 1.10e-01
>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>> ...
>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 1.39e-01 1.42e-01
>>>>  disk 1: 1.11e-01 1.09e-01
>>>>  disk 2: 8.33e-02 8.37e-02
>>>>  disk 3: 1.39e-01 1.40e-01
>>>>  disk 4: 1.11e-01 1.13e-01
>>>>  disk 5: 8.33e-02 8.08e-02
>>>>  disk 6: 1.39e-01 1.38e-01
>>>>  disk 7: 1.11e-01 1.09e-01
>>>>  disk 8: 8.33e-02 8.48e-02
>>>>
>>>>
>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>> ------------------------------------------------------------------------
>>>> Before: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 2.44e-01 2.36e-01
>>>>  disk 1: 2.44e-01 2.38e-01
>>>>  disk 2: 2.44e-01 2.34e-01
>>>>  disk 3: 2.44e-01 2.38e-01
>>>>  disk 4: 2.44e-02 5.37e-02
>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>> ...
>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>> Converged to desired accuracy :)
>>>> After: All replicas on each hard drive
>>>> Expected vs actual use (20000 samples)
>>>>  disk 0: 2.44e-01 2.46e-01
>>>>  disk 1: 2.44e-01 2.44e-01
>>>>  disk 2: 2.44e-01 2.41e-01
>>>>  disk 3: 2.44e-01 2.45e-01
>>>>  disk 4: 2.44e-02 2.33e-02
>>>>
>>>>
>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>
>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>> Hi Pedro,
>>>>>
>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I'm going to experiment with what you did at
>>>>>>
>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>
>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>> Great, thanks for the clarifications.
>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>
>>>>>>> I keep working on it.
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>> haven't made much headway.
>>>>>>>>
>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>> work on.
>>>>>>>>>
>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>> selecting the same one again.
>>>>>>>>
>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>> round/replica/rank.
>>>>>>>>
>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>> change and evolve independently.  (In most cases any change to
>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>> before it is enabled or used.)
>>>>>>>>
>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>> implementation?
>>>>>>>>
>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>> retry_descent.
>>>>>>>>
>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>> adapted)!
>>>>>>>>
>>>>>>>> sage
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-22 16:51                                         ` Loic Dachary
@ 2017-04-25 15:04                                           ` Pedro López-Adeva
  2017-04-25 17:46                                             ` Loic Dachary
  2017-04-26 21:08                                             ` Loic Dachary
  0 siblings, 2 replies; 70+ messages in thread
From: Pedro López-Adeva @ 2017-04-25 15:04 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic,

Well, the results are better certainly! Some comments:

- I'm glad Nelder-Mead worked. It's not the one I would have chosen,
but I'm not an expert in optimization either. I wonder how it will
scale with more weights[1]. My attempt at using scipy's optimize
didn't work because you are optimizing a stochastic function and this
can make scipy decide that no further steps are possible. The field
that studies this kind of problem is stochastic optimization [2].

- I used KL divergence for the loss function. My first attempt was
using, as you did, the standard deviation (more commonly known as L2
loss) with gradient descent, but it didn't work very well (see the
sketch comparing the losses after this list).

- Sum of differences sounds like a bad idea, +100 and -100 errors will
cancel out. Worse still -100 and -100 will be better than 0 and 0.
Maybe you were talking about the absolute value of the differences?

- Well, now that CRUSH can use multiple weights, the problem that
remains, I think, is seeing whether the optimization is: a) reliable
and b) fast enough.
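
To make the loss comparison above concrete, this is the kind of thing I
mean (just a sketch; the arguments are per-device object counts):

import numpy as np

def kl_loss(sim_freqs, desired_freqs, eps=1e-12):
    # Kullback-Leibler divergence D(desired || simulated); both inputs
    # are normalized to sum to 1 before comparing
    p = np.asarray(desired_freqs, dtype=float)
    q = np.asarray(sim_freqs, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def l2_loss(sim_freqs, desired_freqs):
    # the standard-deviation style loss you used
    diff = np.asarray(sim_freqs, float) - np.asarray(desired_freqs, float)
    return float(np.std(diff))

def bad_sum_loss(sim_freqs, desired_freqs):
    # what I meant above: signed errors cancel out, so this is not a
    # usable loss (-100 and +100 look perfect, -100 and -100 look best)
    diff = np.asarray(sim_freqs, float) - np.asarray(desired_freqs, float)
    return float(np.sum(diff))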

Cheers,
Pedro.

[1] http://www.benfrederickson.com/numerical-optimization/
[2] https://en.wikipedia.org/wiki/Stochastic_optimization

2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
> Hi Pedro,
>
> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the differences between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>
> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>
> Before optimization the situation is:
>
>          ~expected~  ~objects~  ~delta~   ~delta%~
> ~name~
> dc1            1024       1024        0   0.000000
> host0           256        294       38  14.843750
> device0         128        153       25  19.531250
> device1         128        141       13  10.156250
> host1           256        301       45  17.578125
> device2         128        157       29  22.656250
> device3         128        144       16  12.500000
> host2           512        429      -83 -16.210938
> device4         128         96      -32 -25.000000
> device5         128        117      -11  -8.593750
> device6         256        216      -40 -15.625000
>
> and after optimization we have the following:
>
>          ~expected~  ~objects~  ~delta~  ~delta%~
> ~name~
> dc1            1024       1024        0  0.000000
> host0           256        259        3  1.171875
> device0         128        129        1  0.781250
> device1         128        130        2  1.562500
> host1           256        258        2  0.781250
> device2         128        129        1  0.781250
> device3         128        129        1  0.781250
> host2           512        507       -5 -0.976562
> device4         128        126       -2 -1.562500
> device5         128        127       -1 -0.781250
> device6         256        254       -2 -0.781250
>
> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>
> Cheers
>
> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
> [4] https://github.com/ceph/ceph/pull/14486
>
> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>> There are a lot of gradient-free methods. I will try first to run the
>> ones available using just scipy
>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>> Some of them don't require the gradient and some of them can estimate
>> it. The reason to go without the gradient is to run the CRUSH
>> algorithm as a black box. In that case this would be the pseudo-code:
>>
>> - BEGIN CODE -
>> def build_target(desired_freqs):
>>     def target(weights):
>>         # run a simulation of CRUSH for a number of objects
>>         sim_freqs = run_crush(weights)
>>         # Kullback-Leibler divergence between desired frequencies and
>> current ones
>>         return loss(sim_freqs, desired_freqs)
>>    return target
>>
>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>> - END CODE -
>>
>> The tricky thing here is that this procedure can be slow if the
>> simulation (run_crush) needs to place a lot of objects to get accurate
>> simulated frequencies. This is true especially if the minimize method
>> attempts to approximate the gradient using finite differences, since it
>> will evaluate the target function a number of times proportional to
>> the number of weights. Apart from the ones in scipy I would also try
>> optimization methods that try to perform as few evaluations as
>> possible like for example HyperOpt
>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>> account that the target function can be noisy.
>>
>> This black box approximation is simple to implement and makes the
>> computer do all the work instead of us.
>> I think that this black box approximation is worthy to try even if
>> it's not the final one because if this approximation works then we
>> know that a more elaborate one that computes the gradient of the CRUSH
>> algorithm will work for sure.
>>
>> I can try this black box approximation this weekend not on the real
>> CRUSH algorithm but with the simple implementation I did in python. If
>> it works it's just a matter of substituting one simulation with
>> another and see what happens.
>>
>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>> Hi Pedro,
>>>
>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>> Hi Loic,
>>>>
>>>> From what I see everything seems OK.
>>>
>>> Cool. I'll keep going in this direction then !
>>>
>>>> The interesting thing would be to
>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>> algorithm.
>>>
>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>
>>>> That's the work that remains to be done. The only way that
>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>> necessity of computing the gradient either by using a gradient-free
>>>> optimization method or making an estimation of the gradient.
>>>
>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>
>>> Cheers
>>>
>>>>
>>>>
>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi,
>>>>>
>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>
>>>>> Thanks !
>>>>>
>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>> ------------------------------------------------------------------------
>>>>> Before: All replicas on each hard drive
>>>>> Expected vs actual use (20000 samples)
>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>> ...
>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>> Converged to desired accuracy :)
>>>>> After: All replicas on each hard drive
>>>>> Expected vs actual use (20000 samples)
>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>
>>>>>
>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>> ------------------------------------------------------------------------
>>>>> Before: All replicas on each hard drive
>>>>> Expected vs actual use (20000 samples)
>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>> ...
>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>> Converged to desired accuracy :)
>>>>> After: All replicas on each hard drive
>>>>> Expected vs actual use (20000 samples)
>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>
>>>>>
>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>
>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> I'm going to experiment with what you did at
>>>>>>>
>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>
>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>> Great, thanks for the clarifications.
>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>
>>>>>>>> I keep working on it.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>> haven't made much headway.
>>>>>>>>>
>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>> work on.
>>>>>>>>>>
>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>> selecting the same one again.
>>>>>>>>>
>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>> round/replica/rank.
>>>>>>>>>
>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>> change and evolve independently.  (In most cases any change to
>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>> before it is enabled or used.)
>>>>>>>>>
>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>> implementation?
>>>>>>>>>
>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>> retry_descent.
>>>>>>>>>
>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>> adapted)!
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-25 15:04                                           ` Pedro López-Adeva
@ 2017-04-25 17:46                                             ` Loic Dachary
  2017-04-26 21:08                                             ` Loic Dachary
  1 sibling, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-04-25 17:46 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Ceph Development

Hi Pedro,

On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
> Hi Loic,
> 
> Well, the results are better certainly! Some comments:
> 
> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
> but I'm not an expert in optimization either. I wonder how it will
> scale with more weights[1]. My attempt at using scipy's optimize
> didn't work because you are optimizing a stochastic function and this
> can make scipy decide that no further steps are possible. 

Understood (I think). Do you have an opinion on which one of the following would be a better fit ? 

    minimize(method=’Powell’)
    minimize(method=’CG’)
    minimize(method=’BFGS’)
    minimize(method=’Newton-CG’)
    minimize(method=’L-BFGS-B’)
    minimize(method=’TNC’)
    minimize(method=’COBYLA’)
    minimize(method=’SLSQP’)
    minimize(method=’dogleg’)
    minimize(method=’trust-ncg’)
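
(What I have in mind is simply to try a few of them on the same loss and
compare; something like the sketch below, where loss and x0 are the loss
function and initial guess I already have. Just an idea, not tested:)

from scipy.optimize import minimize

def try_methods(loss, x0, methods=('Nelder-Mead', 'Powell', 'COBYLA')):
    # run the same (noisy) loss through several methods that do not
    # need an exact gradient and compare final value / evaluation count
    results = {}
    for method in methods:
        res = minimize(loss, x0, method=method)
        results[method] = (res.fun, res.nfev)
    return results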

> The field
> that studies this kind of problem is stochastic optimization [2].

Unless I'm mistaken there are no tools related to that kind of problem in scipy, right ? I'll keep using scipy anyway because, as you wrote in your previous mail, it will be helpful to know if it works or not. Even if it takes so much time that it's not practical to use, it will tell us if computing the gradient of the CRUSH algorithm is a lost cause or not :-)

> - I used KL divergence for the loss function. My first attempt was
> using, as you did, the standard deviation (more commonly known as L2
> loss) with gradient descent, but it didn't work very well.
> 
> - Sum of differences sounds like a bad idea, +100 and -100 errors will
> cancel out. Worse still -100 and -100 will be better than 0 and 0.
> Maybe you were talking about the absolute value of the differences?

I was not thinking straight, to be honest.

> - Well, now that CRUSH can use multiple weights, the problem that
> remains, I think, is seeing whether the optimization is: a) reliable
> and b) fast enough.

Yep. I'll implement something and let you know how it goes.

Cheers

> 
> Cheers,
> Pedro.
> 
> [1] http://www.benfrederickson.com/numerical-optimization/
> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
> 
> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the differences between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>
>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>
>> Before optimization the situation is:
>>
>>          ~expected~  ~objects~  ~delta~   ~delta%~
>> ~name~
>> dc1            1024       1024        0   0.000000
>> host0           256        294       38  14.843750
>> device0         128        153       25  19.531250
>> device1         128        141       13  10.156250
>> host1           256        301       45  17.578125
>> device2         128        157       29  22.656250
>> device3         128        144       16  12.500000
>> host2           512        429      -83 -16.210938
>> device4         128         96      -32 -25.000000
>> device5         128        117      -11  -8.593750
>> device6         256        216      -40 -15.625000
>>
>> and after optimization we have the following:
>>
>>          ~expected~  ~objects~  ~delta~  ~delta%~
>> ~name~
>> dc1            1024       1024        0  0.000000
>> host0           256        259        3  1.171875
>> device0         128        129        1  0.781250
>> device1         128        130        2  1.562500
>> host1           256        258        2  0.781250
>> device2         128        129        1  0.781250
>> device3         128        129        1  0.781250
>> host2           512        507       -5 -0.976562
>> device4         128        126       -2 -1.562500
>> device5         128        127       -1 -0.781250
>> device6         256        254       -2 -0.781250
>>
>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>
>> Cheers
>>
>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>> [4] https://github.com/ceph/ceph/pull/14486
>>
>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>> There are a lot of gradient-free methods. I will try first to run the
>>> ones available using just scipy
>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> Some of them don't require the gradient and some of them can estimate
>>> it. The reason to go without the gradient is to run the CRUSH
>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>
>>> - BEGIN CODE -
>>> def build_target(desired_freqs):
>>>     def target(weights):
>>>         # run a simulation of CRUSH for a number of objects
>>>         sim_freqs = run_crush(weights)
>>>         # Kullback-Leibler divergence between desired frequencies and
>>> current ones
>>>         return loss(sim_freqs, desired_freqs)
>>>    return target
>>>
>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> - END CODE -
>>>
>>> The tricky thing here is that this procedure can be slow if the
>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>> simulated frequencies. This is true especially if the minimize method
>>> attempts to approximate the gradient using finite differences, since it
>>> will evaluate the target function a number of times proportional to
>>> the number of weights. Apart from the ones in scipy I would also try
>>> optimization methods that try to perform as few evaluations as
>>> possible like for example HyperOpt
>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>> account that the target function can be noisy.
>>>
>>> This black box approximation is simple to implement and makes the
>>> computer do all the work instead of us.
>>> I think that this black box approximation is worthy to try even if
>>> it's not the final one because if this approximation works then we
>>> know that a more elaborate one that computes the gradient of the CRUSH
>>> algorithm will work for sure.
>>>
>>> I can try this black box approximation this weekend not on the real
>>> CRUSH algorithm but with the simple implementation I did in python. If
>>> it works it's just a matter of substituting one simulation with
>>> another and see what happens.
>>>
>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>> Hi Loic,
>>>>>
>>>>> From what I see everything seems OK.
>>>>
>>>> Cool. I'll keep going in this direction then !
>>>>
>>>>> The interesting thing would be to
>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>> algorithm.
>>>>
>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>
>>>>> That's the work that remains to be done. The only way that
>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>> necessity of computing the gradient either by using a gradient-free
>>>>> optimization method or making an estimation of the gradient.
>>>>
>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>
>>>> Cheers
>>>>
>>>>>
>>>>>
>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi,
>>>>>>
>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>> ...
>>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>>
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>> ...
>>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>>
>>>>>>
>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>
>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>
>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>
>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>
>>>>>>>>> I keep working on it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>>> haven't made much headway.
>>>>>>>>>>
>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>> work on.
>>>>>>>>>>>
>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>
>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>> round/replica/rank.
>>>>>>>>>>
>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>> change and evolve independently.  (In most cases any change to
>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>
>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>> implementation?
>>>>>>>>>>
>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>> retry_descent.
>>>>>>>>>>
>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>> adapted)!
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-25 15:04                                           ` Pedro López-Adeva
  2017-04-25 17:46                                             ` Loic Dachary
@ 2017-04-26 21:08                                             ` Loic Dachary
  2017-04-26 22:25                                               ` Loic Dachary
  1 sibling, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-26 21:08 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Ceph Development



On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
> Hi Loic,
> 
> Well, the results are better certainly! Some comments:
> 
> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
> but I'm not an expert in optimization either. I wonder how it will
> scale with more weights[1]. My attempt at using scipy's optimize
> didn't work because you are optimizing a stochastic function and this
> can make scipy decide that no further steps are possible. The field
> that studies this kind of problem is stochastic optimization [2].

You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution caused by the low number of samples being placed. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
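
As a reference point for how much deviation pure sampling noise already
produces at this scale, I find it useful to compare against an ideal
random placement with the same weights and the same number of samples
(a quick sketch, using the same 12 device / 2560 sample shape as the
example below):

import numpy as np

weights = np.array([1, 2, 3] * 4, dtype=float)   # device12..device23
expected = weights / weights.sum() * 2560

rng = np.random.RandomState(42)
placed = rng.choice(len(weights), size=2560, p=weights / weights.sum())
counts = np.bincount(placed, minlength=len(weights))
print(np.round((counts - expected) / expected * 100, 2))
# deviations of several percent on the weight 1 devices are expected
# even with an ideal placement function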

Even in a case as simple as 12 devices starting with:

             ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
host1      2560.000000      2580  20.000000   0.781250        24
device12    106.666667       101  -5.666667  -5.312500         1
device13    213.333333       221   7.666667   3.593750         2
device14    320.000000       317  -3.000000  -0.937500         3
device15    106.666667       101  -5.666667  -5.312500         1
device16    213.333333       217   3.666667   1.718750         2
device17    320.000000       342  22.000000   6.875000         3
device18    106.666667       102  -4.666667  -4.375000         1
device19    213.333333       243  29.666667  13.906250         2
device20    320.000000       313  -7.000000  -2.187500         3
device21    106.666667        94 -12.666667 -11.875000         1
device22    213.333333       208  -5.333333  -2.500000         2
device23    320.000000       321   1.000000   0.312500         3

            res = minimize(crush, weights, method='nelder-mead',
                           options={'xtol': 1e-8, 'disp': True})

device weights [ 1.  3.  3.  2.  3.  2.  2.  1.  3.  1.  1.  2.]
device kl 0.00117274995028
...
device kl 0.00016530695476
Optimization terminated successfully.
         Current function value: 0.000165
         Iterations: 117
         Function evaluations: 470
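
(For context, the crush() objective being minimized above is roughly of
the shape below, a simplified sketch rather than the exact code; the
run_crush argument stands in for the crush simulation and the "device kl"
lines are the printed loss values:)

import numpy as np

def make_crush_objective(target_weights, run_crush, num_objects=2560):
    expected = np.asarray(target_weights, dtype=float)
    expected = expected / expected.sum()
    def crush(weights):
        counts = np.asarray(run_crush(weights, num_objects), dtype=float)
        actual = counts / counts.sum()
        # Kullback-Leibler divergence between the expected and the
        # simulated per-device frequencies
        mask = expected > 0
        kl = float(np.sum(expected[mask] *
                          np.log(expected[mask] / (actual[mask] + 1e-12))))
        print("device kl", kl)
        return kl
    return crush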

we still get a 5% difference on device 21:

             ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
host1      2560.000000      2559 -1.000000 -0.039062  23.805183
device12    106.666667       103 -3.666667 -3.437500   1.016999
device13    213.333333       214  0.666667  0.312500   1.949328
device14    320.000000       325  5.000000  1.562500   3.008688
device15    106.666667       106 -0.666667 -0.625000   1.012565
device16    213.333333       214  0.666667  0.312500   1.976344
device17    320.000000       320  0.000000  0.000000   2.845135
device18    106.666667       102 -4.666667 -4.375000   1.039181
device19    213.333333       214  0.666667  0.312500   1.820435
device20    320.000000       324  4.000000  1.250000   3.062573
device21    106.666667       101 -5.666667 -5.312500   1.071341
device22    213.333333       212 -1.333333 -0.625000   2.039190
device23    320.000000       324  4.000000  1.250000   3.016468

 
> - I used KL divergence for the loss function. My first attempt was
> using, as you did, the standard deviation (more commonly known as L2
> loss) with gradient descent, but it didn't work very well.
> 
> - Sum of differences sounds like a bad idea, +100 and -100 errors will
> cancel out. Worse still -100 and -100 will be better than 0 and 0.
> Maybe you were talking about the absolute value of the differences?
> 
> - Well, now that CRUSH can use multiple weights, the problem that
> remains, I think, is seeing whether the optimization is: a) reliable
> and b) fast enough.
> 
> Cheers,
> Pedro.
> 
> [1] http://www.benfrederickson.com/numerical-optimization/
> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
> 
> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>> Hi Pedro,
>>
>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the differences between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>
>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>
>> Before optimization the situation is:
>>
>>          ~expected~  ~objects~  ~delta~   ~delta%~
>> ~name~
>> dc1            1024       1024        0   0.000000
>> host0           256        294       38  14.843750
>> device0         128        153       25  19.531250
>> device1         128        141       13  10.156250
>> host1           256        301       45  17.578125
>> device2         128        157       29  22.656250
>> device3         128        144       16  12.500000
>> host2           512        429      -83 -16.210938
>> device4         128         96      -32 -25.000000
>> device5         128        117      -11  -8.593750
>> device6         256        216      -40 -15.625000
>>
>> and after optimization we have the following:
>>
>>          ~expected~  ~objects~  ~delta~  ~delta%~
>> ~name~
>> dc1            1024       1024        0  0.000000
>> host0           256        259        3  1.171875
>> device0         128        129        1  0.781250
>> device1         128        130        2  1.562500
>> host1           256        258        2  0.781250
>> device2         128        129        1  0.781250
>> device3         128        129        1  0.781250
>> host2           512        507       -5 -0.976562
>> device4         128        126       -2 -1.562500
>> device5         128        127       -1 -0.781250
>> device6         256        254       -2 -0.781250
>>
>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>
>> Cheers
>>
>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>> [4] https://github.com/ceph/ceph/pull/14486
>>
>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>> There are a lot of gradient-free methods. I will try first to run the
>>> ones available using just scipy
>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>> Some of them don't require the gradient and some of them can estimate
>>> it. The reason to go without the gradient is to run the CRUSH
>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>
>>> - BEGIN CODE -
>>> def build_target(desired_freqs):
>>>     def target(weights):
>>>         # run a simulation of CRUSH for a number of objects
>>>         sim_freqs = run_crush(weights)
>>>         # Kullback-Leibler divergence between desired frequencies and
>>> current ones
>>>         return loss(sim_freqs, desired_freqs)
>>>    return target
>>>
>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>> - END CODE -
>>>
>>> The tricky thing here is that this procedure can be slow if the
>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>> simulated frequencies. This is true especially if the minimize method
>>> attempts to approximate the gradient using finite differences, since it
>>> will evaluate the target function a number of times proportional to
>>> the number of weights. Apart from the ones in scipy I would also try
>>> optimization methods that try to perform as few evaluations as
>>> possible like for example HyperOpt
>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>> account that the target function can be noisy.
>>>
>>> This black box approximation is simple to implement and makes the
>>> computer do all the work instead of us.
>>> I think that this black box approximation is worthy to try even if
>>> it's not the final one because if this approximation works then we
>>> know that a more elaborate one that computes the gradient of the CRUSH
>>> algorithm will work for sure.
>>>
>>> I can try this black box approximation this weekend not on the real
>>> CRUSH algorithm but with the simple implementation I did in python. If
>>> it works it's just a matter of substituting one simulation with
>>> another and see what happens.
>>>
>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>> Hi Loic,
>>>>>
>>>>> From what I see everything seems OK.
>>>>
>>>> Cool. I'll keep going in this direction then !
>>>>
>>>>> The interesting thing would be to
>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>> algorithm.
>>>>
>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>
>>>>> That's the work that remains to be done. The only way that
>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>> necessity of computing the gradient either by using a gradient-free
>>>>> optimization method or making an estimation of the gradient.
>>>>
>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>
>>>> Cheers
>>>>
>>>>>
>>>>>
>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi,
>>>>>>
>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>
>>>>>> Thanks !
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>> ...
>>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>>
>>>>>>
>>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>> ------------------------------------------------------------------------
>>>>>> Before: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>> ...
>>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>> Converged to desired accuracy :)
>>>>>> After: All replicas on each hard drive
>>>>>> Expected vs actual use (20000 samples)
>>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>>
>>>>>>
>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>
>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>
>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>
>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>
>>>>>>>>> I keep working on it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>>> haven't made much headway.
>>>>>>>>>>
>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>> work on.
>>>>>>>>>>>
>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>
>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>> round/replica/rank.
>>>>>>>>>>
>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>> change and evolve independently.  (In most cases any change in
>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>
>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>> implementation?
>>>>>>>>>>
>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>> retry_descent.
>>>>>>>>>>
>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>> adapted)!
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-26 21:08                                             ` Loic Dachary
@ 2017-04-26 22:25                                               ` Loic Dachary
  2017-04-27  6:12                                                 ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-26 22:25 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Ceph Development

It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.

We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
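
For reference, here is a minimal sketch of how such a KL-based loss can be computed and minimized with Nelder-Mead. It assumes a run_crush(weights) helper that returns per-host object counts from a simulation, expected_counts holding the target counts and initial_weights holding the starting weights (24 per host); the names are illustrative, not the actual python-crush code.

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

def kl_divergence(expected_counts, actual_counts):
    # normalize both histograms into probability distributions
    p = np.asarray(expected_counts, dtype=float)
    q = np.asarray(actual_counts, dtype=float)
    p /= p.sum()
    q /= q.sum()
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return float(np.sum(p * np.log(p / q)))

def build_loss(expected_counts):
    def loss(weights):
        actual_counts = run_crush(weights)  # simulate placement with these weights
        return kl_divergence(expected_counts, actual_counts)
    return loss

# start from the target weights and let Nelder-Mead adjust them
result = minimize(build_loss(expected_counts), initial_weights,
                  method='nelder-mead', options={'xtol': 1e-8, 'disp': True})
optimized_weights = result.x
- END CODE -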

           ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
dc1            102400    102400        0   0.000000      1008
host0            2438      2390      -48  -1.968827        24
host1            2438      2370      -68  -2.789171        24
host2            2438      2493       55   2.255947        24
host3            2438      2396      -42  -1.722724        24
host4            2438      2497       59   2.420016        24
host5            2438      2520       82   3.363413        24
host6            2438      2500       62   2.543068        24
host7            2438      2380      -58  -2.378999        24
host8            2438      2488       50   2.050861        24
host9            2438      2435       -3  -0.123052        24
host10           2438      2440        2   0.082034        24
host11           2438      2472       34   1.394586        24
host12           2438      2346      -92  -3.773585        24
host13           2438      2411      -27  -1.107465        24
host14           2438      2513       75   3.076292        24
host15           2438      2421      -17  -0.697293        24
host16           2438      2469       31   1.271534        24
host17           2438      2419      -19  -0.779327        24
host18           2438      2424      -14  -0.574241        24
host19           2438      2451       13   0.533224        24
host20           2438      2486       48   1.968827        24
host21           2438      2439        1   0.041017        24
host22           2438      2482       44   1.804758        24
host23           2438      2415      -23  -0.943396        24
host24           2438      2389      -49  -2.009844        24
host25           2438      2265     -173  -7.095980        24
host26           2438      2374      -64  -2.625103        24
host27           2438      2529       91   3.732568        24
host28           2438      2495       57   2.337982        24
host29           2438      2433       -5  -0.205086        24
host30           2438      2485       47   1.927810        24
host31           2438      2377      -61  -2.502051        24
host32           2438      2441        3   0.123052        24
host33           2438      2421      -17  -0.697293        24
host34           2438      2359      -79  -3.240361        24
host35           2438      2509       71   2.912223        24
host36           2438      2425      -13  -0.533224        24
host37           2438      2419      -19  -0.779327        24
host38           2438      2403      -35  -1.435603        24
host39           2438      2458       20   0.820345        24
host40           2438      2458       20   0.820345        24
host41           2438      2503       65   2.666120        24

           ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
dc1            102400    102400        0   0.000000         1008
host0            2438      2438        0   0.000000    24.559919
host1            2438      2438        0   0.000000    24.641221
host2            2438      2440        2   0.082034    23.486113
host3            2438      2437       -1  -0.041017    24.525875
host4            2438      2436       -2  -0.082034    23.644304
host5            2438      2440        2   0.082034    23.245287
host6            2438      2442        4   0.164069    23.617162
host7            2438      2439        1   0.041017    24.746174
host8            2438      2436       -2  -0.082034    23.584667
host9            2438      2439        1   0.041017    24.140637
host10           2438      2438        0   0.000000    24.060084
host11           2438      2441        3   0.123052    23.730349
host12           2438      2437       -1  -0.041017    24.948602
host13           2438      2437       -1  -0.041017    24.280851
host14           2438      2436       -2  -0.082034    23.402216
host15           2438      2436       -2  -0.082034    24.272037
host16           2438      2437       -1  -0.041017    23.747867
host17           2438      2436       -2  -0.082034    24.266271
host18           2438      2438        0   0.000000    24.158545
host19           2438      2440        2   0.082034    23.934788
host20           2438      2438        0   0.000000    23.630851
host21           2438      2435       -3  -0.123052    24.001950
host22           2438      2440        2   0.082034    23.623120
host23           2438      2437       -1  -0.041017    24.343138
host24           2438      2438        0   0.000000    24.595820
host25           2438      2439        1   0.041017    25.547510
host26           2438      2437       -1  -0.041017    24.753111
host27           2438      2437       -1  -0.041017    23.288606
host28           2438      2437       -1  -0.041017    23.425059
host29           2438      2438        0   0.000000    24.115941
host30           2438      2441        3   0.123052    23.560539
host31           2438      2438        0   0.000000    24.459911
host32           2438      2440        2   0.082034    24.096746
host33           2438      2437       -1  -0.041017    24.241316
host34           2438      2438        0   0.000000    24.715044
host35           2438      2436       -2  -0.082034    23.424601
host36           2438      2436       -2  -0.082034    24.123606
host37           2438      2439        1   0.041017    24.368997
host38           2438      2440        2   0.082034    24.331532
host39           2438      2439        1   0.041017    23.803561
host40           2438      2437       -1  -0.041017    23.861094
host41           2438      2442        4   0.164069    23.468473


On 04/26/2017 11:08 PM, Loic Dachary wrote:
> 
> 
> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>> Hi Loic,
>>
>> Well, the results are better certainly! Some comments:
>>
>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen
>> but I'm not an expert in optimization either. I wonder how it
>> will scale with more weights[1]. My attempt at using scipy's optimize
>> didn't work because you are optimizing a stochastic function and this
>> can make scipy decide that no further steps are possible. The
>> field that studies this kind of problem is stochastic optimization
>> [2]
> 
> You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
> 
> Even in a case as simple as 12 devices starting with:
> 
>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
> host1      2560.000000      2580  20.000000   0.781250        24
> device12    106.666667       101  -5.666667  -5.312500         1
> device13    213.333333       221   7.666667   3.593750         2
> device14    320.000000       317  -3.000000  -0.937500         3
> device15    106.666667       101  -5.666667  -5.312500         1
> device16    213.333333       217   3.666667   1.718750         2
> device17    320.000000       342  22.000000   6.875000         3
> device18    106.666667       102  -4.666667  -4.375000         1
> device19    213.333333       243  29.666667  13.906250         2
> device20    320.000000       313  -7.000000  -2.187500         3
> device21    106.666667        94 -12.666667 -11.875000         1
> device22    213.333333       208  -5.333333  -2.500000         2
> device23    320.000000       321   1.000000   0.312500         3
> 
>             res = minimize(crush, weights, method='nelder-mead',
>                            options={'xtol': 1e-8, 'disp': True})
> 
> device weights [ 1.  3.  3.  2.  3.  2.  2.  1.  3.  1.  1.  2.]
> device kl 0.00117274995028
> ...
> device kl 0.00016530695476
> Optimization terminated successfully.
>          Current function value: 0.000165
>          Iterations: 117
>          Function evaluations: 470
> 
> we still get a 5% difference on device 21:
> 
>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
> host1      2560.000000      2559 -1.000000 -0.039062  23.805183
> device12    106.666667       103 -3.666667 -3.437500   1.016999
> device13    213.333333       214  0.666667  0.312500   1.949328
> device14    320.000000       325  5.000000  1.562500   3.008688
> device15    106.666667       106 -0.666667 -0.625000   1.012565
> device16    213.333333       214  0.666667  0.312500   1.976344
> device17    320.000000       320  0.000000  0.000000   2.845135
> device18    106.666667       102 -4.666667 -4.375000   1.039181
> device19    213.333333       214  0.666667  0.312500   1.820435
> device20    320.000000       324  4.000000  1.250000   3.062573
> device21    106.666667       101 -5.666667 -5.312500   1.071341
> device22    213.333333       212 -1.333333 -0.625000   2.039190
> device23    320.000000       324  4.000000  1.250000   3.016468
> 
>  
>> - I used KL divergence for the loss function. My first attempt was
>> using, as you did, the standard deviation (more commonly known as L2 loss) with
>> gradient descent, but it didn't work very well.
>>
>> - Sum of differences sounds like a bad idea, +100 and -100 errors will
>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>> Maybe you were talking about the absolute value of the differences?
>>
>> - Well, now that CRUSH can use multiple weights, the problem that
>> remains I think is seeing if the optimization problem is: a) reliable
>> and b) fast enough
>>
>> Cheers,
>> Pedro.
>>
>> [1] http://www.benfrederickson.com/numerical-optimization/
>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>
>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>> Hi Pedro,
>>>
>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>
>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>>
>>> Before optimization the situation is:
>>>
>>>          ~expected~  ~objects~  ~delta~   ~delta%~
>>> ~name~
>>> dc1            1024       1024        0   0.000000
>>> host0           256        294       38  14.843750
>>> device0         128        153       25  19.531250
>>> device1         128        141       13  10.156250
>>> host1           256        301       45  17.578125
>>> device2         128        157       29  22.656250
>>> device3         128        144       16  12.500000
>>> host2           512        429      -83 -16.210938
>>> device4         128         96      -32 -25.000000
>>> device5         128        117      -11  -8.593750
>>> device6         256        216      -40 -15.625000
>>>
>>> and after optimization we have the following:
>>>
>>>          ~expected~  ~objects~  ~delta~  ~delta%~
>>> ~name~
>>> dc1            1024       1024        0  0.000000
>>> host0           256        259        3  1.171875
>>> device0         128        129        1  0.781250
>>> device1         128        130        2  1.562500
>>> host1           256        258        2  0.781250
>>> device2         128        129        1  0.781250
>>> device3         128        129        1  0.781250
>>> host2           512        507       -5 -0.976562
>>> device4         128        126       -2 -1.562500
>>> device5         128        127       -1 -0.781250
>>> device6         256        254       -2 -0.781250
>>>
>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>
>>> Cheers
>>>
>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>> [4] https://github.com/ceph/ceph/pull/14486
>>>
>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>> There are a lot of gradient-free methods. I will try first to run the
>>>> ones available using just scipy
>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>> Some of them don't require the gradient and some of them can estimate
>>>> it. The reason to go without the gradient is to run the CRUSH
>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>
>>>> - BEGIN CODE -
>>>> def build_target(desired_freqs):
>>>>     def target(weights):
>>>>         # run a simulation of CRUSH for a number of objects
>>>>         sim_freqs = run_crush(weights)
>>>>         # Kullback-Leibler divergence between desired frequencies and current ones
>>>>         return loss(sim_freqs, desired_freqs)
>>>>     return target
>>>>
>>>> weights = scipy.optimize.minimize(build_target(desired_freqs))
>>>> - END CODE -
>>>>
>>>> The tricky thing here is that this procedure can be slow if the
>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>> simulated frequencies. This is especially true if the minimize method
>>>> attempts to approximate the gradient using finite differences, since it
>>>> will evaluate the target function a number of times proportional to
>>>> the number of weights. Apart from the ones in scipy I would also try
>>>> optimization methods that aim to perform as few evaluations as
>>>> possible, for example HyperOpt
>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>> account that the target function can be noisy.
>>>>
>>>> This black box approximation is simple to implement and makes the
>>>> computer do all the work instead of us.
>>>> I think that this black box approximation is worth trying even if
>>>> it's not the final one because if this approximation works then we
>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>> algorithm will work for sure.
>>>>
>>>> I can try this black box approximation this weekend not on the real
>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>> it works it's just a matter of substituting one simulation with
>>>> another and see what happens.
>>>>
>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>> Hi Loic,
>>>>>>
>>>>>> From what I see everything seems OK.
>>>>>
>>>>> Cool. I'll keep going in this direction then !
>>>>>
>>>>>> The interesting thing would be to
>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>> algorithm.
>>>>>
>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>>
>>>>>> That's the work that remains to be done. The only way that
>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>> optimization method or making an estimation of the gradient.
>>>>>
>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>
>>>>> Cheers
>>>>>
>>>>>>
>>>>>>
>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>
>>>>>>> Thanks !
>>>>>>>
>>>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>>>> ------------------------------------------------------------------------
>>>>>>> Before: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>> ...
>>>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>> Converged to desired accuracy :)
>>>>>>> After: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>>>
>>>>>>>
>>>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>>> ------------------------------------------------------------------------
>>>>>>> Before: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>> ...
>>>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>> Converged to desired accuracy :)
>>>>>>> After: All replicas on each hard drive
>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>>>
>>>>>>>
>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>
>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>
>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>
>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>
>>>>>>>>>> I keep working on it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>> the python notebook I need to substitute the iteration over all
>>>>>>>>>>>> possible devices permutations to iteration over all the possible
>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>> work on.
>>>>>>>>>>>>
>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>
>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>
>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>> change and evolve independently.  (In most cases any change in
>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>
>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>> implementation?
>>>>>>>>>>>
>>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>> retry_descent.
>>>>>>>>>>>
>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>> adapted)!
>>>>>>>>>>>
>>>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-26 22:25                                               ` Loic Dachary
@ 2017-04-27  6:12                                                 ` Loic Dachary
  2017-04-27 16:47                                                   ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-27  6:12 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Ceph Development

With 63 hosts instead of 41 we get the same results: from kl 1.9169485575e-04 to kl 3.0384231953e-07, with the maximum difference going from ~8% to ~0.5%. What is interesting (at least to me ;-) is that the weights don't change that much: they all stay in the range ]23,25].

Note that all this optimization is done by changing a single weight per host. It is worth trying again with two different weights per host (which is what you did in https://github.com/plafl/notebooks/blob/master/replication.ipynb): the weight for the first draw stays fixed at its original value (i.e. 24) and only the weight for the second draw is allowed to change.
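
Here is a rough sketch of what that could look like, reusing the kl_divergence and run_crush helpers from the sketch in my previous message, except that run_crush is now assumed to take two weight sets, one for the first draw and one for the remaining draws (like the libcrush hack that accepts two weights); the names are illustrative only.

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

FIRST_DRAW_WEIGHT = 24.0  # stays fixed for every host

def build_loss(expected_counts, num_hosts):
    first_weights = np.full(num_hosts, FIRST_DRAW_WEIGHT)
    def loss(second_weights):
        # only the second-draw weights are free variables for the optimizer
        actual_counts = run_crush(first_weights, second_weights)
        return kl_divergence(expected_counts, actual_counts)
    return loss

# start the second-draw weights at 24 as well and let Nelder-Mead move them
initial = np.full(num_hosts, FIRST_DRAW_WEIGHT)
result = minimize(build_loss(expected_counts, num_hosts), initial,
                  method='nelder-mead', options={'xtol': 1e-8, 'disp': True})
second_draw_weights = result.x
- END CODE -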

Before optimization

host0            2400      2345      -55  -2.291667        24
host1            2400      2434       34   1.416667        24
host2            2400      2387      -13  -0.541667        24
host3            2400      2351      -49  -2.041667        24
host4            2400      2423       23   0.958333        24
host5            2400      2456       56   2.333333        24
host6            2400      2450       50   2.083333        24
host7            2400      2307      -93  -3.875000        24
host8            2400      2434       34   1.416667        24
host9            2400      2358      -42  -1.750000        24
host10           2400      2452       52   2.166667        24
host11           2400      2398       -2  -0.083333        24
host12           2400      2359      -41  -1.708333        24
host13           2400      2403        3   0.125000        24
host14           2400      2484       84   3.500000        24
host15           2400      2348      -52  -2.166667        24
host16           2400      2489       89   3.708333        24
host17           2400      2412       12   0.500000        24
host18           2400      2416       16   0.666667        24
host19           2400      2453       53   2.208333        24
host20           2400      2475       75   3.125000        24
host21           2400      2413       13   0.541667        24
host22           2400      2450       50   2.083333        24
host23           2400      2348      -52  -2.166667        24
host24           2400      2355      -45  -1.875000        24
host25           2400      2348      -52  -2.166667        24
host26           2400      2373      -27  -1.125000        24
host27           2400      2470       70   2.916667        24
host28           2400      2449       49   2.041667        24
host29           2400      2420       20   0.833333        24
host30           2400      2406        6   0.250000        24
host31           2400      2376      -24  -1.000000        24
host32           2400      2371      -29  -1.208333        24
host33           2400      2395       -5  -0.208333        24
host34           2400      2351      -49  -2.041667        24
host35           2400      2453       53   2.208333        24
host36           2400      2421       21   0.875000        24
host37           2400      2393       -7  -0.291667        24
host38           2400      2394       -6  -0.250000        24
host39           2400      2322      -78  -3.250000        24
host40           2400      2409        9   0.375000        24
host41           2400      2486       86   3.583333        24
host42           2400      2466       66   2.750000        24
host43           2400      2409        9   0.375000        24
host44           2400      2276     -124  -5.166667        24
host45           2400      2379      -21  -0.875000        24
host46           2400      2394       -6  -0.250000        24
host47           2400      2401        1   0.041667        24
host48           2400      2446       46   1.916667        24
host49           2400      2349      -51  -2.125000        24
host50           2400      2413       13   0.541667        24
host51           2400      2333      -67  -2.791667        24
host52           2400      2387      -13  -0.541667        24
host53           2400      2407        7   0.291667        24
host54           2400      2377      -23  -0.958333        24
host55           2400      2441       41   1.708333        24
host56           2400      2420       20   0.833333        24
host57           2400      2388      -12  -0.500000        24
host58           2400      2460       60   2.500000        24
host59           2400      2394       -6  -0.250000        24
host60           2400      2316      -84  -3.500000        24
host61           2400      2373      -27  -1.125000        24
host62           2400      2362      -38  -1.583333        24
host63           2400      2372      -28  -1.166667        24

After optimization

host0            2400      2403        3   0.125000    24.575153
host1            2400      2401        1   0.041667    23.723316
host2            2400      2402        2   0.083333    24.168746
host3            2400      2399       -1  -0.041667    24.520240
host4            2400      2399       -1  -0.041667    23.911445
host5            2400      2400        0   0.000000    23.606956
host6            2400      2401        1   0.041667    23.714102
host7            2400      2400        0   0.000000    25.008463
host8            2400      2399       -1  -0.041667    23.557143
host9            2400      2399       -1  -0.041667    24.431548
host10           2400      2400        0   0.000000    23.494153
host11           2400      2401        1   0.041667    23.976621
host12           2400      2400        0   0.000000    24.512622
host13           2400      2397       -3  -0.125000    24.010814
host14           2400      2398       -2  -0.083333    23.229791
host15           2400      2402        2   0.083333    24.510854
host16           2400      2401        1   0.041667    23.188161
host17           2400      2397       -3  -0.125000    23.931915
host18           2400      2400        0   0.000000    23.886135
host19           2400      2398       -2  -0.083333    23.442129
host20           2400      2401        1   0.041667    23.393092
host21           2400      2398       -2  -0.083333    23.940452
host22           2400      2401        1   0.041667    23.643843
host23           2400      2403        3   0.125000    24.592113
host24           2400      2402        2   0.083333    24.561842
host25           2400      2401        1   0.041667    24.598754
host26           2400      2398       -2  -0.083333    24.350951
host27           2400      2399       -1  -0.041667    23.336478
host28           2400      2401        1   0.041667    23.549652
host29           2400      2401        1   0.041667    23.840408
host30           2400      2400        0   0.000000    23.932423
host31           2400      2397       -3  -0.125000    24.295621
host32           2400      2402        2   0.083333    24.298228
host33           2400      2403        3   0.125000    24.068700
host34           2400      2399       -1  -0.041667    24.395416
host35           2400      2398       -2  -0.083333    23.522074
host36           2400      2395       -5  -0.208333    23.746354
host37           2400      2402        2   0.083333    24.120875
host38           2400      2401        1   0.041667    24.034644
host39           2400      2400        0   0.000000    24.665110
host40           2400      2400        0   0.000000    23.856618
host41           2400      2400        0   0.000000    23.265386
host42           2400      2398       -2  -0.083333    23.334984
host43           2400      2400        0   0.000000    23.950316
host44           2400      2404        4   0.166667    25.276133
host45           2400      2399       -1  -0.041667    24.272922
host46           2400      2399       -1  -0.041667    24.013644
host47           2400      2402        2   0.083333    24.113955
host48           2400      2404        4   0.166667    23.582616
host49           2400      2400        0   0.000000    24.531067
host50           2400      2400        0   0.000000    23.784893
host51           2400      2401        1   0.041667    24.793213
host52           2400      2400        0   0.000000    24.170809
host53           2400      2400        0   0.000000    23.783899
host54           2400      2399       -1  -0.041667    24.365295
host55           2400      2398       -2  -0.083333    23.645767
host56           2400      2401        1   0.041667    23.858433
host57           2400      2399       -1  -0.041667    24.159351
host58           2400      2396       -4  -0.166667    23.430493
host59           2400      2402        2   0.083333    24.107154
host60           2400      2403        3   0.125000    24.784382
host61           2400      2397       -3  -0.125000    24.292784
host62           2400      2399       -1  -0.041667    24.404311
host63           2400      2400        0   0.000000    24.219422


On 04/27/2017 12:25 AM, Loic Dachary wrote:
> It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.
> 
> We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
> 
>            ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
> dc1            102400    102400        0   0.000000      1008
> host0            2438      2390      -48  -1.968827        24
> host1            2438      2370      -68  -2.789171        24
> host2            2438      2493       55   2.255947        24
> host3            2438      2396      -42  -1.722724        24
> host4            2438      2497       59   2.420016        24
> host5            2438      2520       82   3.363413        24
> host6            2438      2500       62   2.543068        24
> host7            2438      2380      -58  -2.378999        24
> host8            2438      2488       50   2.050861        24
> host9            2438      2435       -3  -0.123052        24
> host10           2438      2440        2   0.082034        24
> host11           2438      2472       34   1.394586        24
> host12           2438      2346      -92  -3.773585        24
> host13           2438      2411      -27  -1.107465        24
> host14           2438      2513       75   3.076292        24
> host15           2438      2421      -17  -0.697293        24
> host16           2438      2469       31   1.271534        24
> host17           2438      2419      -19  -0.779327        24
> host18           2438      2424      -14  -0.574241        24
> host19           2438      2451       13   0.533224        24
> host20           2438      2486       48   1.968827        24
> host21           2438      2439        1   0.041017        24
> host22           2438      2482       44   1.804758        24
> host23           2438      2415      -23  -0.943396        24
> host24           2438      2389      -49  -2.009844        24
> host25           2438      2265     -173  -7.095980        24
> host26           2438      2374      -64  -2.625103        24
> host27           2438      2529       91   3.732568        24
> host28           2438      2495       57   2.337982        24
> host29           2438      2433       -5  -0.205086        24
> host30           2438      2485       47   1.927810        24
> host31           2438      2377      -61  -2.502051        24
> host32           2438      2441        3   0.123052        24
> host33           2438      2421      -17  -0.697293        24
> host34           2438      2359      -79  -3.240361        24
> host35           2438      2509       71   2.912223        24
> host36           2438      2425      -13  -0.533224        24
> host37           2438      2419      -19  -0.779327        24
> host38           2438      2403      -35  -1.435603        24
> host39           2438      2458       20   0.820345        24
> host40           2438      2458       20   0.820345        24
> host41           2438      2503       65   2.666120        24
> 
>            ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
> dc1            102400    102400        0   0.000000         1008
> host0            2438      2438        0   0.000000    24.559919
> host1            2438      2438        0   0.000000    24.641221
> host2            2438      2440        2   0.082034    23.486113
> host3            2438      2437       -1  -0.041017    24.525875
> host4            2438      2436       -2  -0.082034    23.644304
> host5            2438      2440        2   0.082034    23.245287
> host6            2438      2442        4   0.164069    23.617162
> host7            2438      2439        1   0.041017    24.746174
> host8            2438      2436       -2  -0.082034    23.584667
> host9            2438      2439        1   0.041017    24.140637
> host10           2438      2438        0   0.000000    24.060084
> host11           2438      2441        3   0.123052    23.730349
> host12           2438      2437       -1  -0.041017    24.948602
> host13           2438      2437       -1  -0.041017    24.280851
> host14           2438      2436       -2  -0.082034    23.402216
> host15           2438      2436       -2  -0.082034    24.272037
> host16           2438      2437       -1  -0.041017    23.747867
> host17           2438      2436       -2  -0.082034    24.266271
> host18           2438      2438        0   0.000000    24.158545
> host19           2438      2440        2   0.082034    23.934788
> host20           2438      2438        0   0.000000    23.630851
> host21           2438      2435       -3  -0.123052    24.001950
> host22           2438      2440        2   0.082034    23.623120
> host23           2438      2437       -1  -0.041017    24.343138
> host24           2438      2438        0   0.000000    24.595820
> host25           2438      2439        1   0.041017    25.547510
> host26           2438      2437       -1  -0.041017    24.753111
> host27           2438      2437       -1  -0.041017    23.288606
> host28           2438      2437       -1  -0.041017    23.425059
> host29           2438      2438        0   0.000000    24.115941
> host30           2438      2441        3   0.123052    23.560539
> host31           2438      2438        0   0.000000    24.459911
> host32           2438      2440        2   0.082034    24.096746
> host33           2438      2437       -1  -0.041017    24.241316
> host34           2438      2438        0   0.000000    24.715044
> host35           2438      2436       -2  -0.082034    23.424601
> host36           2438      2436       -2  -0.082034    24.123606
> host37           2438      2439        1   0.041017    24.368997
> host38           2438      2440        2   0.082034    24.331532
> host39           2438      2439        1   0.041017    23.803561
> host40           2438      2437       -1  -0.041017    23.861094
> host41           2438      2442        4   0.164069    23.468473
> 
> 
> On 04/26/2017 11:08 PM, Loic Dachary wrote:
>>
>>
>> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>>> Hi Loic,
>>>
>>> Well, the results are better certainly! Some comments:
>>>
>>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen
>>> but I'm not an expert in optimization either. I wonder how it
>>> will scale with more weights[1]. My attempt at using scipy's optimize
>>> didn't work because you are optimizing a stochastic function and this
>>> can make scipy decide that no further steps are possible. The
>>> field that studies this kind of problem is stochastic optimization
>>> [2]
>>
>> You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
>>
>> Even in a case as simple as 12 devices starting with:
>>
>>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
>> host1      2560.000000      2580  20.000000   0.781250        24
>> device12    106.666667       101  -5.666667  -5.312500         1
>> device13    213.333333       221   7.666667   3.593750         2
>> device14    320.000000       317  -3.000000  -0.937500         3
>> device15    106.666667       101  -5.666667  -5.312500         1
>> device16    213.333333       217   3.666667   1.718750         2
>> device17    320.000000       342  22.000000   6.875000         3
>> device18    106.666667       102  -4.666667  -4.375000         1
>> device19    213.333333       243  29.666667  13.906250         2
>> device20    320.000000       313  -7.000000  -2.187500         3
>> device21    106.666667        94 -12.666667 -11.875000         1
>> device22    213.333333       208  -5.333333  -2.500000         2
>> device23    320.000000       321   1.000000   0.312500         3
>>
>>             res = minimize(crush, weights, method='nelder-mead',
>>                            options={'xtol': 1e-8, 'disp': True})
>>
>> device weights [ 1.  3.  3.  2.  3.  2.  2.  1.  3.  1.  1.  2.]
>> device kl 0.00117274995028
>> ...
>> device kl 0.00016530695476
>> Optimization terminated successfully.
>>          Current function value: 0.000165
>>          Iterations: 117
>>          Function evaluations: 470
>>
>> we still get a 5% difference on device 21:
>>
>>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
>> host1      2560.000000      2559 -1.000000 -0.039062  23.805183
>> device12    106.666667       103 -3.666667 -3.437500   1.016999
>> device13    213.333333       214  0.666667  0.312500   1.949328
>> device14    320.000000       325  5.000000  1.562500   3.008688
>> device15    106.666667       106 -0.666667 -0.625000   1.012565
>> device16    213.333333       214  0.666667  0.312500   1.976344
>> device17    320.000000       320  0.000000  0.000000   2.845135
>> device18    106.666667       102 -4.666667 -4.375000   1.039181
>> device19    213.333333       214  0.666667  0.312500   1.820435
>> device20    320.000000       324  4.000000  1.250000   3.062573
>> device21    106.666667       101 -5.666667 -5.312500   1.071341
>> device22    213.333333       212 -1.333333 -0.625000   2.039190
>> device23    320.000000       324  4.000000  1.250000   3.016468
>>
>>  
>>> - I used KL divergence for the loss function. My first attempt was
>>> using, as you did, the standard deviation (more commonly known as L2 loss) with
>>> gradient descent, but it didn't work very well.
>>>
>>> - Sum of differences sounds like a bad idea, +100 and -100 errors will
>>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>>> Maybe you were talking about the absolute value of the differences?
>>>
>>> - Well, now that CRUSH can use multiple weights, the problem that
>>> remains I think is seeing if the optimization problem is: a) reliable
>>> and b) fast enough
>>>
>>> Cheers,
>>> Pedro.
>>>
>>> [1] http://www.benfrederickson.com/numerical-optimization/
>>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>>
>>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>>> Hi Pedro,
>>>>
>>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>>
>>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>>>
>>>> Before optimization the situation is:
>>>>
>>>>          ~expected~  ~objects~  ~delta~   ~delta%~
>>>> ~name~
>>>> dc1            1024       1024        0   0.000000
>>>> host0           256        294       38  14.843750
>>>> device0         128        153       25  19.531250
>>>> device1         128        141       13  10.156250
>>>> host1           256        301       45  17.578125
>>>> device2         128        157       29  22.656250
>>>> device3         128        144       16  12.500000
>>>> host2           512        429      -83 -16.210938
>>>> device4         128         96      -32 -25.000000
>>>> device5         128        117      -11  -8.593750
>>>> device6         256        216      -40 -15.625000
>>>>
>>>> and after optimization we have the following:
>>>>
>>>>          ~expected~  ~objects~  ~delta~  ~delta%~
>>>> ~name~
>>>> dc1            1024       1024        0  0.000000
>>>> host0           256        259        3  1.171875
>>>> device0         128        129        1  0.781250
>>>> device1         128        130        2  1.562500
>>>> host1           256        258        2  0.781250
>>>> device2         128        129        1  0.781250
>>>> device3         128        129        1  0.781250
>>>> host2           512        507       -5 -0.976562
>>>> device4         128        126       -2 -1.562500
>>>> device5         128        127       -1 -0.781250
>>>> device6         256        254       -2 -0.781250
>>>>
>>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>>
>>>> Cheers
>>>>
>>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>>> [4] https://github.com/ceph/ceph/pull/14486
>>>>
>>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>>> There are a lot of gradient-free methods. I will try first to run the
>>>>> ones available using just scipy
>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>> Some of them don't require the gradient and some of them can estimate
>>>>> it. The reason to go without the gradient is to run the CRUSH
>>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>>
>>>>> - BEGIN CODE -
>>>>> def build_target(desired_freqs):
>>>>>     def target(weights):
>>>>>         # run a simulation of CRUSH for a number of objects
>>>>>         sim_freqs = run_crush(weights)
>>>>>         # Kullback-Leibler divergence between desired and simulated frequencies
>>>>>         return loss(sim_freqs, desired_freqs)
>>>>>     return target
>>>>>
>>>>> # initial_weights is the starting guess (x0) that minimize requires
>>>>> result = scipy.optimize.minimize(build_target(desired_freqs), initial_weights)
>>>>> weights = result.x
>>>>> - END CODE -
>>>>>
>>>>> The tricky thing here is that this procedure can be slow if the
>>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>>> simulated frequencies. This is true especially if the minimize method
>>>>> attempts to approximate the gradient using finite differences since it
>>>>> will evaluate the target function a number of times proportional to
>>>>> the number of weights. Apart from the ones in scipy I would also try
>>>>> optimization methods that try to perform as few evaluations as
>>>>> possible like for example HyperOpt
>>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>>> account that the target function can be noisy.
>>>>>
>>>>> This black box approximation is simple to implement and makes the
>>>>> computer do all the work instead of us.
>>>>> I think that this black box approximation is worth trying even if
>>>>> it's not the final one because if this approximation works then we
>>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>>> algorithm will work for sure.
>>>>>
>>>>> I can try this black box approximation this weekend not on the real
>>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>>> it works it's just a matter of substituting one simulation with
>>>>> another and see what happens.
>>>>>
>>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>>> Hi Loic,
>>>>>>>
>>>>>>> From what I see everything seems OK.
>>>>>>
>>>>>> Cool. I'll keep going in this direction then !
>>>>>>
>>>>>>> The interesting thing would be to
>>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>>> algorithm.
>>>>>>
>>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>>>
>>>>>>> That's the work that remains to be done. The only way that
>>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>>> optimization method or making an estimation of the gradient.
>>>>>>
>>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>>
>>>>>>>> Thanks !
>>>>>>>>
>>>>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> Before: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>>> ...
>>>>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>>> Converged to desired accuracy :)
>>>>>>>> After: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>>>>
>>>>>>>>
>>>>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>>>> ------------------------------------------------------------------------
>>>>>>>> Before: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>>> ...
>>>>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>>> Converged to desired accuracy :)
>>>>>>>> After: All replicas on each hard drive
>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>>>>
>>>>>>>>
>>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>>
>>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>>> Hi Pedro,
>>>>>>>>>
>>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>>
>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>
>>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>>
>>>>>>>>>>> I keep working on it.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>>> the python notebook I need to replace the iteration over all possible
>>>>>>>>>>>>> device permutations with iteration over all the possible
>>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>>> work on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>>
>>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>>
>>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>>> change and evolve independently.  (In most cases any change in
>>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>>
>>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>
>>>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>>> retry_descent.
>>>>>>>>>>>>
>>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>>> adapted)!
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-27  6:12                                                 ` Loic Dachary
@ 2017-04-27 16:47                                                   ` Loic Dachary
  2017-04-27 22:14                                                     ` Loic Dachary
  0 siblings, 1 reply; 70+ messages in thread
From: Loic Dachary @ 2017-04-27 16:47 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Ceph Development

Hi Pedro,

After I suspected uniform weights could be a border case, I tried with varying weights and did not get good results. Nelder-Mead also tried (and why not) negative values for the weights, which are invalid for CRUSH. And since there is no way to specify the value bounds for Nelder-Mead, that makes it a bad candidate for the job.

Next in line seems to be L-BFGS-B [1] which 

a) projects a gradient and is likely to run faster
b) allows a min value to be defined for each weight so we won't have negative values
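
To make this concrete, here is a minimal self-contained sketch of the kind of call I have in mind. It is toy code, not python-crush: run_crush below just normalizes the weights, where the real target would run the simulation and count objects per host, and the loss is the KL divergence Pedro suggested.

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the CRUSH simulation: it only normalizes the weights.
# The real target would run python-crush and return per-host frequencies.
def run_crush(weights):
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

def build_target(desired_freqs):
    def target(weights):
        sim_freqs = run_crush(weights)
        # Kullback-Leibler divergence between desired and simulated frequencies
        return float(np.sum(desired_freqs * np.log(desired_freqs / sim_freqs)))
    return target

desired_freqs = np.array([10., 10., 10., 10., 1.])
desired_freqs /= desired_freqs.sum()

# start from flat weights; any multiple of desired_freqs is an optimum
initial_weights = np.ones(5)

# one (min, max) pair per weight; None means no upper bound
bounds = [(0.1, None)] * len(initial_weights)
res = minimize(build_target(desired_freqs), initial_weights,
               method='L-BFGS-B', bounds=bounds)
print(res.x)
- END CODE -

With the toy run_crush the result is just a set of weights proportional to desired_freqs, but it shows where the bounds and the loss plug in.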

I'll go in this direction unless you tell me "Noooooo this is a baaaaad idea" ;-)

Cheers

[1] https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb

On 04/27/2017 08:12 AM, Loic Dachary wrote:
> With 63 hosts instead of 41 we get the same results: from kl 1.9169485575e-04 to kl 3.0384231953e-07 with a maximum difference going from ~8% to ~0.5%. What is interesting (at least to me ;-) is that the weights don't change that much, they all stay in the range ]23,25].
> 
> Note that all this optimization is done by changing a single weight per host. It is worth trying again with two different weights (which is what you did in https://github.com/plafl/notebooks/blob/master/replication.ipynb). The weight for the first draw stays as it is (i.e. 24) and the weight for the second draw is allowed to change.
> 
> Before optimization
> 
> host0            2400      2345      -55  -2.291667        24
> host1            2400      2434       34   1.416667        24
> host2            2400      2387      -13  -0.541667        24
> host3            2400      2351      -49  -2.041667        24
> host4            2400      2423       23   0.958333        24
> host5            2400      2456       56   2.333333        24
> host6            2400      2450       50   2.083333        24
> host7            2400      2307      -93  -3.875000        24
> host8            2400      2434       34   1.416667        24
> host9            2400      2358      -42  -1.750000        24
> host10           2400      2452       52   2.166667        24
> host11           2400      2398       -2  -0.083333        24
> host12           2400      2359      -41  -1.708333        24
> host13           2400      2403        3   0.125000        24
> host14           2400      2484       84   3.500000        24
> host15           2400      2348      -52  -2.166667        24
> host16           2400      2489       89   3.708333        24
> host17           2400      2412       12   0.500000        24
> host18           2400      2416       16   0.666667        24
> host19           2400      2453       53   2.208333        24
> host20           2400      2475       75   3.125000        24
> host21           2400      2413       13   0.541667        24
> host22           2400      2450       50   2.083333        24
> host23           2400      2348      -52  -2.166667        24
> host24           2400      2355      -45  -1.875000        24
> host25           2400      2348      -52  -2.166667        24
> host26           2400      2373      -27  -1.125000        24
> host27           2400      2470       70   2.916667        24
> host28           2400      2449       49   2.041667        24
> host29           2400      2420       20   0.833333        24
> host30           2400      2406        6   0.250000        24
> host31           2400      2376      -24  -1.000000        24
> host32           2400      2371      -29  -1.208333        24
> host33           2400      2395       -5  -0.208333        24
> host34           2400      2351      -49  -2.041667        24
> host35           2400      2453       53   2.208333        24
> host36           2400      2421       21   0.875000        24
> host37           2400      2393       -7  -0.291667        24
> host38           2400      2394       -6  -0.250000        24
> host39           2400      2322      -78  -3.250000        24
> host40           2400      2409        9   0.375000        24
> host41           2400      2486       86   3.583333        24
> host42           2400      2466       66   2.750000        24
> host43           2400      2409        9   0.375000        24
> host44           2400      2276     -124  -5.166667        24
> host45           2400      2379      -21  -0.875000        24
> host46           2400      2394       -6  -0.250000        24
> host47           2400      2401        1   0.041667        24
> host48           2400      2446       46   1.916667        24
> host49           2400      2349      -51  -2.125000        24
> host50           2400      2413       13   0.541667        24
> host51           2400      2333      -67  -2.791667        24
> host52           2400      2387      -13  -0.541667        24
> host53           2400      2407        7   0.291667        24
> host54           2400      2377      -23  -0.958333        24
> host55           2400      2441       41   1.708333        24
> host56           2400      2420       20   0.833333        24
> host57           2400      2388      -12  -0.500000        24
> host58           2400      2460       60   2.500000        24
> host59           2400      2394       -6  -0.250000        24
> host60           2400      2316      -84  -3.500000        24
> host61           2400      2373      -27  -1.125000        24
> host62           2400      2362      -38  -1.583333        24
> host63           2400      2372      -28  -1.166667        24
> 
> After optimization
> 
> host0            2400      2403        3   0.125000    24.575153
> host1            2400      2401        1   0.041667    23.723316
> host2            2400      2402        2   0.083333    24.168746
> host3            2400      2399       -1  -0.041667    24.520240
> host4            2400      2399       -1  -0.041667    23.911445
> host5            2400      2400        0   0.000000    23.606956
> host6            2400      2401        1   0.041667    23.714102
> host7            2400      2400        0   0.000000    25.008463
> host8            2400      2399       -1  -0.041667    23.557143
> host9            2400      2399       -1  -0.041667    24.431548
> host10           2400      2400        0   0.000000    23.494153
> host11           2400      2401        1   0.041667    23.976621
> host12           2400      2400        0   0.000000    24.512622
> host13           2400      2397       -3  -0.125000    24.010814
> host14           2400      2398       -2  -0.083333    23.229791
> host15           2400      2402        2   0.083333    24.510854
> host16           2400      2401        1   0.041667    23.188161
> host17           2400      2397       -3  -0.125000    23.931915
> host18           2400      2400        0   0.000000    23.886135
> host19           2400      2398       -2  -0.083333    23.442129
> host20           2400      2401        1   0.041667    23.393092
> host21           2400      2398       -2  -0.083333    23.940452
> host22           2400      2401        1   0.041667    23.643843
> host23           2400      2403        3   0.125000    24.592113
> host24           2400      2402        2   0.083333    24.561842
> host25           2400      2401        1   0.041667    24.598754
> host26           2400      2398       -2  -0.083333    24.350951
> host27           2400      2399       -1  -0.041667    23.336478
> host28           2400      2401        1   0.041667    23.549652
> host29           2400      2401        1   0.041667    23.840408
> host30           2400      2400        0   0.000000    23.932423
> host31           2400      2397       -3  -0.125000    24.295621
> host32           2400      2402        2   0.083333    24.298228
> host33           2400      2403        3   0.125000    24.068700
> host34           2400      2399       -1  -0.041667    24.395416
> host35           2400      2398       -2  -0.083333    23.522074
> host36           2400      2395       -5  -0.208333    23.746354
> host37           2400      2402        2   0.083333    24.120875
> host38           2400      2401        1   0.041667    24.034644
> host39           2400      2400        0   0.000000    24.665110
> host40           2400      2400        0   0.000000    23.856618
> host41           2400      2400        0   0.000000    23.265386
> host42           2400      2398       -2  -0.083333    23.334984
> host43           2400      2400        0   0.000000    23.950316
> host44           2400      2404        4   0.166667    25.276133
> host45           2400      2399       -1  -0.041667    24.272922
> host46           2400      2399       -1  -0.041667    24.013644
> host47           2400      2402        2   0.083333    24.113955
> host48           2400      2404        4   0.166667    23.582616
> host49           2400      2400        0   0.000000    24.531067
> host50           2400      2400        0   0.000000    23.784893
> host51           2400      2401        1   0.041667    24.793213
> host52           2400      2400        0   0.000000    24.170809
> host53           2400      2400        0   0.000000    23.783899
> host54           2400      2399       -1  -0.041667    24.365295
> host55           2400      2398       -2  -0.083333    23.645767
> host56           2400      2401        1   0.041667    23.858433
> host57           2400      2399       -1  -0.041667    24.159351
> host58           2400      2396       -4  -0.166667    23.430493
> host59           2400      2402        2   0.083333    24.107154
> host60           2400      2403        3   0.125000    24.784382
> host61           2400      2397       -3  -0.125000    24.292784
> host62           2400      2399       -1  -0.041667    24.404311
> host63           2400      2400        0   0.000000    24.219422
> 
> 
> On 04/27/2017 12:25 AM, Loic Dachary wrote:
>> It seems to work when the distribution has enough samples. I tried with 40 hosts and a distribution with 100,000 samples.
>>
>> We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
>>
>>            ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
>> dc1            102400    102400        0   0.000000      1008
>> host0            2438      2390      -48  -1.968827        24
>> host1            2438      2370      -68  -2.789171        24
>> host2            2438      2493       55   2.255947        24
>> host3            2438      2396      -42  -1.722724        24
>> host4            2438      2497       59   2.420016        24
>> host5            2438      2520       82   3.363413        24
>> host6            2438      2500       62   2.543068        24
>> host7            2438      2380      -58  -2.378999        24
>> host8            2438      2488       50   2.050861        24
>> host9            2438      2435       -3  -0.123052        24
>> host10           2438      2440        2   0.082034        24
>> host11           2438      2472       34   1.394586        24
>> host12           2438      2346      -92  -3.773585        24
>> host13           2438      2411      -27  -1.107465        24
>> host14           2438      2513       75   3.076292        24
>> host15           2438      2421      -17  -0.697293        24
>> host16           2438      2469       31   1.271534        24
>> host17           2438      2419      -19  -0.779327        24
>> host18           2438      2424      -14  -0.574241        24
>> host19           2438      2451       13   0.533224        24
>> host20           2438      2486       48   1.968827        24
>> host21           2438      2439        1   0.041017        24
>> host22           2438      2482       44   1.804758        24
>> host23           2438      2415      -23  -0.943396        24
>> host24           2438      2389      -49  -2.009844        24
>> host25           2438      2265     -173  -7.095980        24
>> host26           2438      2374      -64  -2.625103        24
>> host27           2438      2529       91   3.732568        24
>> host28           2438      2495       57   2.337982        24
>> host29           2438      2433       -5  -0.205086        24
>> host30           2438      2485       47   1.927810        24
>> host31           2438      2377      -61  -2.502051        24
>> host32           2438      2441        3   0.123052        24
>> host33           2438      2421      -17  -0.697293        24
>> host34           2438      2359      -79  -3.240361        24
>> host35           2438      2509       71   2.912223        24
>> host36           2438      2425      -13  -0.533224        24
>> host37           2438      2419      -19  -0.779327        24
>> host38           2438      2403      -35  -1.435603        24
>> host39           2438      2458       20   0.820345        24
>> host40           2438      2458       20   0.820345        24
>> host41           2438      2503       65   2.666120        24
>>
>>            ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
>> dc1            102400    102400        0   0.000000         1008
>> host0            2438      2438        0   0.000000    24.559919
>> host1            2438      2438        0   0.000000    24.641221
>> host2            2438      2440        2   0.082034    23.486113
>> host3            2438      2437       -1  -0.041017    24.525875
>> host4            2438      2436       -2  -0.082034    23.644304
>> host5            2438      2440        2   0.082034    23.245287
>> host6            2438      2442        4   0.164069    23.617162
>> host7            2438      2439        1   0.041017    24.746174
>> host8            2438      2436       -2  -0.082034    23.584667
>> host9            2438      2439        1   0.041017    24.140637
>> host10           2438      2438        0   0.000000    24.060084
>> host11           2438      2441        3   0.123052    23.730349
>> host12           2438      2437       -1  -0.041017    24.948602
>> host13           2438      2437       -1  -0.041017    24.280851
>> host14           2438      2436       -2  -0.082034    23.402216
>> host15           2438      2436       -2  -0.082034    24.272037
>> host16           2438      2437       -1  -0.041017    23.747867
>> host17           2438      2436       -2  -0.082034    24.266271
>> host18           2438      2438        0   0.000000    24.158545
>> host19           2438      2440        2   0.082034    23.934788
>> host20           2438      2438        0   0.000000    23.630851
>> host21           2438      2435       -3  -0.123052    24.001950
>> host22           2438      2440        2   0.082034    23.623120
>> host23           2438      2437       -1  -0.041017    24.343138
>> host24           2438      2438        0   0.000000    24.595820
>> host25           2438      2439        1   0.041017    25.547510
>> host26           2438      2437       -1  -0.041017    24.753111
>> host27           2438      2437       -1  -0.041017    23.288606
>> host28           2438      2437       -1  -0.041017    23.425059
>> host29           2438      2438        0   0.000000    24.115941
>> host30           2438      2441        3   0.123052    23.560539
>> host31           2438      2438        0   0.000000    24.459911
>> host32           2438      2440        2   0.082034    24.096746
>> host33           2438      2437       -1  -0.041017    24.241316
>> host34           2438      2438        0   0.000000    24.715044
>> host35           2438      2436       -2  -0.082034    23.424601
>> host36           2438      2436       -2  -0.082034    24.123606
>> host37           2438      2439        1   0.041017    24.368997
>> host38           2438      2440        2   0.082034    24.331532
>> host39           2438      2439        1   0.041017    23.803561
>> host40           2438      2437       -1  -0.041017    23.861094
>> host41           2438      2442        4   0.164069    23.468473
>>
>>
>> On 04/26/2017 11:08 PM, Loic Dachary wrote:
>>>
>>>
>>> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>>>> Hi Loic,
>>>>
>>>> Well, the results are better certainly! Some comments:
>>>>
>>>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
>>>> but I'm not an expert in optimization either. I wonder how it
>>>> will scale with more weights[1]. My attempt at using scipy's optimize
>>>> didn't work because you are optimizing a stochastic function and this
>>>> can make scipy decide that no further steps are possible. The
>>>> field that studies this kind of problem is stochastic optimization
>>>> [2]
>>>
>>> You were right, it does not always work. Note that this is *not* about the conditional probability bias. This is about the uneven distribution due to the low number of values in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to get enough samples. It is not an isolated problem, it's what happens most of the time.
>>>
>>> Even in a case as simple as 12 devices starting with:
>>>
>>>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
>>> host1      2560.000000      2580  20.000000   0.781250        24
>>> device12    106.666667       101  -5.666667  -5.312500         1
>>> device13    213.333333       221   7.666667   3.593750         2
>>> device14    320.000000       317  -3.000000  -0.937500         3
>>> device15    106.666667       101  -5.666667  -5.312500         1
>>> device16    213.333333       217   3.666667   1.718750         2
>>> device17    320.000000       342  22.000000   6.875000         3
>>> device18    106.666667       102  -4.666667  -4.375000         1
>>> device19    213.333333       243  29.666667  13.906250         2
>>> device20    320.000000       313  -7.000000  -2.187500         3
>>> device21    106.666667        94 -12.666667 -11.875000         1
>>> device22    213.333333       208  -5.333333  -2.500000         2
>>> device23    320.000000       321   1.000000   0.312500         3
>>>
>>>             res = minimize(crush, weights, method='nelder-mead',
>>>                            options={'xtol': 1e-8, 'disp': True})
>>>
>>> device weights [ 1.  3.  3.  2.  3.  2.  2.  1.  3.  1.  1.  2.]
>>> device kl 0.00117274995028
>>> ...
>>> device kl 0.00016530695476
>>> Optimization terminated successfully.
>>>          Current function value: 0.000165
>>>          Iterations: 117
>>>          Function evaluations: 470
>>>
>>> we still get a 5% difference on device 21:
>>>
>>>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
>>> host1      2560.000000      2559 -1.000000 -0.039062  23.805183
>>> device12    106.666667       103 -3.666667 -3.437500   1.016999
>>> device13    213.333333       214  0.666667  0.312500   1.949328
>>> device14    320.000000       325  5.000000  1.562500   3.008688
>>> device15    106.666667       106 -0.666667 -0.625000   1.012565
>>> device16    213.333333       214  0.666667  0.312500   1.976344
>>> device17    320.000000       320  0.000000  0.000000   2.845135
>>> device18    106.666667       102 -4.666667 -4.375000   1.039181
>>> device19    213.333333       214  0.666667  0.312500   1.820435
>>> device20    320.000000       324  4.000000  1.250000   3.062573
>>> device21    106.666667       101 -5.666667 -5.312500   1.071341
>>> device22    213.333333       212 -1.333333 -0.625000   2.039190
>>> device23    320.000000       324  4.000000  1.250000   3.016468
>>>
>>>  
>>>> - I used KL divergence for the loss function. My first attempt was
>>>> using, as you did, the standard deviation (more commonly known as L2 loss)
>>>> with gradient descent, but it didn't work very well.
>>>>
>>>> - Sum of differences sounds like a bad idea, +100 and -100 errors will
>>>> cancel out. Worse still -100 and -100 will be better than 0 and 0.
>>>> Maybe you were talking about the absolute value of the differences?
>>>>
>>>> - Well, now that CRUSH can use multiple weights, the problem that
>>>> remains I think is seeing if the optimization problem is: a) reliable
>>>> and b) fast enough
>>>>
>>>> Cheers,
>>>> Pedro.
>>>>
>>>> [1] http://www.benfrederickson.com/numerical-optimization/
>>>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>>>
>>>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>>>> Hi Pedro,
>>>>>
>>>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nelder-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function is simply the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>>>
>>>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because the simulation runs faster than with the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's ok to use the same method even though the root cause of the uneven distribution is different because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>>>>
>>>>> Before optimization the situation is:
>>>>>
>>>>>          ~expected~  ~objects~  ~delta~   ~delta%~
>>>>> ~name~
>>>>> dc1            1024       1024        0   0.000000
>>>>> host0           256        294       38  14.843750
>>>>> device0         128        153       25  19.531250
>>>>> device1         128        141       13  10.156250
>>>>> host1           256        301       45  17.578125
>>>>> device2         128        157       29  22.656250
>>>>> device3         128        144       16  12.500000
>>>>> host2           512        429      -83 -16.210938
>>>>> device4         128         96      -32 -25.000000
>>>>> device5         128        117      -11  -8.593750
>>>>> device6         256        216      -40 -15.625000
>>>>>
>>>>> and after optimization we have the following:
>>>>>
>>>>>          ~expected~  ~objects~  ~delta~  ~delta%~
>>>>> ~name~
>>>>> dc1            1024       1024        0  0.000000
>>>>> host0           256        259        3  1.171875
>>>>> device0         128        129        1  0.781250
>>>>> device1         128        130        2  1.562500
>>>>> host1           256        258        2  0.781250
>>>>> device2         128        129        1  0.781250
>>>>> device3         128        129        1  0.781250
>>>>> host2           512        507       -5 -0.976562
>>>>> device4         128        126       -2 -1.562500
>>>>> device5         128        127       -1 -0.781250
>>>>> device6         256        254       -2 -0.781250
>>>>>
>>>>> Do you think I should keep going in this direction ? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>>>
>>>>> Cheers
>>>>>
>>>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>>>> [4] https://github.com/ceph/ceph/pull/14486
>>>>>
>>>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>>>> There are a lot of gradient-free methods. I will try first to run the
>>>>>> ones available using just scipy
>>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>>> Some of them don't require the gradient and some of them can estimate
>>>>>> it. The reason to go without the gradient is to run the CRUSH
>>>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>>>
>>>>>> - BEGIN CODE -
>>>>>> def build_target(desired_freqs):
>>>>>>     def target(weights):
>>>>>>         # run a simulation of CRUSH for a number of objects
>>>>>>         sim_freqs = run_crush(weights)
>>>>>>         # Kullback-Leibler divergence between desired and simulated frequencies
>>>>>>         return loss(sim_freqs, desired_freqs)
>>>>>>     return target
>>>>>>
>>>>>> # initial_weights is the starting guess (x0) that minimize requires
>>>>>> result = scipy.optimize.minimize(build_target(desired_freqs), initial_weights)
>>>>>> weights = result.x
>>>>>> - END CODE -
>>>>>>
>>>>>> The tricky thing here is that this procedure can be slow if the
>>>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>>>> simulated frequencies. This is true especially if the minimize method
>>>>>> attempts to approximate the gradient using finite differences since it
>>>>>> will evaluate the target function a number of times proportional to
>>>>>> the number of weights. Apart from the ones in scipy I would also try
>>>>>> optimization methods that try to perform as few evaluations as
>>>>>> possible like for example HyperOpt
>>>>>> (http://hyperopt.github.io/hyperopt/), which by the way takes into
>>>>>> account that the target function can be noisy.
>>>>>>
>>>>>> This black box approximation is simple to implement and makes the
>>>>>> computer do all the work instead of us.
>>>>>> I think that this black box approximation is worth trying even if
>>>>>> it's not the final one because if this approximation works then we
>>>>>> know that a more elaborate one that computes the gradient of the CRUSH
>>>>>> algorithm will work for sure.
>>>>>>
>>>>>> I can try this black box approximation this weekend not on the real
>>>>>> CRUSH algorithm but with the simple implementation I did in python. If
>>>>>> it works it's just a matter of substituting one simulation with
>>>>>> another and see what happens.
>>>>>>
>>>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>> Hi Pedro,
>>>>>>>
>>>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>>>> Hi Loic,
>>>>>>>>
>>>>>>>> From what I see everything seems OK.
>>>>>>>
>>>>>>> Cool. I'll keep going in this direction then !
>>>>>>>
>>>>>>>> The interesting thing would be to
>>>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>>>> algorithm.
>>>>>>>
>>>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
>>>>>>>
>>>>>>>> That's the work that remains to be done. The only way that
>>>>>>>> would avoid reimplementing the CRUSH algorithm and computing the
>>>>>>>> gradient would be treating CRUSH as a black box and eliminating the
>>>>>>>> necessity of computing the gradient either by using a gradient-free
>>>>>>>> optimization method or making an estimation of the gradient.
>>>>>>>
>>>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo ?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results which are encouraging. Do you think what I did is sensible ? Or is there a problem I don't see ?
>>>>>>>>>
>>>>>>>>> Thanks !
>>>>>>>>>
>>>>>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>>>> ...
>>>>>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>>>> ...
>>>>>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
>>>>>>>>>
>>>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>>>> Hi Pedro,
>>>>>>>>>>
>>>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken ?
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>
>>>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>
>>>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>>>
>>>>>>>>>>>> I keep working on it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>>>> the python notebook I need to replace the iteration over all possible
>>>>>>>>>>>>>> device permutations with iteration over all the possible
>>>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>>>> work on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The other thing is of course that weights change for each replica.
>>>>>>>>>>>>>> That is, they cannot be really fixed in the crush map. So the
>>>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, need to be
>>>>>>>>>>>>>> changed. The weights in the crush map should reflect then, maybe, the
>>>>>>>>>>>>>> desired usage frequencies. Or maybe each replica should have their own
>>>>>>>>>>>>>> crush map, but then the information about the previous selection
>>>>>>>>>>>>>> should be passed to the next replica placement run so it avoids
>>>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>>>> change and evolve independently.  (In most cases any change in
>>>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>
>>>>>>>>>>>>> The main difference is the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>>>> retry_descent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>>>> adapted)!
>>>>>>>>>>>>>
>>>>>>>>>>>>> sage
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: crush multipick anomaly
  2017-04-27 16:47                                                   ` Loic Dachary
@ 2017-04-27 22:14                                                     ` Loic Dachary
  0 siblings, 0 replies; 70+ messages in thread
From: Loic Dachary @ 2017-04-27 22:14 UTC (permalink / raw)
  To: Pedro López-Adeva; +Cc: Ceph Development

TL;DR: either I'm doing something wrong or scipy.optimize's L-BFGS-B does not converge to anything useful.

Trying L-BFGS-B wasn't that difficult. Only eps (the step size used to approximate the gradient by finite differences) gave me trouble but I think I chose something sensible. However ... it does not converge to anything useful. The code itself is at http://libcrush.org/dachary/python-crush/blob/b19af6d0da0ac4f8c6d9fb1c8828775539df7feb/tests/test_analyze.py#L235 and a summary of the output is shown below.
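
To illustrate why I suspect the noise of the simulated loss is part of the problem, here is a small self-contained experiment with a toy noisy objective (not the python-crush code): L-BFGS-B approximates the gradient by finite differences with step eps, and when eps is small compared to the noise the estimated gradient is mostly noise.

- BEGIN CODE -
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy noisy objective: a smooth bowl with its minimum at x = 3, plus sampling
# noise standing in for a KL divergence estimated from a finite simulation.
def noisy_loss(x, noise=1e-3):
    return float(np.sum((np.asarray(x) - 3.0) ** 2) + noise * rng.standard_normal())

x0 = np.ones(5)
for eps in (1e-8, 1e-1):
    res = minimize(noisy_loss, x0, method='L-BFGS-B',
                   bounds=[(0.1, None)] * len(x0),
                   options={'eps': eps})
    # with eps=1e-8 the finite differences are dominated by the noise,
    # with eps=1e-1 they track the true gradient much better
    print(eps, res.x.round(2), round(res.fun, 4))
- END CODE -

If that is what happens here, eps has to be large compared to the noise of the simulation, or the simulation has to use more samples per evaluation.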

I think I'm stuck now, unfortunately. Any idea on how to move forward ?

Cheers

bounds = [(0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None), (0.1, None)]
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           10     M =           10

At X0         0 variables are exactly at the bounds

host weights [ 1.  1.  1.  1.  1.  5.  1.  1.  1.  1.]
host kl 0.395525661546
...
host weights [ 7.06935073  0.59036832  0.58504545  0.57290196  0.55298047  0.1
  0.54095906  0.60123172  0.54841584  0.68277045]
host kl 0.0511888801117

At iterate   12    f=  5.12013D-02    |proj g|=  1.00098D-03

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
   10     12     62     13     1     1   1.001D-03   5.120D-02
  F =  5.12013167923604934E-002

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             

 Warning:  more than 10 function and gradient
   evaluations in the last line search.  Termination
   may possibly be caused by a bad search direction.

 Cauchy                time 0.000E+00 seconds.
 Subspace minimization time 0.000E+00 seconds.
 Line search           time 0.000E+00 seconds.

 Total User time 0.000E+00 seconds.
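
As an aside, the kl values printed above (and in the earlier tables) are, I believe, the Kullback-Leibler divergence between the normalized expected counts and the normalized simulated counts. A minimal sketch for the record, in plain numpy rather than the python-crush code:

- BEGIN CODE -
import numpy as np

def kl(expected, actual):
    # Kullback-Leibler divergence between two discrete distributions,
    # here the expected and the simulated object counts per host
    p = np.asarray(expected, dtype=float) / np.sum(expected)
    q = np.asarray(actual, dtype=float) / np.sum(actual)
    return float(np.sum(p * np.log(p / q)))

# a perfect match gives 0, a skewed distribution a small positive value
print(kl([2400] * 4, [2400, 2400, 2400, 2400]))
print(kl([2400] * 4, [2345, 2434, 2387, 2351]))
- END CODE -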


On 04/27/2017 06:47 PM, Loic Dachary wrote:
> Hi Pedro,
> 
> After I suspected uniform weights could be a border case, I tried with varying weights and did not get good results. Nelder-Mead also tried (and why not) negative values for the weights, which are invalid for CRUSH. And since there is no way to specify the value bounds for Nelder-Mead, that makes it a bad candidate for the job.
> 
> Next in line seems to be L-BFGS-B [1] which 
> 
> a) projects a gradient and is likely to run faster
> b) allows a min value to be defined for each weight so we won't have negative values
> 
> I'll go in this direction unless you tell me "Noooooo this is a baaaaad idea" ;-)
> 
> Cheers
> 
> [1] https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
> 
> On 04/27/2017 08:12 AM, Loic Dachary wrote:
>> With 63 hosts instead of 41 we get the same results: from kl 1.9169485575e-04 to kl 3.0384231953e-07 with a maximum difference going from ~8% to ~0.5%. What is interesting (at least to me ;-) is that the weights don't change that much, they all stay in the range ]23,25].
>>
>> Note that all this optimization is done by changing a single weight per host. It is worth trying again with two different weights (which is what you did in https://github.com/plafl/notebooks/blob/master/replication.ipynb). The weight for the first draw stays as it is (i.e. 24) and the weight for the second draw is allowed to change.
>>
>> Before optimization
>>
>> host0            2400      2345      -55  -2.291667        24
>> host1            2400      2434       34   1.416667        24
>> host2            2400      2387      -13  -0.541667        24
>> host3            2400      2351      -49  -2.041667        24
>> host4            2400      2423       23   0.958333        24
>> host5            2400      2456       56   2.333333        24
>> host6            2400      2450       50   2.083333        24
>> host7            2400      2307      -93  -3.875000        24
>> host8            2400      2434       34   1.416667        24
>> host9            2400      2358      -42  -1.750000        24
>> host10           2400      2452       52   2.166667        24
>> host11           2400      2398       -2  -0.083333        24
>> host12           2400      2359      -41  -1.708333        24
>> host13           2400      2403        3   0.125000        24
>> host14           2400      2484       84   3.500000        24
>> host15           2400      2348      -52  -2.166667        24
>> host16           2400      2489       89   3.708333        24
>> host17           2400      2412       12   0.500000        24
>> host18           2400      2416       16   0.666667        24
>> host19           2400      2453       53   2.208333        24
>> host20           2400      2475       75   3.125000        24
>> host21           2400      2413       13   0.541667        24
>> host22           2400      2450       50   2.083333        24
>> host23           2400      2348      -52  -2.166667        24
>> host24           2400      2355      -45  -1.875000        24
>> host25           2400      2348      -52  -2.166667        24
>> host26           2400      2373      -27  -1.125000        24
>> host27           2400      2470       70   2.916667        24
>> host28           2400      2449       49   2.041667        24
>> host29           2400      2420       20   0.833333        24
>> host30           2400      2406        6   0.250000        24
>> host31           2400      2376      -24  -1.000000        24
>> host32           2400      2371      -29  -1.208333        24
>> host33           2400      2395       -5  -0.208333        24
>> host34           2400      2351      -49  -2.041667        24
>> host35           2400      2453       53   2.208333        24
>> host36           2400      2421       21   0.875000        24
>> host37           2400      2393       -7  -0.291667        24
>> host38           2400      2394       -6  -0.250000        24
>> host39           2400      2322      -78  -3.250000        24
>> host40           2400      2409        9   0.375000        24
>> host41           2400      2486       86   3.583333        24
>> host42           2400      2466       66   2.750000        24
>> host43           2400      2409        9   0.375000        24
>> host44           2400      2276     -124  -5.166667        24
>> host45           2400      2379      -21  -0.875000        24
>> host46           2400      2394       -6  -0.250000        24
>> host47           2400      2401        1   0.041667        24
>> host48           2400      2446       46   1.916667        24
>> host49           2400      2349      -51  -2.125000        24
>> host50           2400      2413       13   0.541667        24
>> host51           2400      2333      -67  -2.791667        24
>> host52           2400      2387      -13  -0.541667        24
>> host53           2400      2407        7   0.291667        24
>> host54           2400      2377      -23  -0.958333        24
>> host55           2400      2441       41   1.708333        24
>> host56           2400      2420       20   0.833333        24
>> host57           2400      2388      -12  -0.500000        24
>> host58           2400      2460       60   2.500000        24
>> host59           2400      2394       -6  -0.250000        24
>> host60           2400      2316      -84  -3.500000        24
>> host61           2400      2373      -27  -1.125000        24
>> host62           2400      2362      -38  -1.583333        24
>> host63           2400      2372      -28  -1.166667        24
>>
>> After optimization
>>
>> host0            2400      2403        3   0.125000    24.575153
>> host1            2400      2401        1   0.041667    23.723316
>> host2            2400      2402        2   0.083333    24.168746
>> host3            2400      2399       -1  -0.041667    24.520240
>> host4            2400      2399       -1  -0.041667    23.911445
>> host5            2400      2400        0   0.000000    23.606956
>> host6            2400      2401        1   0.041667    23.714102
>> host7            2400      2400        0   0.000000    25.008463
>> host8            2400      2399       -1  -0.041667    23.557143
>> host9            2400      2399       -1  -0.041667    24.431548
>> host10           2400      2400        0   0.000000    23.494153
>> host11           2400      2401        1   0.041667    23.976621
>> host12           2400      2400        0   0.000000    24.512622
>> host13           2400      2397       -3  -0.125000    24.010814
>> host14           2400      2398       -2  -0.083333    23.229791
>> host15           2400      2402        2   0.083333    24.510854
>> host16           2400      2401        1   0.041667    23.188161
>> host17           2400      2397       -3  -0.125000    23.931915
>> host18           2400      2400        0   0.000000    23.886135
>> host19           2400      2398       -2  -0.083333    23.442129
>> host20           2400      2401        1   0.041667    23.393092
>> host21           2400      2398       -2  -0.083333    23.940452
>> host22           2400      2401        1   0.041667    23.643843
>> host23           2400      2403        3   0.125000    24.592113
>> host24           2400      2402        2   0.083333    24.561842
>> host25           2400      2401        1   0.041667    24.598754
>> host26           2400      2398       -2  -0.083333    24.350951
>> host27           2400      2399       -1  -0.041667    23.336478
>> host28           2400      2401        1   0.041667    23.549652
>> host29           2400      2401        1   0.041667    23.840408
>> host30           2400      2400        0   0.000000    23.932423
>> host31           2400      2397       -3  -0.125000    24.295621
>> host32           2400      2402        2   0.083333    24.298228
>> host33           2400      2403        3   0.125000    24.068700
>> host34           2400      2399       -1  -0.041667    24.395416
>> host35           2400      2398       -2  -0.083333    23.522074
>> host36           2400      2395       -5  -0.208333    23.746354
>> host37           2400      2402        2   0.083333    24.120875
>> host38           2400      2401        1   0.041667    24.034644
>> host39           2400      2400        0   0.000000    24.665110
>> host40           2400      2400        0   0.000000    23.856618
>> host41           2400      2400        0   0.000000    23.265386
>> host42           2400      2398       -2  -0.083333    23.334984
>> host43           2400      2400        0   0.000000    23.950316
>> host44           2400      2404        4   0.166667    25.276133
>> host45           2400      2399       -1  -0.041667    24.272922
>> host46           2400      2399       -1  -0.041667    24.013644
>> host47           2400      2402        2   0.083333    24.113955
>> host48           2400      2404        4   0.166667    23.582616
>> host49           2400      2400        0   0.000000    24.531067
>> host50           2400      2400        0   0.000000    23.784893
>> host51           2400      2401        1   0.041667    24.793213
>> host52           2400      2400        0   0.000000    24.170809
>> host53           2400      2400        0   0.000000    23.783899
>> host54           2400      2399       -1  -0.041667    24.365295
>> host55           2400      2398       -2  -0.083333    23.645767
>> host56           2400      2401        1   0.041667    23.858433
>> host57           2400      2399       -1  -0.041667    24.159351
>> host58           2400      2396       -4  -0.166667    23.430493
>> host59           2400      2402        2   0.083333    24.107154
>> host60           2400      2403        3   0.125000    24.784382
>> host61           2400      2397       -3  -0.125000    24.292784
>> host62           2400      2399       -1  -0.041667    24.404311
>> host63           2400      2400        0   0.000000    24.219422
>>
>>
>> On 04/27/2017 12:25 AM, Loic Dachary wrote:
>>> It seems to work when the distribution has enough samples. I tried with 42 hosts (the run shown below) and a distribution with 100,000 samples.
>>>
>>> We go from kl =~ 1e-4 (with as much as 10% difference) to kl =~ 1e-7 (with no more than 0.5% difference). I will do some more experiments and try to think of patterns where this would not work.
>>>
>>>            ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
>>> dc1            102400    102400        0   0.000000      1008
>>> host0            2438      2390      -48  -1.968827        24
>>> host1            2438      2370      -68  -2.789171        24
>>> host2            2438      2493       55   2.255947        24
>>> host3            2438      2396      -42  -1.722724        24
>>> host4            2438      2497       59   2.420016        24
>>> host5            2438      2520       82   3.363413        24
>>> host6            2438      2500       62   2.543068        24
>>> host7            2438      2380      -58  -2.378999        24
>>> host8            2438      2488       50   2.050861        24
>>> host9            2438      2435       -3  -0.123052        24
>>> host10           2438      2440        2   0.082034        24
>>> host11           2438      2472       34   1.394586        24
>>> host12           2438      2346      -92  -3.773585        24
>>> host13           2438      2411      -27  -1.107465        24
>>> host14           2438      2513       75   3.076292        24
>>> host15           2438      2421      -17  -0.697293        24
>>> host16           2438      2469       31   1.271534        24
>>> host17           2438      2419      -19  -0.779327        24
>>> host18           2438      2424      -14  -0.574241        24
>>> host19           2438      2451       13   0.533224        24
>>> host20           2438      2486       48   1.968827        24
>>> host21           2438      2439        1   0.041017        24
>>> host22           2438      2482       44   1.804758        24
>>> host23           2438      2415      -23  -0.943396        24
>>> host24           2438      2389      -49  -2.009844        24
>>> host25           2438      2265     -173  -7.095980        24
>>> host26           2438      2374      -64  -2.625103        24
>>> host27           2438      2529       91   3.732568        24
>>> host28           2438      2495       57   2.337982        24
>>> host29           2438      2433       -5  -0.205086        24
>>> host30           2438      2485       47   1.927810        24
>>> host31           2438      2377      -61  -2.502051        24
>>> host32           2438      2441        3   0.123052        24
>>> host33           2438      2421      -17  -0.697293        24
>>> host34           2438      2359      -79  -3.240361        24
>>> host35           2438      2509       71   2.912223        24
>>> host36           2438      2425      -13  -0.533224        24
>>> host37           2438      2419      -19  -0.779327        24
>>> host38           2438      2403      -35  -1.435603        24
>>> host39           2438      2458       20   0.820345        24
>>> host40           2438      2458       20   0.820345        24
>>> host41           2438      2503       65   2.666120        24
>>>
>>>            ~expected~  ~actual~  ~delta~   ~delta%~     ~weight~
>>> dc1            102400    102400        0   0.000000         1008
>>> host0            2438      2438        0   0.000000    24.559919
>>> host1            2438      2438        0   0.000000    24.641221
>>> host2            2438      2440        2   0.082034    23.486113
>>> host3            2438      2437       -1  -0.041017    24.525875
>>> host4            2438      2436       -2  -0.082034    23.644304
>>> host5            2438      2440        2   0.082034    23.245287
>>> host6            2438      2442        4   0.164069    23.617162
>>> host7            2438      2439        1   0.041017    24.746174
>>> host8            2438      2436       -2  -0.082034    23.584667
>>> host9            2438      2439        1   0.041017    24.140637
>>> host10           2438      2438        0   0.000000    24.060084
>>> host11           2438      2441        3   0.123052    23.730349
>>> host12           2438      2437       -1  -0.041017    24.948602
>>> host13           2438      2437       -1  -0.041017    24.280851
>>> host14           2438      2436       -2  -0.082034    23.402216
>>> host15           2438      2436       -2  -0.082034    24.272037
>>> host16           2438      2437       -1  -0.041017    23.747867
>>> host17           2438      2436       -2  -0.082034    24.266271
>>> host18           2438      2438        0   0.000000    24.158545
>>> host19           2438      2440        2   0.082034    23.934788
>>> host20           2438      2438        0   0.000000    23.630851
>>> host21           2438      2435       -3  -0.123052    24.001950
>>> host22           2438      2440        2   0.082034    23.623120
>>> host23           2438      2437       -1  -0.041017    24.343138
>>> host24           2438      2438        0   0.000000    24.595820
>>> host25           2438      2439        1   0.041017    25.547510
>>> host26           2438      2437       -1  -0.041017    24.753111
>>> host27           2438      2437       -1  -0.041017    23.288606
>>> host28           2438      2437       -1  -0.041017    23.425059
>>> host29           2438      2438        0   0.000000    24.115941
>>> host30           2438      2441        3   0.123052    23.560539
>>> host31           2438      2438        0   0.000000    24.459911
>>> host32           2438      2440        2   0.082034    24.096746
>>> host33           2438      2437       -1  -0.041017    24.241316
>>> host34           2438      2438        0   0.000000    24.715044
>>> host35           2438      2436       -2  -0.082034    23.424601
>>> host36           2438      2436       -2  -0.082034    24.123606
>>> host37           2438      2439        1   0.041017    24.368997
>>> host38           2438      2440        2   0.082034    24.331532
>>> host39           2438      2439        1   0.041017    23.803561
>>> host40           2438      2437       -1  -0.041017    23.861094
>>> host41           2438      2442        4   0.164069    23.468473
>>>
>>>
>>> On 04/26/2017 11:08 PM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 04/25/2017 05:04 PM, Pedro López-Adeva wrote:
>>>>> Hi Loic,
>>>>>
>>>>> Well, the results are certainly better! Some comments:
>>>>>
>>>>> - I'm glad Nelder-Mead worked. It's not the one I would have chosen,
>>>>> but I'm not an expert in optimization either. I wonder how it
>>>>> will scale with more weights[1]. My attempt at using scipy's optimize
>>>>> didn't work because you are optimizing a stochastic function, and that
>>>>> can make scipy decide that no further steps are possible. The
>>>>> field that studies this kind of problem is stochastic optimization [2]
>>>>
>>>> You were right, it does not always work. Note that this is *not* about the conditional probability bias: it is about the uneven distribution caused by the low number of samples in the distribution. I think this case should be treated separately, with a different method. In Ceph clusters, large and small, the number of PGs per host is unlikely to be large enough to provide enough samples. It is not an isolated problem, it's what happens most of the time.
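>>>>
>>>> As a back-of-the-envelope check of how big that sampling noise is (a
>>>> rough sketch only, assuming placements behave approximately like
>>>> independent weighted draws, which CRUSH only approximates):
>>>>
>>>> - BEGIN CODE -
>>>> from math import sqrt
>>>>
>>>> # numbers from the 12-device example below: 2560 objects on the host,
>>>> # a weight-1 device is expected to receive p = 1/24 of them
>>>> n, p = 2560, 1.0 / 24
>>>> expected = n * p                # ~106.7 objects
>>>> sigma = sqrt(n * p * (1 - p))   # ~10.1 objects, i.e. ~9.5% of expected
>>>> print(expected, sigma, 100 * sigma / expected)
>>>> - END CODE -
>>>>
>>>> So deviations of several percent on a low-weight device are expected
>>>> even with perfectly unbiased placement, and only more samples (or
>>>> averaging over several runs) can shrink them.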
>>>>
>>>> Even in a case as simple as 12 devices starting with:
>>>>
>>>>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
>>>> host1      2560.000000      2580  20.000000   0.781250        24
>>>> device12    106.666667       101  -5.666667  -5.312500         1
>>>> device13    213.333333       221   7.666667   3.593750         2
>>>> device14    320.000000       317  -3.000000  -0.937500         3
>>>> device15    106.666667       101  -5.666667  -5.312500         1
>>>> device16    213.333333       217   3.666667   1.718750         2
>>>> device17    320.000000       342  22.000000   6.875000         3
>>>> device18    106.666667       102  -4.666667  -4.375000         1
>>>> device19    213.333333       243  29.666667  13.906250         2
>>>> device20    320.000000       313  -7.000000  -2.187500         3
>>>> device21    106.666667        94 -12.666667 -11.875000         1
>>>> device22    213.333333       208  -5.333333  -2.500000         2
>>>> device23    320.000000       321   1.000000   0.312500         3
>>>>
>>>>             res = minimize(crush, weights, method='nelder-mead',
>>>>                            options={'xtol': 1e-8, 'disp': True})
>>>>
>>>> device weights [ 1.  3.  3.  2.  3.  2.  2.  1.  3.  1.  1.  2.]
>>>> device kl 0.00117274995028
>>>> ...
>>>> device kl 0.00016530695476
>>>> Optimization terminated successfully.
>>>>          Current function value: 0.000165
>>>>          Iterations: 117
>>>>          Function evaluations: 470
>>>>
>>>> we still get a 5% difference on device 21:
>>>>
>>>>              ~expected~  ~actual~    ~delta~   ~delta%~  ~weight~
>>>> host1      2560.000000      2559 -1.000000 -0.039062  23.805183
>>>> device12    106.666667       103 -3.666667 -3.437500   1.016999
>>>> device13    213.333333       214  0.666667  0.312500   1.949328
>>>> device14    320.000000       325  5.000000  1.562500   3.008688
>>>> device15    106.666667       106 -0.666667 -0.625000   1.012565
>>>> device16    213.333333       214  0.666667  0.312500   1.976344
>>>> device17    320.000000       320  0.000000  0.000000   2.845135
>>>> device18    106.666667       102 -4.666667 -4.375000   1.039181
>>>> device19    213.333333       214  0.666667  0.312500   1.820435
>>>> device20    320.000000       324  4.000000  1.250000   3.062573
>>>> device21    106.666667       101 -5.666667 -5.312500   1.071341
>>>> device22    213.333333       212 -1.333333 -0.625000   2.039190
>>>> device23    320.000000       324  4.000000  1.250000   3.016468
>>>>
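>>>> For reference, a minimal self-contained sketch of this optimization
>>>> loop (a toy stand-in, not the actual python-crush script: "simulate"
>>>> is a plain multinomial draw instead of the crush simulation, so only
>>>> the structure matters, and the loss is noisy as discussed above):
>>>>
>>>> - BEGIN CODE -
>>>> import numpy as np
>>>> from scipy.optimize import minimize
>>>>
>>>> rng = np.random.RandomState(42)
>>>> target = np.array([1., 3., 3., 2., 3., 2., 2., 1., 3., 1., 1., 2.])
>>>> samples = 2560
>>>> expected = target / target.sum() * samples
>>>>
>>>> def simulate(weights):
>>>>     # stand-in for the crush simulation: plain weighted draws
>>>>     p = np.clip(weights, 1e-9, None)
>>>>     return rng.multinomial(samples, p / p.sum())
>>>>
>>>> def loss(weights):
>>>>     # same loss as above: standard deviation of (actual - expected)
>>>>     return np.std(simulate(weights) - expected)
>>>>
>>>> res = minimize(loss, target, method='nelder-mead',
>>>>                options={'disp': True})
>>>> print(res.x)
>>>> - END CODE -
>>>>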
>>>>  
>>>>> - I used KL divergence for the loss function. My first attempt was
>>>>> using, like you, the standard deviation (more commonly known as L2
>>>>> loss) with gradient descent, but it didn't work very well.
>>>>>
>>>>> - The sum of the differences sounds like a bad idea: +100 and -100
>>>>> errors will cancel out. Worse still, -100 and -100 would score better
>>>>> than 0 and 0. Maybe you were talking about the sum of the absolute
>>>>> values of the differences? (See the short sketch below.)
>>>>>
>>>>> - Well, now that CRUSH can use multiple weights, the problem that
>>>>> remains, I think, is checking whether the optimization is a) reliable
>>>>> and b) fast enough.
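>>>>>
>>>>> A tiny illustration of these loss choices (toy numbers, sketch only):
>>>>>
>>>>> - BEGIN CODE -
>>>>> import numpy as np
>>>>>
>>>>> expected = np.array([0.25, 0.25, 0.25, 0.25])
>>>>> actual   = np.array([0.35, 0.15, 0.30, 0.20])
>>>>> delta    = actual - expected
>>>>>
>>>>> print(delta.sum())                   # ~0: the plain sum cancels out
>>>>> print(np.abs(delta).sum())           # L1 loss
>>>>> print(np.sqrt((delta ** 2).mean()))  # L2-style loss (std of deltas)
>>>>> # one common form of the KL divergence between the two distributions
>>>>> print((actual * np.log(actual / expected)).sum())
>>>>> - END CODE -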
>>>>>
>>>>> Cheers,
>>>>> Pedro.
>>>>>
>>>>> [1] http://www.benfrederickson.com/numerical-optimization/
>>>>> [2] https://en.wikipedia.org/wiki/Stochastic_optimization
>>>>>
>>>>> 2017-04-22 18:51 GMT+02:00 Loic Dachary <loic@dachary.org>:
>>>>>> Hi Pedro,
>>>>>>
>>>>>> I tried the optimize function you suggested and got it to work[1]! It is my first time with scipy.optimize[2] and I'm not sure this is done right. In a nutshell I chose the Nedler-Mead method[3] because it seemed simpler. The initial guess is set to the target weights and the loss function simply is the standard deviation of the difference between the expected object count per device and the actual object count returned by the simulation. I'm pretty sure this is not right but I don't know what else to do and it's not completely wrong either. The sum of the differences seems simpler and probably gives the same results.
>>>>>>
>>>>>> I ran the optimization to fix the uneven distribution we see when there are not enough samples, because that simulation runs faster than the one for the multipick anomaly. I suppose it could also work to fix the multipick anomaly. I assume it's OK to use the same method even though the root cause of the uneven distribution is different, because we're not using a gradient-based optimization. But I'm not sure and maybe this is completely wrong...
>>>>>>
>>>>>> Before optimization the situation is:
>>>>>>
>>>>>>          ~expected~  ~objects~  ~delta~   ~delta%~
>>>>>> ~name~
>>>>>> dc1            1024       1024        0   0.000000
>>>>>> host0           256        294       38  14.843750
>>>>>> device0         128        153       25  19.531250
>>>>>> device1         128        141       13  10.156250
>>>>>> host1           256        301       45  17.578125
>>>>>> device2         128        157       29  22.656250
>>>>>> device3         128        144       16  12.500000
>>>>>> host2           512        429      -83 -16.210938
>>>>>> device4         128         96      -32 -25.000000
>>>>>> device5         128        117      -11  -8.593750
>>>>>> device6         256        216      -40 -15.625000
>>>>>>
>>>>>> and after optimization we have the following:
>>>>>>
>>>>>>          ~expected~  ~objects~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> dc1            1024       1024        0  0.000000
>>>>>> host0           256        259        3  1.171875
>>>>>> device0         128        129        1  0.781250
>>>>>> device1         128        130        2  1.562500
>>>>>> host1           256        258        2  0.781250
>>>>>> device2         128        129        1  0.781250
>>>>>> device3         128        129        1  0.781250
>>>>>> host2           512        507       -5 -0.976562
>>>>>> device4         128        126       -2 -1.562500
>>>>>> device5         128        127       -1 -0.781250
>>>>>> device6         256        254       -2 -0.781250
>>>>>>
>>>>>> Do you think I should keep going in this direction? Now that CRUSH can use multiple weights[4] we have a convenient way to use these optimized values.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> [1] http://libcrush.org/main/python-crush/merge_requests/40/diffs#614384bdef0ae975388b03cf89fc7226aa7d2566_58_180
>>>>>> [2] https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html
>>>>>> [3] https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead
>>>>>> [4] https://github.com/ceph/ceph/pull/14486
>>>>>>
>>>>>> On 03/23/2017 04:32 PM, Pedro López-Adeva wrote:
>>>>>>> There are a lot of gradient-free methods. I will first try to run the
>>>>>>> ones available using just scipy
>>>>>>> (https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.html).
>>>>>>> Some of them don't require the gradient and some of them can estimate
>>>>>>> it. The reason to go without the gradient is to run the CRUSH
>>>>>>> algorithm as a black box. In that case this would be the pseudo-code:
>>>>>>>
>>>>>>> - BEGIN CODE -
>>>>>>> import scipy.optimize
>>>>>>>
>>>>>>> def build_target(desired_freqs):
>>>>>>>     def target(weights):
>>>>>>>         # run a simulation of CRUSH for a number of objects
>>>>>>>         sim_freqs = run_crush(weights)
>>>>>>>         # Kullback-Leibler divergence between the desired frequencies
>>>>>>>         # and the simulated ones
>>>>>>>         return loss(sim_freqs, desired_freqs)
>>>>>>>     return target
>>>>>>>
>>>>>>> # initial_weights: starting point for the search (e.g. current weights)
>>>>>>> result = scipy.optimize.minimize(build_target(desired_freqs),
>>>>>>>                                  initial_weights)
>>>>>>> weights = result.x
>>>>>>> - END CODE -
>>>>>>>
>>>>>>> The tricky thing here is that this procedure can be slow if the
>>>>>>> simulation (run_crush) needs to place a lot of objects to get accurate
>>>>>>> simulated frequencies. This is true especially if the minimize method
>>>>>>> attempts to approximate the gradient using finite differences, since it
>>>>>>> will evaluate the target function a number of times proportional to
>>>>>>> the number of weights. Apart from the ones in scipy, I would also try
>>>>>>> optimization methods that perform as few evaluations as possible, for
>>>>>>> example HyperOpt (http://hyperopt.github.io/hyperopt/), which by the
>>>>>>> way takes into account that the target function can be noisy.
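>>>>>>>
>>>>>>> A rough sketch of what driving the same black box with HyperOpt could
>>>>>>> look like (untested against a real crush simulation; the objective
>>>>>>> below is only a noisy toy stand-in for
>>>>>>> loss(run_crush(weights), desired_freqs)):
>>>>>>>
>>>>>>> - BEGIN CODE -
>>>>>>> import numpy as np
>>>>>>> from hyperopt import fmin, tpe, hp
>>>>>>>
>>>>>>> desired = np.array([10., 8., 6., 10., 8., 6., 10., 8., 6.])
>>>>>>> desired = desired / desired.sum()
>>>>>>>
>>>>>>> def objective(weights):
>>>>>>>     # toy stand-in for run_crush: noisy simulated frequencies
>>>>>>>     w = np.clip(np.array(weights), 1e-9, None)
>>>>>>>     sim = np.random.multinomial(20000, w / w.sum()) / 20000.0
>>>>>>>     # KL divergence as the loss
>>>>>>>     return float((desired * np.log(desired /
>>>>>>>                                    np.maximum(sim, 1e-9))).sum())
>>>>>>>
>>>>>>> space = [hp.uniform('w%d' % i, 0.1, 20.0) for i in range(len(desired))]
>>>>>>> best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=200)
>>>>>>> print(best)
>>>>>>> - END CODE -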
>>>>>>>
>>>>>>> This black box approximation is simple to implement and makes the
>>>>>>> computer do all the work instead of us.
>>>>>>> I think this black box approximation is worth trying even if it's not
>>>>>>> the final one, because if it works then we know that a more elaborate
>>>>>>> approach that computes the gradient of the CRUSH algorithm will work
>>>>>>> for sure.
>>>>>>>
>>>>>>> I can try this black box approximation this weekend, not on the real
>>>>>>> CRUSH algorithm but with the simple implementation I did in Python. If
>>>>>>> it works, it's just a matter of swapping one simulation for the other
>>>>>>> and seeing what happens.
>>>>>>>
>>>>>>> 2017-03-23 15:13 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>> Hi Pedro,
>>>>>>>>
>>>>>>>> On 03/23/2017 12:49 PM, Pedro López-Adeva wrote:
>>>>>>>>> Hi Loic,
>>>>>>>>>
>>>>>>>>> From what I see everything seems OK.
>>>>>>>>
>>>>>>>> Cool. I'll keep going in this direction then!
>>>>>>>>
>>>>>>>>> The interesting thing would be to
>>>>>>>>> test on some complex mapping. The reason is that "CrushPolicyFamily"
>>>>>>>>> is right now modeling just a single straw bucket not the full CRUSH
>>>>>>>>> algorithm.
>>>>>>>>
>>>>>>>> A number of use cases use a single straw bucket, maybe the majority of them. Even though it does not reflect the full range of what crush can offer, it could be useful. To be more specific, a crush map that states "place objects so that there is at most one replica per host" or "one replica per rack" is common. Such a crushmap can be reduced to a single straw bucket that contains all the hosts and by using the CrushPolicyFamily, we can change the weights of each host to fix the probabilities. The hosts themselves contain disks with varying weights but I think we can ignore that because crush will only recurse to place one object within a given host.
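>>>>>>>>
>>>>>>>> To make "a single straw bucket that contains all the hosts" concrete,
>>>>>>>> here is an illustrative sketch of a straw2-like draw (not libcrush's
>>>>>>>> actual code or hash; it only shows that changing a host's weight
>>>>>>>> directly changes its probability of being selected):
>>>>>>>>
>>>>>>>> - BEGIN CODE -
>>>>>>>> import hashlib, math
>>>>>>>>
>>>>>>>> def straw2_choose(x, r, hosts):
>>>>>>>>     # hosts: list of (name, weight); x: object id; r: replica rank
>>>>>>>>     best, best_straw = None, None
>>>>>>>>     for name, weight in hosts:
>>>>>>>>         h = hashlib.md5(("%s-%s-%s" % (x, r, name)).encode()).hexdigest()
>>>>>>>>         u = (int(h, 16) % 2**32 + 1) / float(2**32 + 1)  # u in (0, 1)
>>>>>>>>         straw = math.log(u) / weight
>>>>>>>>         if best_straw is None or straw > best_straw:
>>>>>>>>             best, best_straw = name, straw
>>>>>>>>     return best
>>>>>>>>
>>>>>>>> hosts = [("host%d" % i, 24.0) for i in range(8)]
>>>>>>>> picks = [straw2_choose(x, 0, hosts) for x in range(10000)]
>>>>>>>> print({name: picks.count(name) for name, _ in hosts})
>>>>>>>> - END CODE -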
>>>>>>>>
>>>>>>>>> That's the work that remains to be done. The only way to avoid
>>>>>>>>> reimplementing the CRUSH algorithm and computing its gradient would
>>>>>>>>> be to treat CRUSH as a black box and eliminate the need for the
>>>>>>>>> gradient, either by using a gradient-free optimization method or by
>>>>>>>>> estimating it.
>>>>>>>>
>>>>>>>> By gradient-free optimization you mean simulated annealing or Monte Carlo?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2017-03-20 11:49 GMT+01:00 Loic Dachary <loic@dachary.org>:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I modified the crush library to accept two weights (one for the first disk, the other for the remaining disks)[1]. This really is a hack for experimentation purposes only ;-) I was able to run a variation of your code[2] and got the following results, which are encouraging (a small illustrative sketch follows the results below). Do you think what I did is sensible? Or is there a problem I don't see?
>>>>>>>>>>
>>>>>>>>>> Thanks !
>>>>>>>>>>
>>>>>>>>>> Simulation: R=2 devices capacity [10  8  6 10  8  6 10  8  6]
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>>  disk 0: 1.39e-01 1.12e-01
>>>>>>>>>>  disk 1: 1.11e-01 1.10e-01
>>>>>>>>>>  disk 2: 8.33e-02 1.13e-01
>>>>>>>>>>  disk 3: 1.39e-01 1.11e-01
>>>>>>>>>>  disk 4: 1.11e-01 1.11e-01
>>>>>>>>>>  disk 5: 8.33e-02 1.11e-01
>>>>>>>>>>  disk 6: 1.39e-01 1.12e-01
>>>>>>>>>>  disk 7: 1.11e-01 1.12e-01
>>>>>>>>>>  disk 8: 8.33e-02 1.10e-01
>>>>>>>>>> it=    1 jac norm=1.59e-01 loss=5.27e-03
>>>>>>>>>> it=    2 jac norm=1.55e-01 loss=5.03e-03
>>>>>>>>>> ...
>>>>>>>>>> it=  212 jac norm=1.02e-03 loss=2.41e-07
>>>>>>>>>> it=  213 jac norm=1.00e-03 loss=2.31e-07
>>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>>  disk 0: 1.39e-01 1.42e-01
>>>>>>>>>>  disk 1: 1.11e-01 1.09e-01
>>>>>>>>>>  disk 2: 8.33e-02 8.37e-02
>>>>>>>>>>  disk 3: 1.39e-01 1.40e-01
>>>>>>>>>>  disk 4: 1.11e-01 1.13e-01
>>>>>>>>>>  disk 5: 8.33e-02 8.08e-02
>>>>>>>>>>  disk 6: 1.39e-01 1.38e-01
>>>>>>>>>>  disk 7: 1.11e-01 1.09e-01
>>>>>>>>>>  disk 8: 8.33e-02 8.48e-02
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Simulation: R=2 devices capacity [10 10 10 10  1]
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>> Before: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>>  disk 0: 2.44e-01 2.36e-01
>>>>>>>>>>  disk 1: 2.44e-01 2.38e-01
>>>>>>>>>>  disk 2: 2.44e-01 2.34e-01
>>>>>>>>>>  disk 3: 2.44e-01 2.38e-01
>>>>>>>>>>  disk 4: 2.44e-02 5.37e-02
>>>>>>>>>> it=    1 jac norm=2.43e-01 loss=2.98e-03
>>>>>>>>>> it=    2 jac norm=2.28e-01 loss=2.47e-03
>>>>>>>>>> ...
>>>>>>>>>> it=   37 jac norm=1.28e-03 loss=3.48e-08
>>>>>>>>>> it=   38 jac norm=1.07e-03 loss=2.42e-08
>>>>>>>>>> Converged to desired accuracy :)
>>>>>>>>>> After: All replicas on each hard drive
>>>>>>>>>> Expected vs actual use (20000 samples)
>>>>>>>>>>  disk 0: 2.44e-01 2.46e-01
>>>>>>>>>>  disk 1: 2.44e-01 2.44e-01
>>>>>>>>>>  disk 2: 2.44e-01 2.41e-01
>>>>>>>>>>  disk 3: 2.44e-01 2.45e-01
>>>>>>>>>>  disk 4: 2.44e-02 2.33e-02
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] crush hack http://libcrush.org/main/libcrush/commit/6efda297694392d0b07845eb98464a0dcd56fee8
>>>>>>>>>> [2] python-crush hack http://libcrush.org/dachary/python-crush/commit/d9202fcd4d17cd2a82b37ec20c1bd25f8f2c4b68
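>>>>>>>>>>
>>>>>>>>>> For what it's worth, here is a rough, self-contained sketch of what
>>>>>>>>>> the two-weight idea means for R=2 (illustrative only, not the
>>>>>>>>>> libcrush patch in [1]): replica 0 is drawn with one weight vector,
>>>>>>>>>> replica 1 with a second one, and duplicates are rejected.
>>>>>>>>>>
>>>>>>>>>> - BEGIN CODE -
>>>>>>>>>> import numpy as np
>>>>>>>>>>
>>>>>>>>>> def simulate(w_first, w_rest, n_samples=20000, seed=0):
>>>>>>>>>>     rng = np.random.RandomState(seed)
>>>>>>>>>>     w_first = np.asarray(w_first, dtype=float)
>>>>>>>>>>     w_rest = np.asarray(w_rest, dtype=float)
>>>>>>>>>>     counts = np.zeros(len(w_first))
>>>>>>>>>>     for _ in range(n_samples):
>>>>>>>>>>         first = rng.choice(len(w_first), p=w_first / w_first.sum())
>>>>>>>>>>         p = w_rest.copy()
>>>>>>>>>>         p[first] = 0.0            # no two replicas on one device
>>>>>>>>>>         second = rng.choice(len(w_rest), p=p / p.sum())
>>>>>>>>>>         counts[first] += 1
>>>>>>>>>>         counts[second] += 1
>>>>>>>>>>     return counts / counts.sum()
>>>>>>>>>>
>>>>>>>>>> capacity = np.array([10., 10., 10., 10., 1.])
>>>>>>>>>> # same weights for both replicas: the small device ends up overused
>>>>>>>>>> print(simulate(capacity, capacity))
>>>>>>>>>> - END CODE -
>>>>>>>>>>
>>>>>>>>>> Tuning w_rest while leaving w_first at the raw capacities is the
>>>>>>>>>> kind of adjustment this hack makes possible.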
>>>>>>>>>>
>>>>>>>>>> On 03/19/2017 11:31 PM, Loic Dachary wrote:
>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>
>>>>>>>>>>> It looks like trying to experiment with crush won't work as expected because crush does not distinguish the probability of selecting the first device from the probability of selecting the second or third device. Am I mistaken?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On 03/18/2017 10:21 AM, Loic Dachary wrote:
>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm going to experiment with what you did at
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/plafl/notebooks/blob/master/replication.ipynb
>>>>>>>>>>>>
>>>>>>>>>>>> and the latest python-crush published today. A comparison function was added that will help measure the data movement. I'm hoping we can release an offline tool based on your solution. Please let me know if I should wait before diving into this, in case you have unpublished drafts or new ideas.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/09/2017 09:47 AM, Pedro López-Adeva wrote:
>>>>>>>>>>>>> Great, thanks for the clarifications.
>>>>>>>>>>>>> I also think that the most natural way is to keep just a set of
>>>>>>>>>>>>> weights in the CRUSH map and update them inside the algorithm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I keep working on it.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2017-03-08 0:06 GMT+01:00 Sage Weil <sage@newdream.net>:
>>>>>>>>>>>>>> Hi Pedro,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for taking a look at this!  It's a frustrating problem and we
>>>>>>>>>>>>>> haven't made much headway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, 2 Mar 2017, Pedro López-Adeva wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I will have a look. BTW, I have not progressed that much but I have
>>>>>>>>>>>>>>> been thinking about it. In order to adapt the previous algorithm in
>>>>>>>>>>>>>>> the python notebook I need to replace the iteration over all
>>>>>>>>>>>>>>> possible device permutations with iteration over all the possible
>>>>>>>>>>>>>>> selections that crush would make. That is the main thing I need to
>>>>>>>>>>>>>>> work on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other thing is of course that the weights change for each replica.
>>>>>>>>>>>>>>> That is, they cannot really be fixed in the crush map. So the
>>>>>>>>>>>>>>> algorithm inside libcrush, not only the weights in the map, needs to
>>>>>>>>>>>>>>> be changed. The weights in the crush map should then, maybe, reflect
>>>>>>>>>>>>>>> the desired usage frequencies. Or maybe each replica should have its
>>>>>>>>>>>>>>> own crush map, but then the information about the previous selection
>>>>>>>>>>>>>>> should be passed to the next replica placement run so that it avoids
>>>>>>>>>>>>>>> selecting the same one again.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My suspicion is that the best solution here (whatever that means!)
>>>>>>>>>>>>>> leaves the CRUSH weights intact with the desired distribution, and
>>>>>>>>>>>>>> then generates a set of derivative weights--probably one set for each
>>>>>>>>>>>>>> round/replica/rank.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One nice property of this is that once the support is added to encode
>>>>>>>>>>>>>> multiple sets of weights, the algorithm used to generate them is free to
>>>>>>>>>>>>>> change and evolve independently.  (In most cases any change in
>>>>>>>>>>>>>> CRUSH's mapping behavior is difficult to roll out because all
>>>>>>>>>>>>>> parties participating in the cluster have to support any new behavior
>>>>>>>>>>>>>> before it is enabled or used.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a question also. Is there any significant difference between
>>>>>>>>>>>>>>> the device selection algorithm description in the paper and its final
>>>>>>>>>>>>>>> implementation?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The main difference is that the "retry_bucket" behavior was found to be a bad
>>>>>>>>>>>>>> idea; any collision or failed()/overload() case triggers the
>>>>>>>>>>>>>> retry_descent.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are other changes, of course, but I don't think they'll impact any
>>>>>>>>>>>>>> solution we come up with here (or at least any solution can be suitably
>>>>>>>>>>>>>> adapted)!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2017-04-27 22:14 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-26  3:05 crush multipick anomaly Sage Weil
2017-01-26 11:13 ` Loic Dachary
2017-01-26 11:51   ` kefu chai
2017-02-03 14:37   ` Loic Dachary
2017-02-03 14:47     ` Sage Weil
2017-02-03 15:08       ` Loic Dachary
2017-02-03 18:54         ` Loic Dachary
2017-02-06  3:08           ` Jaze Lee
2017-02-06  8:18             ` Loic Dachary
2017-02-06 14:11               ` Jaze Lee
2017-02-06 17:07                 ` Loic Dachary
2017-02-03 15:26       ` Dan van der Ster
2017-02-03 17:37         ` Dan van der Ster
2017-02-06  8:31           ` Loic Dachary
2017-02-06  9:13             ` Dan van der Ster
2017-02-06 16:53               ` Dan van der Ster
2017-02-13 10:36 ` Loic Dachary
2017-02-13 14:21   ` Sage Weil
2017-02-13 18:50     ` Loic Dachary
2017-02-13 19:16       ` Sage Weil
2017-02-13 20:18         ` Loic Dachary
2017-02-13 21:01           ` Loic Dachary
2017-02-13 21:15             ` Sage Weil
2017-02-13 21:19               ` Gregory Farnum
2017-02-13 21:26                 ` Sage Weil
2017-02-13 21:43               ` Loic Dachary
2017-02-16 22:04     ` Pedro López-Adeva
2017-02-22  7:52       ` Loic Dachary
2017-02-22 11:26         ` Pedro López-Adeva
2017-02-22 11:38           ` Loic Dachary
2017-02-22 11:46             ` Pedro López-Adeva
2017-02-25  0:38               ` Loic Dachary
2017-02-25  8:41                 ` Pedro López-Adeva
2017-02-25  9:02                   ` Loic Dachary
2017-03-02  9:43                     ` Loic Dachary
2017-03-02  9:58                       ` Pedro López-Adeva
2017-03-02 10:31                         ` Loic Dachary
2017-03-07 23:06                         ` Sage Weil
2017-03-09  8:47                           ` Pedro López-Adeva
2017-03-18  9:21                             ` Loic Dachary
2017-03-19 22:31                               ` Loic Dachary
2017-03-20 10:49                                 ` Loic Dachary
2017-03-23 11:49                                   ` Pedro López-Adeva
2017-03-23 14:13                                     ` Loic Dachary
2017-03-23 15:32                                       ` Pedro López-Adeva
2017-03-23 16:18                                         ` Loic Dachary
2017-03-25 18:42                                         ` Sage Weil
     [not found]                                           ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
2017-03-27  2:33                                             ` Sage Weil
2017-03-27  6:45                                               ` Loic Dachary
     [not found]                                                 ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
2017-03-27  9:27                                                   ` Adam Kupczyk
2017-03-27 10:29                                                     ` Loic Dachary
2017-03-27 10:37                                                     ` Pedro López-Adeva
2017-03-27 13:39                                                     ` Sage Weil
2017-03-28  6:52                                                       ` Adam Kupczyk
2017-03-28  9:49                                                         ` Spandan Kumar Sahu
2017-03-28 13:35                                                         ` Sage Weil
2017-03-27 13:24                                                 ` Sage Weil
2017-04-11 15:22                                         ` Loic Dachary
2017-04-22 16:51                                         ` Loic Dachary
2017-04-25 15:04                                           ` Pedro López-Adeva
2017-04-25 17:46                                             ` Loic Dachary
2017-04-26 21:08                                             ` Loic Dachary
2017-04-26 22:25                                               ` Loic Dachary
2017-04-27  6:12                                                 ` Loic Dachary
2017-04-27 16:47                                                   ` Loic Dachary
2017-04-27 22:14                                                     ` Loic Dachary
2017-02-13 14:53   ` Gregory Farnum
2017-02-20  8:47     ` Loic Dachary
2017-02-20 17:32       ` Gregory Farnum
2017-02-20 19:31         ` Loic Dachary
