* Proposal for a CRUSH collision fallback
@ 2017-05-07 18:45 Loic Dachary
  2017-05-07 21:34 ` Loic Dachary
  2017-05-08  3:09 ` Sage Weil
  0 siblings, 2 replies; 8+ messages in thread
From: Loic Dachary @ 2017-05-07 18:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Hi Sage,

When choosing the second replica, crush_bucket_choose picks an item and crush_choose_{indep,firstn} checks whether it has already been chosen for the first replica. If so, it is discarded as a collision[1], r is modified and another attempt is made to get a different item. If no value is found after choose_tries attempts, the mapping is incomplete.
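
In heavily simplified form, the current logic looks roughly like the
following. This is a self-contained toy, not the real mapper.c code:
toy_choose stands in for straw2 and all of the names are illustrative.

#include <stdio.h>

/* Stand-in for bucket_straw2_choose(): pick one of n items with probability
 * roughly proportional to weight[i], pseudo-randomly seeded by (x, r). */
static int toy_choose(const int *weight, int n, int x, int r)
{
        unsigned int h = (unsigned int)x * 2654435761u ^ (unsigned int)r * 40503u;
        int total = 0, i;
        for (i = 0; i < n; i++)
                total += weight[i];
        h %= (unsigned int)total;
        for (i = 0; i < n; i++) {
                if (h < (unsigned int)weight[i])
                        return i;
                h -= (unsigned int)weight[i];
        }
        return n - 1;
}

/* Mimics the collision handling of crush_choose_{firstn,indep}: on a
 * collision with an already chosen item, perturb r and retry, giving up
 * after choose_tries attempts. */
static int choose_replica(const int *weight, int n, int x, int rep,
                          const int *out, int outpos, int choose_tries)
{
        int ftotal, i;
        for (ftotal = 0; ftotal < choose_tries; ftotal++) {
                int item = toy_choose(weight, n, x, rep + ftotal);
                int collide = 0;
                for (i = 0; i < outpos; i++)
                        if (out[i] == item)
                                collide = 1;
                if (!collide)
                        return item;
        }
        return -1;      /* no value found: the mapping is incomplete */
}

int main(void)
{
        int weight[] = { 890, 1, 1, 1, 1 };     /* roughly 8.9 vs 0.01 */
        int out[2], x, incomplete = 0;
        for (x = 0; x < 100000; x++) {
                out[0] = choose_replica(weight, 5, x, 0, out, 0, 50);
                out[1] = choose_replica(weight, 5, x, 1, out, 1, 50);
                if (out[1] < 0)
                        incomplete++;
        }
        printf("incomplete mappings with choose_tries=50: %d\n", incomplete);
        return 0;
}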

Another way to achieve the same result would be for crush_bucket_choose / bucket_straw2_choose[2] to ignore the items that have already been chosen. It could be done by looping over the list of chosen items, but that would mean N (number of replicas) lookups for each item.

The current implementation is optimized for the cases where collisions are rare. However, when the weights of the items are two orders of magnitude apart or more, choose_tries has to be set to a very large value (more than 1000) for the mapping to succeed. In practice that is not a problem, as it is unlikely that a host is 100 times bigger than another host ;-)

When fixing an uneven distribution[3], CRUSH weights are sometimes set to values with that kind of difference (1 versus 100) to compensate for the probability bias and/or a small number of samples. For instance, when there are 5 hosts with weights 5 1 1 1 1, modifying the weight set fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 2000. The value of choose_tries has to be increased a lot for a gain that gets smaller and smaller, and the CPU usage goes up. In this situation, an alternative implementation of CRUSH collision handling seems like a good idea.
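
A rough estimate, treating each retry as an independent draw proportional
to the weights (which is only approximately true), shows why choose_tries
explodes in that case:

  P(a retry lands on the 8.9 item)  ~= 8.9 / 8.94    ~= 0.9955
  P(1000 retries all collide)       ~= 0.9955 ^ 1000 ~= 1.1%
  P(2000 retries all collide)       ~= 0.9955 ^ 2000 ~= 0.013%

So whenever the first replica lands on the big item, a small but non-zero
fraction of the values still cannot be mapped, even with choose_tries 2000.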

Instead of failing after choose_tries attempts, crush_bucket_choose could be called with the list of already chosen items and loop over them to ensure none of them are candidates. The result will always be correct but the implementation is more expensive. The default choose_tries could even be set to a lower value (19 instead of 50?) since handling 30 more collisions is likely more expensive than looping over the items already selected.

What do you think ?

Cheers

P.S. Note that even in this border case, modifying the weights to 7.1, 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (a twofold improvement instead of a tenfold one). It just cannot do better because it hits a limitation of the current CRUSH implementation, and that limitation does not look too difficult to fix.


[1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
[2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
[3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-07 18:45 Proposal for a CRUSH collision fallback Loic Dachary
@ 2017-05-07 21:34 ` Loic Dachary
  2017-05-08  3:09 ` Sage Weil
  1 sibling, 0 replies; 8+ messages in thread
From: Loic Dachary @ 2017-05-07 21:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

FWIW the example provided (5 1 1 1 1) cannot lead to an even distribution for two replicas, with or without the proposed patch, because the item with weight 5 gets 56% of the values for the first replica. As a consequence, at most 44% of the values can place their second replica on that item. The 9% difference remains.
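
In numbers, assuming the two replicas of a value must land on two distinct
items:

  ideal share of the weight-5 item : 5 / 9 ~= 56% of all replicas
  maximum achievable share         : one of the two replicas per value = 50%

so part of the imbalance is unavoidable no matter how collisions are handled.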

A draft of the proposed implementation is below; it allows the optimization to get very close to the optimal distribution.

On 05/07/2017 08:45 PM, Loic Dachary wrote:
> Hi Sage,
> 
> When choosing the second replica, crush_bucket_choose picks an item and crush_choose_{indep,firstn} checks if it has already been chosen for the first replica. If so, it is discarded as a collision[1], r is modified and another attempt is made to get a different item. If no value is found after choose_tries attempt, the mapping is incomplete.
> 
> Another way to do the same would be that crush_bucket_choose / bucket_straw2_choose[2] ignores the items that have already been chosen. It could be done by looping over the list but that would mean N (number of replicas) lookups for each item.
> 
> The current implementation is optimized for the cases where collisions are rare. However, when the weights of the items are two order of magnitude appart or more, choose_tries has to be set to a very large value (more than 1000) for the mapping to succeed. In practice that is not a problem as it is unlikely that a host is 100 times bigger than another host ;-)
> 
> When fixing an uneven distribution[3], CRUSH weights are sometime set to values with that kind of difference (1 against 100) to compensate for the probability bias and/or a small number of samples. For instance when there are 5 hosts with weights 5 1 1 1 1, modifying the weight set fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 2000. The value of choose_tries has to be increased a lot for a gain that is smaller and smaller and the CPU usage goes up. In this situation, an alternative implementation of the CRUSH collision seems a good idea.
> 
> Instead of failing after choose_tries attempts, crush_bucket_choose could be called with the list of already chosen items and loop over them to ensure none of them are candidate. The result will always be correct but the implementation more expensive. The default choose_tries could even be set to a lower value (19 instead of 50 ?) since it is likely more expensive to handle 30 more collisions rather than looping over each item already selected.
> 
> What do you think ?
> 
> Cheers
> 
> P.S. Note that even in this border case modifying the weights to 7.1, 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better instead of ten times better). Only it cannot do better because it hits a limitation of the current CRUSH implementation. But it looks like it is not too difficult to fix.
> 
> 
> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> 

diff --git a/crush/libcrush/crush/mapper.c b/crush/libcrush/crush/mapper.c
index c71b614..7d422dc 100644
--- a/crush/libcrush/crush/mapper.c
+++ b/crush/libcrush/crush/mapper.c
@@ -322,7 +322,7 @@ static inline int *get_choose_arg_ids(const struct crush_bucket_straw2 *bucket,
 
 static int bucket_straw2_choose(const struct crush_bucket_straw2 *bucket,
 				int x, int r, const struct crush_choose_arg *arg,
-                                int position)
+                                int position, int *out)
 {
 	unsigned int i, high = 0;
 	unsigned int u;
@@ -331,7 +331,14 @@ static int bucket_straw2_choose(const struct crush_bucket_straw2 *bucket,
         int *ids = get_choose_arg_ids(bucket, arg);
 	for (i = 0; i < bucket->h.size; i++) {
                 dprintk("weight 0x%x item %d\n", weights[i], ids[i]);
-		if (weights[i]) {
+                int skip = 0;
+                for (int j = 0; j < position; j++) {
+                  if (bucket->h.items[i] == out[j]) {
+                    skip = 1;
+                    break;
+                  }
+                }
+		if (skip == 0 && weights[i]) {
 			u = crush_hash32_3(bucket->h.hash, x, ids[i], r);
 			u &= 0xffff;
 
@@ -372,7 +379,7 @@ static int crush_bucket_choose(const struct crush_bucket *in,
 			       struct crush_work_bucket *work,
 			       int x, int r,
                                const struct crush_choose_arg *arg,
-                               int position)
+                               int position, int *out)
 {
 	dprintk(" crush_bucket_choose %d x=%d r=%d\n", in->id, x, r);
 	BUG_ON(in->size == 0);
@@ -394,7 +401,7 @@ static int crush_bucket_choose(const struct crush_bucket *in,
 	case CRUSH_BUCKET_STRAW2:
 		return bucket_straw2_choose(
 			(const struct crush_bucket_straw2 *)in,
-			x, r, arg, position);
+			x, r, arg, position, out);
 	default:
 		dprintk("unknown bucket %d alg %d\n", in->id, in->alg);
 		return in->items[0];
@@ -511,7 +518,7 @@ parent_r %d stable %d\n",
 						in, work->work[-1-in->id],
 						x, r,
                                                 (choose_args ? &choose_args[-1-in->id] : 0),
-                                                outpos);
+                                                outpos, out);
 				if (item >= map->max_devices) {
 					dprintk("   bad item %d\n", item);
 					skip_rep = 1;
@@ -721,7 +728,7 @@ static void crush_choose_indep(const struct crush_map *map,
 					in, work->work[-1-in->id],
 					x, r,
                                         (choose_args ? &choose_args[-1-in->id] : 0),
-                                        outpos);
+                                        outpos, out);
 				if (item >= map->max_devices) {
 					dprintk("   bad item %d\n", item);
 					out[rep] = CRUSH_ITEM_NONE;


-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-07 18:45 Proposal for a CRUSH collision fallback Loic Dachary
  2017-05-07 21:34 ` Loic Dachary
@ 2017-05-08  3:09 ` Sage Weil
  2017-05-08  6:51   ` Loic Dachary
  1 sibling, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-05-08  3:09 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development


On Sun, 7 May 2017, Loic Dachary wrote:
> Hi Sage,
> 
> When choosing the second replica, crush_bucket_choose picks an item and 
> crush_choose_{indep,firstn} checks if it has already been chosen for the 
> first replica. If so, it is discarded as a collision[1], r is modified 
> and another attempt is made to get a different item. If no value is 
> found after choose_tries attempt, the mapping is incomplete.
> 
> Another way to do the same would be that crush_bucket_choose / 
> bucket_straw2_choose[2] ignores the items that have already been chosen. 
> It could be done by looping over the list but that would mean N (number 
> of replicas) lookups for each item.
> 
> The current implementation is optimized for the cases where collisions 
> are rare. However, when the weights of the items are two order of 
> magnitude appart or more, choose_tries has to be set to a very large 
> value (more than 1000) for the mapping to succeed. In practice that is 
> not a problem as it is unlikely that a host is 100 times bigger than 
> another host ;-)
> 
> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
> values with that kind of difference (1 against 100) to compensate for 
> the probability bias and/or a small number of samples. For instance when 
> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
> 2000. The value of choose_tries has to be increased a lot for a gain 
> that is smaller and smaller and the CPU usage goes up. In this 
> situation, an alternative implementation of the CRUSH collision seems a 
> good idea.
> 
> Instead of failing after choose_tries attempts, crush_bucket_choose 
> could be called with the list of already chosen items and loop over them 
> to ensure none of them are candidate. The result will always be correct 
> but the implementation more expensive. The default choose_tries could 
> even be set to a lower value (19 instead of 50 ?) since it is likely 
> more expensive to handle 30 more collisions rather than looping over 
> each item already selected.
> 
> What do you think ?

I think this direction is promising!  The problem is that I think it's not 
quite as simple as you suggest, since you may be choosing over multiple 
levels of a hierarchy.  If the weight tree is something like

      4
    /   \
   2     2
  / \   / \
 1   1 1   1
 a   b c   d

and you chose a, then yes, if you get back into the left branch you can 
filter it out of the straw2 selections.  And num_rep is usually small so 
that won't be expensive.  But you also need the first choice at the top 
level of the hierarchy to weight the left *tree* with 1 instead of 2.

I think this could be done by adjusting the hierarchical weights as you go 
(and I think one of Adam's early passes at the multipick problem did 
something similar), but it's a bit more complex.
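
Something along these lines, as a rough sketch only (the parent array and
the helper name are made up for illustration, not existing mapper.c
structures):

/* After 'leaf' has been chosen for one replica, subtract its weight from
 * every ancestor bucket in a scratch copy of the weights, so that the next
 * descent draws straw2 over the *remaining* capacity: in the tree above,
 * choosing 'a' drops the left subtree from 2 to 1. */
static void discount_chosen_leaf(int *weight, const int *parent,
                                 int leaf, int leaf_weight)
{
        int node = parent[leaf];
        while (node != -1) {            /* -1 marks the parent of the root */
                weight[node] -= leaf_weight;
                if (weight[node] < 0)
                        weight[node] = 0;
                node = parent[node];
        }
}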

It seems worth pursuing, though!

And dynamically doing this only after the first N 'normal' attempts fail 
seems like a good way to avoid slowing down the common path (no 
collisions).  I suspect the optimal N is probably closer to 5 than 19, 
though?

sage


> 
> Cheers
> 
> P.S. Note that even in this border case modifying the weights to 7.1, 
> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
> instead of ten times better). Only it cannot do better because it hits a 
> limitation of the current CRUSH implementation. But it looks like it is 
> not too difficult to fix.
> 
> 
> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
> 


* Re: Proposal for a CRUSH collision fallback
  2017-05-08  3:09 ` Sage Weil
@ 2017-05-08  6:51   ` Loic Dachary
  2017-05-08 13:42     ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2017-05-08  6:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development



On 05/08/2017 05:09 AM, Sage Weil wrote:
> On Sun, 7 May 2017, Loic Dachary wrote:
>> Hi Sage,
>>
>> When choosing the second replica, crush_bucket_choose picks an item and 
>> crush_choose_{indep,firstn} checks if it has already been chosen for the 
>> first replica. If so, it is discarded as a collision[1], r is modified 
>> and another attempt is made to get a different item. If no value is 
>> found after choose_tries attempt, the mapping is incomplete.
>>
>> Another way to do the same would be that crush_bucket_choose / 
>> bucket_straw2_choose[2] ignores the items that have already been chosen. 
>> It could be done by looping over the list but that would mean N (number 
>> of replicas) lookups for each item.
>>
>> The current implementation is optimized for the cases where collisions 
>> are rare. However, when the weights of the items are two order of 
>> magnitude appart or more, choose_tries has to be set to a very large 
>> value (more than 1000) for the mapping to succeed. In practice that is 
>> not a problem as it is unlikely that a host is 100 times bigger than 
>> another host ;-)
>>
>> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
>> values with that kind of difference (1 against 100) to compensate for 
>> the probability bias and/or a small number of samples. For instance when 
>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
>> 2000. The value of choose_tries has to be increased a lot for a gain 
>> that is smaller and smaller and the CPU usage goes up. In this 
>> situation, an alternative implementation of the CRUSH collision seems a 
>> good idea.
>>
>> Instead of failing after choose_tries attempts, crush_bucket_choose 
>> could be called with the list of already chosen items and loop over them 
>> to ensure none of them are candidate. The result will always be correct 
>> but the implementation more expensive. The default choose_tries could 
>> even be set to a lower value (19 instead of 50 ?) since it is likely 
>> more expensive to handle 30 more collisions rather than looping over 
>> each item already selected.
>>
>> What do you think ?
> 
> I think this direction is promising!  The problem is that I think it's not 
> quite as simple as you suggest, since you may be choosing over multiple 
> levels of a hierarchy.  If the weight tree is something like
> 
>       4
>     /   \
>    2     2
>   / \   / \
>  1   1 1   1
>  a   b c   d
> 
> and you chose a, then yes, if you get back into the left branch you can 
> filter it out of the straw2 selections.  And num_rep is usually small so 
> that won't be expensive.  But you also need the first choice at the top 
> level of the hierarchy to weight the left *tree* with 1 instead of 2.

I don't understand why this is necessary. Here are the scenarios I have in mind:


        /         \
       /           \
  r1  4        r2   4         rack (failure domain)
    /   \         /   \
   2     2       2     2      host
  / \   / \     / \   / \
 1   1 1   1   1   1 1   1    device
 a   b c   d   e   f g   h

Say value 10 ends up in a the first time: it first went through rack
r1, which is the failure domain. If value 10 also ends up in r1 the
second time, straw2 will skip/collide it at that level, because r1 is
stored in out while a is stored in out2.

The only case I can think of that requires the collision to be resolved
at a higher level of the hierarchy is when there is no alternative.



        /         \
       /           \
  r1  4        r2   4         rack
    /             /   \
h1 2             2     2      host     (failure domain)
  /             / \   / \
 1             1   1 1   1    device
 a             e   f g   h

If 10 ends up in h1 the first time and again the second time, it will
collide because there is no alternative within h1. It will then retry
the descent (retry_descent); ftotal increases, which feeds into r, and
it gets another chance to land on a host that is not h1.

I must be missing a use case :-)


> 
> I think this could be done by adjusting the hierarchical weights as you go 
> (and I think one of Adam's early passes at the multipick problem did 
> something similar), but it's a bit more complex.
> 
> It seems worth pursuing, though!
> 
> And dynamically doing this only after the first N 'normal' attempts fail 
> seems like a good way to avoid slowing down the common path (no 
> collisions).  I suspect the optimal N is probably closer to 5 than 19, 
> though?
> 
> sage
> 
> 
>>
>> Cheers
>>
>> P.S. Note that even in this border case modifying the weights to 7.1, 
>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
>> instead of ten times better). Only it cannot do better because it hits a 
>> limitation of the current CRUSH implementation. But it looks like it is 
>> not too difficult to fix.
>>
>>
>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-08  6:51   ` Loic Dachary
@ 2017-05-08 13:42     ` Sage Weil
  2017-05-08 14:45       ` Loic Dachary
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-05-08 13:42 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development


On Mon, 8 May 2017, Loic Dachary wrote:
> On 05/08/2017 05:09 AM, Sage Weil wrote:
> > On Sun, 7 May 2017, Loic Dachary wrote:
> >> Hi Sage,
> >>
> >> When choosing the second replica, crush_bucket_choose picks an item and 
> >> crush_choose_{indep,firstn} checks if it has already been chosen for the 
> >> first replica. If so, it is discarded as a collision[1], r is modified 
> >> and another attempt is made to get a different item. If no value is 
> >> found after choose_tries attempt, the mapping is incomplete.
> >>
> >> Another way to do the same would be that crush_bucket_choose / 
> >> bucket_straw2_choose[2] ignores the items that have already been chosen. 
> >> It could be done by looping over the list but that would mean N (number 
> >> of replicas) lookups for each item.
> >>
> >> The current implementation is optimized for the cases where collisions 
> >> are rare. However, when the weights of the items are two order of 
> >> magnitude appart or more, choose_tries has to be set to a very large 
> >> value (more than 1000) for the mapping to succeed. In practice that is 
> >> not a problem as it is unlikely that a host is 100 times bigger than 
> >> another host ;-)
> >>
> >> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
> >> values with that kind of difference (1 against 100) to compensate for 
> >> the probability bias and/or a small number of samples. For instance when 
> >> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
> >> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
> >> 2000. The value of choose_tries has to be increased a lot for a gain 
> >> that is smaller and smaller and the CPU usage goes up. In this 
> >> situation, an alternative implementation of the CRUSH collision seems a 
> >> good idea.
> >>
> >> Instead of failing after choose_tries attempts, crush_bucket_choose 
> >> could be called with the list of already chosen items and loop over them 
> >> to ensure none of them are candidate. The result will always be correct 
> >> but the implementation more expensive. The default choose_tries could 
> >> even be set to a lower value (19 instead of 50 ?) since it is likely 
> >> more expensive to handle 30 more collisions rather than looping over 
> >> each item already selected.
> >>
> >> What do you think ?
> > 
> > I think this direction is promising!  The problem is that I think it's not 
> > quite as simple as you suggest, since you may be choosing over multiple 
> > levels of a hierarchy.  If the weight tree is something like
> > 
> >       4
> >     /   \
> >    2     2
> >   / \   / \
> >  1   1 1   1
> >  a   b c   d
> > 
> > and you chose a, then yes, if you get back into the left branch you can 
> > filter it out of the straw2 selections.  And num_rep is usually small so 
> > that won't be expensive.  But you also need the first choice at the top 
> > level of the hierarchy to weight the left *tree* with 1 instead of 2.
> 
> I don't understand why this is necessary ? Here are the scenarios I have in mind:
> 
> 
>         /         \
>        /           \
>   r1  4        r2   4         rack (failure domain)
>     /   \         /   \
>    2     2       2     2      host
>   / \   / \     / \   / \
>  1   1 1   1   1   1 1   1    device
>  a   b c   d   e   f g   h
> 
> Say value 10 ends up in a the first time, it first went through rack
> r1 which is the failure domain. If value 10 also ends up in r1 the
> second time, straw2 will skip/collide it at that level because r1 is
> stored in out while a is stored in out2.
> 
> There only case I can think of that requires collision to be resolved
> in a higher hierarchical level is when there is no alternative.
> 
> 
> 
>         /         \
>        /           \
>   r1  4        r2   4         rack
>     /             /   \
> h1 2             2     2      host     (failure domain)
>   /             / \   / \
>  1             1   1 1   1    device
>  a             e   f g   h
> 
> If 10 ends up in h1 the first time and the second time, it will
> collide because there is no alternative. It will then retry_descent,
> ftotal increases which goes into r and it gets another chance at landing on a host
> that's not h1.

The problem isn't when choose[leaf] is specifying the intermediate 
level (rack or host in your examples); it's when there is an intervening 
level that a single choose is crossing.  In your first tree,

          root 6
>         /         \
>        /           \
>   r1  4        r2   4         rack
>     /   \         /   \
> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>   / \   / \     / \   / \
>  1   1 1   1   1   1 1   1    device
>  a   b c   d   e   f g   h

let's say the failure domain is the host instead of the rack.  If we
choose a (h1) the first time, then for subsequent descents from the root
we still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
host (h2).  This is normally fine because we reject the entire descent
for 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's
as if the other attempts never happened).  But the smarter straw2 would
choose h2 100% of the time for r1, and you'll end up with 2x more items
in that second position on h2 than you want.
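
In rough numbers: once a (h1) holds the first replica, the usable weight is
2 under r1 (h2 only) and 4 under r2, so the second descent should pick r1
about 2/6 = 1/3 of the time; with the unmodified rack weights it still
picks r1 4/8 = 1/2 of the time, and all of that share now goes to h2.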

Does that make sense?

sage


> 
> I must be missing a use case :-)
> 
> 
> > 
> > I think this could be done by adjusting the hierarchical weights as you go 
> > (and I think one of Adam's early passes at the multipick problem did 
> > something similar), but it's a bit more complex.
> > 
> > It seems worth pursuing, though!
> > 
> > And dynamically doing this only after the first N 'normal' attempts fail 
> > seems like a good way to avoid slowing down the common path (no 
> > collisions).  I suspect the optimal N is probably closer to 5 than 19, 
> > though?
> > 
> > sage
> > 
> > 
> >>
> >> Cheers
> >>
> >> P.S. Note that even in this border case modifying the weights to 7.1, 
> >> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
> >> instead of ten times better). Only it cannot do better because it hits a 
> >> limitation of the current CRUSH implementation. But it looks like it is 
> >> not too difficult to fix.
> >>
> >>
> >> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> >> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> >> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> >> -- 
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: Proposal for a CRUSH collision fallback
  2017-05-08 13:42     ` Sage Weil
@ 2017-05-08 14:45       ` Loic Dachary
  2017-05-18 16:48         ` Ning Yao
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2017-05-08 14:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development



On 05/08/2017 03:42 PM, Sage Weil wrote:
> On Mon, 8 May 2017, Loic Dachary wrote:
>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> When choosing the second replica, crush_bucket_choose picks an item and 
>>>> crush_choose_{indep,firstn} checks if it has already been chosen for the 
>>>> first replica. If so, it is discarded as a collision[1], r is modified 
>>>> and another attempt is made to get a different item. If no value is 
>>>> found after choose_tries attempt, the mapping is incomplete.
>>>>
>>>> Another way to do the same would be that crush_bucket_choose / 
>>>> bucket_straw2_choose[2] ignores the items that have already been chosen. 
>>>> It could be done by looping over the list but that would mean N (number 
>>>> of replicas) lookups for each item.
>>>>
>>>> The current implementation is optimized for the cases where collisions 
>>>> are rare. However, when the weights of the items are two order of 
>>>> magnitude appart or more, choose_tries has to be set to a very large 
>>>> value (more than 1000) for the mapping to succeed. In practice that is 
>>>> not a problem as it is unlikely that a host is 100 times bigger than 
>>>> another host ;-)
>>>>
>>>> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
>>>> values with that kind of difference (1 against 100) to compensate for 
>>>> the probability bias and/or a small number of samples. For instance when 
>>>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
>>>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
>>>> 2000. The value of choose_tries has to be increased a lot for a gain 
>>>> that is smaller and smaller and the CPU usage goes up. In this 
>>>> situation, an alternative implementation of the CRUSH collision seems a 
>>>> good idea.
>>>>
>>>> Instead of failing after choose_tries attempts, crush_bucket_choose 
>>>> could be called with the list of already chosen items and loop over them 
>>>> to ensure none of them are candidate. The result will always be correct 
>>>> but the implementation more expensive. The default choose_tries could 
>>>> even be set to a lower value (19 instead of 50 ?) since it is likely 
>>>> more expensive to handle 30 more collisions rather than looping over 
>>>> each item already selected.
>>>>
>>>> What do you think ?
>>>
>>> I think this direction is promising!  The problem is that I think it's not 
>>> quite as simple as you suggest, since you may be choosing over multiple 
>>> levels of a hierarchy.  If the weight tree is something like
>>>
>>>       4
>>>     /   \
>>>    2     2
>>>   / \   / \
>>>  1   1 1   1
>>>  a   b c   d
>>>
>>> and you chose a, then yes, if you get back into the left branch you can 
>>> filter it out of the straw2 selections.  And num_rep is usually small so 
>>> that won't be expensive.  But you also need the first choice at the top 
>>> level of the hierarchy to weight the left *tree* with 1 instead of 2.
>>
>> I don't understand why this is necessary ? Here are the scenarios I have in mind:
>>
>>
>>         /         \
>>        /           \
>>   r1  4        r2   4         rack (failure domain)
>>     /   \         /   \
>>    2     2       2     2      host
>>   / \   / \     / \   / \
>>  1   1 1   1   1   1 1   1    device
>>  a   b c   d   e   f g   h
>>
>> Say value 10 ends up in a the first time, it first went through rack
>> r1 which is the failure domain. If value 10 also ends up in r1 the
>> second time, straw2 will skip/collide it at that level because r1 is
>> stored in out while a is stored in out2.
>>
>> There only case I can think of that requires collision to be resolved
>> in a higher hierarchical level is when there is no alternative.
>>
>>
>>
>>         /         \
>>        /           \
>>   r1  4        r2   4         rack
>>     /             /   \
>> h1 2             2     2      host     (failure domain)
>>   /             / \   / \
>>  1             1   1 1   1    device
>>  a             e   f g   h
>>
>> If 10 ends up in h1 the first time and the second time, it will
>> collide because there is no alternative. It will then retry_descent,
>> ftotal increases which goes into r and it gets another chance at landing on a host
>> that's not h1.
> 
> The problem isn't when choose[leaf] is specifying the intermediate 
> level (rack or host in your examples); it's when there is an intervening 
> level that a single choose is crossing.  In your first tree,
> 
>           root 6
>>         /         \
>>        /           \
>>   r1  4        r2   4         rack
>>     /   \         /   \
>> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>>   / \   / \     / \   / \
>>  1   1 1   1   1   1 1   1    device
>>  a   b c   d   e   f g   h
> 
> let's say the failure domain is the host instead of the rack.  If we 
> choose a (h1) the first time, for subsequent descents from the root we 
> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable 
> host (h2).  This normally is fine because we reject the entire descent for 
> 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's as if 
> the other attempts never happened).  But with the smarter straw2 choose h2 
> 100% for r1 and you'll end up with 2x more tiems for that second position 
> on h2 than you want.
> 
> Does that make sense?

Absolutely, thanks for explaining. As far as getting an even distribution is concerned, the easier route seems to be to find a way to calculate when a combination of weights (5 1 1 1 1 with 2 replicas, for instance) cannot be evenly distributed.
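
A simple necessary condition, assuming one replica per failure domain: a
domain with weight w_i can hold at most one of the k replicas of each
value, i.e. at most 1/k of all placements, so an even distribution requires

  w_i / (w_1 + ... + w_n) <= 1 / k    for every failure domain i

and 5 1 1 1 1 with k = 2 already violates it, since 5/9 > 1/2.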

Cheers

> 
> sage
> 
> 
>>
>> I must be missing a use case :-)
>>
>>
>>>
>>> I think this could be done by adjusting the hierarchical weights as you go 
>>> (and I think one of Adam's early passes at the multipick problem did 
>>> something similar), but it's a bit more complex.
>>>
>>> It seems worth pursuing, though!
>>>
>>> And dynamically doing this only after the first N 'normal' attempts fail 
>>> seems like a good way to avoid slowing down the common path (no 
>>> collisions).  I suspect the optimal N is probably closer to 5 than 19, 
>>> though?
>>>
>>> sage
>>>
>>>
>>>>
>>>> Cheers
>>>>
>>>> P.S. Note that even in this border case modifying the weights to 7.1, 
>>>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
>>>> instead of ten times better). Only it cannot do better because it hits a 
>>>> limitation of the current CRUSH implementation. But it looks like it is 
>>>> not too difficult to fix.
>>>>
>>>>
>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>> -- 
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-08 14:45       ` Loic Dachary
@ 2017-05-18 16:48         ` Ning Yao
  2017-05-18 21:21           ` Loic Dachary
  0 siblings, 1 reply; 8+ messages in thread
From: Ning Yao @ 2017-05-18 16:48 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, Ceph Development

Hi, Loic

We noticed this problem before and found that it is not easy to
solve. The problem occurs not only in the collision case, but also in
the rejection case, when is_out is called:

static int is_out(const struct crush_map *map,
                  const __u32 *weight, int weight_max,
                  int item, int x)
{
        if (item >= weight_max)
                return 1;
        if (weight[item] >= 0x10000)    /* reweight 1.0: never rejected here */
                return 0;
        if (weight[item] == 0)          /* reweight 0 (out): always rejected */
                return 1;
        /* fractional reweight: accept the item for a matching fraction of the values x */
        if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
            < weight[item])
                return 0;
        return 1;
}

That means the same kind of retry happens whenever we use ceph osd out
[osd.id], when an osd is marked out automatically because it is down,
or when we use reweight-by-utilization to alter the reweight of an osd:
the item can then be rejected by is_out (always when its reweight is 0,
probabilistically when it is fractional), and the rejection is retried
much like a collision.
I think the approach Sage mentioned may be used to solve this problem as well.
Regards
Ning Yao


2017-05-08 22:45 GMT+08:00 Loic Dachary <loic@dachary.org>:
>
>
> On 05/08/2017 03:42 PM, Sage Weil wrote:
>> On Mon, 8 May 2017, Loic Dachary wrote:
>>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> When choosing the second replica, crush_bucket_choose picks an item and
>>>>> crush_choose_{indep,firstn} checks if it has already been chosen for the
>>>>> first replica. If so, it is discarded as a collision[1], r is modified
>>>>> and another attempt is made to get a different item. If no value is
>>>>> found after choose_tries attempt, the mapping is incomplete.
>>>>>
>>>>> Another way to do the same would be that crush_bucket_choose /
>>>>> bucket_straw2_choose[2] ignores the items that have already been chosen.
>>>>> It could be done by looping over the list but that would mean N (number
>>>>> of replicas) lookups for each item.
>>>>>
>>>>> The current implementation is optimized for the cases where collisions
>>>>> are rare. However, when the weights of the items are two order of
>>>>> magnitude appart or more, choose_tries has to be set to a very large
>>>>> value (more than 1000) for the mapping to succeed. In practice that is
>>>>> not a problem as it is unlikely that a host is 100 times bigger than
>>>>> another host ;-)
>>>>>
>>>>> When fixing an uneven distribution[3], CRUSH weights are sometime set to
>>>>> values with that kind of difference (1 against 100) to compensate for
>>>>> the probability bias and/or a small number of samples. For instance when
>>>>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set
>>>>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries
>>>>> 2000. The value of choose_tries has to be increased a lot for a gain
>>>>> that is smaller and smaller and the CPU usage goes up. In this
>>>>> situation, an alternative implementation of the CRUSH collision seems a
>>>>> good idea.
>>>>>
>>>>> Instead of failing after choose_tries attempts, crush_bucket_choose
>>>>> could be called with the list of already chosen items and loop over them
>>>>> to ensure none of them are candidate. The result will always be correct
>>>>> but the implementation more expensive. The default choose_tries could
>>>>> even be set to a lower value (19 instead of 50 ?) since it is likely
>>>>> more expensive to handle 30 more collisions rather than looping over
>>>>> each item already selected.
>>>>>
>>>>> What do you think ?
>>>>
>>>> I think this direction is promising!  The problem is that I think it's not
>>>> quite as simple as you suggest, since you may be choosing over multiple
>>>> levels of a hierarchy.  If the weight tree is something like
>>>>
>>>>       4
>>>>     /   \
>>>>    2     2
>>>>   / \   / \
>>>>  1   1 1   1
>>>>  a   b c   d
>>>>
>>>> and you chose a, then yes, if you get back into the left branch you can
>>>> filter it out of the straw2 selections.  And num_rep is usually small so
>>>> that won't be expensive.  But you also need the first choice at the top
>>>> level of the hierarchy to weight the left *tree* with 1 instead of 2.
>>>
>>> I don't understand why this is necessary ? Here are the scenarios I have in mind:
>>>
>>>
>>>         /         \
>>>        /           \
>>>   r1  4        r2   4         rack (failure domain)
>>>     /   \         /   \
>>>    2     2       2     2      host
>>>   / \   / \     / \   / \
>>>  1   1 1   1   1   1 1   1    device
>>>  a   b c   d   e   f g   h
>>>
>>> Say value 10 ends up in a the first time, it first went through rack
>>> r1 which is the failure domain. If value 10 also ends up in r1 the
>>> second time, straw2 will skip/collide it at that level because r1 is
>>> stored in out while a is stored in out2.
>>>
>>> There only case I can think of that requires collision to be resolved
>>> in a higher hierarchical level is when there is no alternative.
>>>
>>>
>>>
>>>         /         \
>>>        /           \
>>>   r1  4        r2   4         rack
>>>     /             /   \
>>> h1 2             2     2      host     (failure domain)
>>>   /             / \   / \
>>>  1             1   1 1   1    device
>>>  a             e   f g   h
>>>
>>> If 10 ends up in h1 the first time and the second time, it will
>>> collide because there is no alternative. It will then retry_descent,
>>> ftotal increases which goes into r and it gets another chance at landing on a host
>>> that's not h1.
>>
>> The problem isn't when choose[leaf] is specifying the intermediate
>> level (rack or host in your examples); it's when there is an intervening
>> level that a single choose is crossing.  In your first tree,
>>
>>           root 6
>>>         /         \
>>>        /           \
>>>   r1  4        r2   4         rack
>>>     /   \         /   \
>>> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>>>   / \   / \     / \   / \
>>>  1   1 1   1   1   1 1   1    device
>>>  a   b c   d   e   f g   h
>>
>> let's say the failure domain is the host instead of the rack.  If we
>> choose a (h1) the first time, for subsequent descents from the root we
>> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
>> host (h2).  This normally is fine because we reject the entire descent for
>> 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's as if
>> the other attempts never happened).  But with the smarter straw2 choose h2
>> 100% for r1 and you'll end up with 2x more tiems for that second position
>> on h2 than you want.
>>
>> Does that make sense?
>
> Absolutely, thanks for explaining. The easier route as far as getting an even distribution is concerned seems to find a way to calculate when a combination of weights (5 1 1 1 1 with 2 replica for instance) cannot be evenly distributed.
>
> Cheers
>
>>
>> sage
>>
>>
>>>
>>> I must be missing a use case :-)
>>>
>>>
>>>>
>>>> I think this could be done by adjusting the hierarchical weights as you go
>>>> (and I think one of Adam's early passes at the multipick problem did
>>>> something similar), but it's a bit more complex.
>>>>
>>>> It seems worth pursuing, though!
>>>>
>>>> And dynamically doing this only after the first N 'normal' attempts fail
>>>> seems like a good way to avoid slowing down the common path (no
>>>> collisions).  I suspect the optimal N is probably closer to 5 than 19,
>>>> though?
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> P.S. Note that even in this border case modifying the weights to 7.1,
>>>>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better
>>>>> instead of ten times better). Only it cannot do better because it hits a
>>>>> limitation of the current CRUSH implementation. But it looks like it is
>>>>> not too difficult to fix.
>>>>>
>>>>>
>>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Proposal for a CRUSH collision fallback
  2017-05-18 16:48         ` Ning Yao
@ 2017-05-18 21:21           ` Loic Dachary
  0 siblings, 0 replies; 8+ messages in thread
From: Loic Dachary @ 2017-05-18 21:21 UTC (permalink / raw)
  To: Ning Yao; +Cc: Sage Weil, Ceph Development

Hi,

On 05/18/2017 07:48 PM, Ning Yao wrote:
> Hi, Loic
> 
> we already notice this problem before and find that it is not easy to
> solved.  The problem not only occurs in collision case, but also in
> rejection when is_out is called:
> 
> static int is_out(const struct crush_map *map,
>  const __u32 *weight, int weight_max,
>  int item, int x)
> {
> if (item >= weight_max)
> return 1;
> if (weight[item] >= 0x10000)
> return 0;
> if (weight[item] == 0)
> return 1;
> if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
>    < weight[item])
> return 0;
> return 1;
> }
> 
> That means if we use ceph osd out [osd.id], or osd is marked out
> automatically when it is down, or we use reweight-by-utilization to
> alter the reweight for an osd.
> I think it may  use the same way as Sage mentioned to solve this problem.
> Regards
> Ning Yao

You're right, it's the same kind of problem, thanks for mentioning that.

Cheers

> 
> 
> 2017-05-08 22:45 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>
>>
>> On 05/08/2017 03:42 PM, Sage Weil wrote:
>>> On Mon, 8 May 2017, Loic Dachary wrote:
>>>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> When choosing the second replica, crush_bucket_choose picks an item and
>>>>>> crush_choose_{indep,firstn} checks if it has already been chosen for the
>>>>>> first replica. If so, it is discarded as a collision[1], r is modified
>>>>>> and another attempt is made to get a different item. If no value is
>>>>>> found after choose_tries attempt, the mapping is incomplete.
>>>>>>
>>>>>> Another way to do the same would be that crush_bucket_choose /
>>>>>> bucket_straw2_choose[2] ignores the items that have already been chosen.
>>>>>> It could be done by looping over the list but that would mean N (number
>>>>>> of replicas) lookups for each item.
>>>>>>
>>>>>> The current implementation is optimized for the cases where collisions
>>>>>> are rare. However, when the weights of the items are two order of
>>>>>> magnitude appart or more, choose_tries has to be set to a very large
>>>>>> value (more than 1000) for the mapping to succeed. In practice that is
>>>>>> not a problem as it is unlikely that a host is 100 times bigger than
>>>>>> another host ;-)
>>>>>>
>>>>>> When fixing an uneven distribution[3], CRUSH weights are sometime set to
>>>>>> values with that kind of difference (1 against 100) to compensate for
>>>>>> the probability bias and/or a small number of samples. For instance when
>>>>>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set
>>>>>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries
>>>>>> 2000. The value of choose_tries has to be increased a lot for a gain
>>>>>> that is smaller and smaller and the CPU usage goes up. In this
>>>>>> situation, an alternative implementation of the CRUSH collision seems a
>>>>>> good idea.
>>>>>>
>>>>>> Instead of failing after choose_tries attempts, crush_bucket_choose
>>>>>> could be called with the list of already chosen items and loop over them
>>>>>> to ensure none of them are candidate. The result will always be correct
>>>>>> but the implementation more expensive. The default choose_tries could
>>>>>> even be set to a lower value (19 instead of 50 ?) since it is likely
>>>>>> more expensive to handle 30 more collisions rather than looping over
>>>>>> each item already selected.
>>>>>>
>>>>>> What do you think ?
>>>>>
>>>>> I think this direction is promising!  The problem is that I think it's not
>>>>> quite as simple as you suggest, since you may be choosing over multiple
>>>>> levels of a hierarchy.  If the weight tree is something like
>>>>>
>>>>>       4
>>>>>     /   \
>>>>>    2     2
>>>>>   / \   / \
>>>>>  1   1 1   1
>>>>>  a   b c   d
>>>>>
>>>>> and you chose a, then yes, if you get back into the left branch you can
>>>>> filter it out of the straw2 selections.  And num_rep is usually small so
>>>>> that won't be expensive.  But you also need the first choice at the top
>>>>> level of the hierarchy to weight the left *tree* with 1 instead of 2.
>>>>
>>>> I don't understand why this is necessary ? Here are the scenarios I have in mind:
>>>>
>>>>
>>>>         /         \
>>>>        /           \
>>>>   r1  4        r2   4         rack (failure domain)
>>>>     /   \         /   \
>>>>    2     2       2     2      host
>>>>   / \   / \     / \   / \
>>>>  1   1 1   1   1   1 1   1    device
>>>>  a   b c   d   e   f g   h
>>>>
>>>> Say value 10 ends up in a the first time, it first went through rack
>>>> r1 which is the failure domain. If value 10 also ends up in r1 the
>>>> second time, straw2 will skip/collide it at that level because r1 is
>>>> stored in out while a is stored in out2.
>>>>
>>>> There only case I can think of that requires collision to be resolved
>>>> in a higher hierarchical level is when there is no alternative.
>>>>
>>>>
>>>>
>>>>         /         \
>>>>        /           \
>>>>   r1  4        r2   4         rack
>>>>     /             /   \
>>>> h1 2             2     2      host     (failure domain)
>>>>   /             / \   / \
>>>>  1             1   1 1   1    device
>>>>  a             e   f g   h
>>>>
>>>> If 10 ends up in h1 the first time and the second time, it will
>>>> collide because there is no alternative. It will then retry_descent,
>>>> ftotal increases which goes into r and it gets another chance at landing on a host
>>>> that's not h1.
>>>
>>> The problem isn't when choose[leaf] is specifying the intermediate
>>> level (rack or host in your examples); it's when there is an intervening
>>> level that a single choose is crossing.  In your first tree,
>>>
>>>           root 6
>>>>         /         \
>>>>        /           \
>>>>   r1  4        r2   4         rack
>>>>     /   \         /   \
>>>> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>>>>   / \   / \     / \   / \
>>>>  1   1 1   1   1   1 1   1    device
>>>>  a   b c   d   e   f g   h
>>>
>>> let's say the failure domain is the host instead of the rack.  If we
>>> choose a (h1) the first time, for subsequent descents from the root we
>>> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
>>> host (h2).  This normally is fine because we reject the entire descent for
>>> 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's as if
>>> the other attempts never happened).  But with the smarter straw2 choose h2
>>> 100% for r1 and you'll end up with 2x more tiems for that second position
>>> on h2 than you want.
>>>
>>> Does that make sense?
>>
>> Absolutely, thanks for explaining. The easier route as far as getting an even distribution is concerned seems to find a way to calculate when a combination of weights (5 1 1 1 1 with 2 replica for instance) cannot be evenly distributed.
>>
>> Cheers
>>
>>>
>>> sage
>>>
>>>
>>>>
>>>> I must be missing a use case :-)
>>>>
>>>>
>>>>>
>>>>> I think this could be done by adjusting the hierarchical weights as you go
>>>>> (and I think one of Adam's early passes at the multipick problem did
>>>>> something similar), but it's a bit more complex.
>>>>>
>>>>> It seems worth pursuing, though!
>>>>>
>>>>> And dynamically doing this only after the first N 'normal' attempts fail
>>>>> seems like a good way to avoid slowing down the common path (no
>>>>> collisions).  I suspect the optimal N is probably closer to 5 than 19,
>>>>> though?
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> P.S. Note that even in this border case modifying the weights to 7.1,
>>>>>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better
>>>>>> instead of ten times better). Only it cannot do better because it hits a
>>>>>> limitation of the current CRUSH implementation. But it looks like it is
>>>>>> not too difficult to fix.
>>>>>>
>>>>>>
>>>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

