* Proposal for a CRUSH collision fallback
@ 2017-05-07 18:45 Loic Dachary
  2017-05-07 21:34 ` Loic Dachary
  2017-05-08  3:09 ` Sage Weil
  0 siblings, 2 replies; 8+ messages in thread
From: Loic Dachary @ 2017-05-07 18:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Hi Sage,

When choosing the second replica, crush_bucket_choose picks an item and crush_choose_{indep,firstn} checks whether it has already been chosen for the first replica. If so, it is discarded as a collision[1], r is modified and another attempt is made to get a different item. If no value is found after choose_tries attempts, the mapping is incomplete.
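
In heavily simplified form, the current logic looks roughly like the
following. This is a self-contained toy, not the real mapper.c code:
toy_choose stands in for straw2 and all of the names are illustrative.

#include <stdio.h>

/* Stand-in for bucket_straw2_choose(): pick one of n items with probability
 * roughly proportional to weight[i], pseudo-randomly seeded by (x, r). */
static int toy_choose(const int *weight, int n, int x, int r)
{
        unsigned int h = (unsigned int)x * 2654435761u ^ (unsigned int)r * 40503u;
        int total = 0, i;
        for (i = 0; i < n; i++)
                total += weight[i];
        h %= (unsigned int)total;
        for (i = 0; i < n; i++) {
                if (h < (unsigned int)weight[i])
                        return i;
                h -= (unsigned int)weight[i];
        }
        return n - 1;
}

/* Mimics the collision handling of crush_choose_{firstn,indep}: on a
 * collision with an already chosen item, perturb r and retry, giving up
 * after choose_tries attempts. */
static int choose_replica(const int *weight, int n, int x, int rep,
                          const int *out, int outpos, int choose_tries)
{
        int ftotal, i;
        for (ftotal = 0; ftotal < choose_tries; ftotal++) {
                int item = toy_choose(weight, n, x, rep + ftotal);
                int collide = 0;
                for (i = 0; i < outpos; i++)
                        if (out[i] == item)
                                collide = 1;
                if (!collide)
                        return item;
        }
        return -1;      /* no value found: the mapping is incomplete */
}

int main(void)
{
        int weight[] = { 890, 1, 1, 1, 1 };     /* roughly 8.9 vs 0.01 */
        int out[2], x, incomplete = 0;
        for (x = 0; x < 100000; x++) {
                out[0] = choose_replica(weight, 5, x, 0, out, 0, 50);
                out[1] = choose_replica(weight, 5, x, 1, out, 1, 50);
                if (out[1] < 0)
                        incomplete++;
        }
        printf("incomplete mappings with choose_tries=50: %d\n", incomplete);
        return 0;
}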

Another way to achieve the same result would be for crush_bucket_choose / bucket_straw2_choose[2] to ignore the items that have already been chosen. It could be done by looping over the list of chosen items, but that would mean N (number of replicas) lookups for each item.

The current implementation is optimized for the cases where collisions are rare. However, when the weights of the items are two orders of magnitude apart or more, choose_tries has to be set to a very large value (more than 1000) for the mapping to succeed. In practice that is not a problem, as it is unlikely that a host is 100 times bigger than another host ;-)

When fixing an uneven distribution[3], CRUSH weights are sometimes set to values with that kind of difference (1 versus 100) to compensate for the probability bias and/or a small number of samples. For instance, when there are 5 hosts with weights 5 1 1 1 1, modifying the weight set fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 2000. The value of choose_tries has to be increased a lot for a gain that gets smaller and smaller, and the CPU usage goes up. In this situation, an alternative implementation of CRUSH collision handling seems like a good idea.
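
A rough estimate, treating each retry as an independent draw proportional
to the weights (which is only approximately true), shows why choose_tries
explodes in that case:

  P(a retry lands on the 8.9 item)  ~= 8.9 / 8.94    ~= 0.9955
  P(1000 retries all collide)       ~= 0.9955 ^ 1000 ~= 1.1%
  P(2000 retries all collide)       ~= 0.9955 ^ 2000 ~= 0.013%

So whenever the first replica lands on the big item, a small but non-zero
fraction of the values still cannot be mapped, even with choose_tries 2000.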

Instead of failing after choose_tries attempts, crush_bucket_choose could be called with the list of already chosen items and loop over them to ensure none of them are candidates. The result will always be correct but the implementation is more expensive. The default choose_tries could even be set to a lower value (19 instead of 50?) since handling 30 more collisions is likely more expensive than looping over the items already selected.

What do you think ?

Cheers

P.S. Note that even in this border case, modifying the weights to 7.1, 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (a twofold improvement instead of a tenfold one). It just cannot do better because it hits a limitation of the current CRUSH implementation, and that limitation does not look too difficult to fix.


[1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
[2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
[3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-07 18:45 Proposal for a CRUSH collision fallback Loic Dachary
@ 2017-05-07 21:34 ` Loic Dachary
  2017-05-08  3:09 ` Sage Weil
  1 sibling, 0 replies; 8+ messages in thread
From: Loic Dachary @ 2017-05-07 21:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

FWIW the example provided (5 1 1 1 1) cannot lead to an even distribution for two replicas, with or without the proposed patch, because the item with weight 5 gets 56% of the values for the first replica. As a consequence, at most 44% of the values can place their second replica on that item. The 9% difference remains.
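
In numbers, assuming the two replicas of a value must land on two distinct
items:

  ideal share of the weight-5 item : 5 / 9 ~= 56% of all replicas
  maximum achievable share         : one of the two replicas per value = 50%

so part of the imbalance is unavoidable no matter how collisions are handled.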

A draft of the proposed implementation is below; it allows the optimization to get very close to the optimal distribution.

On 05/07/2017 08:45 PM, Loic Dachary wrote:
> Hi Sage,
> 
> When choosing the second replica, crush_bucket_choose picks an item and crush_choose_{indep,firstn} checks if it has already been chosen for the first replica. If so, it is discarded as a collision[1], r is modified and another attempt is made to get a different item. If no value is found after choose_tries attempt, the mapping is incomplete.
> 
> Another way to do the same would be that crush_bucket_choose / bucket_straw2_choose[2] ignores the items that have already been chosen. It could be done by looping over the list but that would mean N (number of replicas) lookups for each item.
> 
> The current implementation is optimized for the cases where collisions are rare. However, when the weights of the items are two order of magnitude appart or more, choose_tries has to be set to a very large value (more than 1000) for the mapping to succeed. In practice that is not a problem as it is unlikely that a host is 100 times bigger than another host ;-)
> 
> When fixing an uneven distribution[3], CRUSH weights are sometime set to values with that kind of difference (1 against 100) to compensate for the probability bias and/or a small number of samples. For instance when there are 5 hosts with weights 5 1 1 1 1, modifying the weight set fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 2000. The value of choose_tries has to be increased a lot for a gain that is smaller and smaller and the CPU usage goes up. In this situation, an alternative implementation of the CRUSH collision seems a good idea.
> 
> Instead of failing after choose_tries attempts, crush_bucket_choose could be called with the list of already chosen items and loop over them to ensure none of them are candidate. The result will always be correct but the implementation more expensive. The default choose_tries could even be set to a lower value (19 instead of 50 ?) since it is likely more expensive to handle 30 more collisions rather than looping over each item already selected.
> 
> What do you think ?
> 
> Cheers
> 
> P.S. Note that even in this border case modifying the weights to 7.1, 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better instead of ten times better). Only it cannot do better because it hits a limitation of the current CRUSH implementation. But it looks like it is not too difficult to fix.
> 
> 
> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> 

diff --git a/crush/libcrush/crush/mapper.c b/crush/libcrush/crush/mapper.c
index c71b614..7d422dc 100644
--- a/crush/libcrush/crush/mapper.c
+++ b/crush/libcrush/crush/mapper.c
@@ -322,7 +322,7 @@ static inline int *get_choose_arg_ids(const struct crush_bucket_straw2 *bucket,
 
 static int bucket_straw2_choose(const struct crush_bucket_straw2 *bucket,
 				int x, int r, const struct crush_choose_arg *arg,
-                                int position)
+                                int position, int *out)
 {
 	unsigned int i, high = 0;
 	unsigned int u;
@@ -331,7 +331,14 @@ static int bucket_straw2_choose(const struct crush_bucket_straw2 *bucket,
         int *ids = get_choose_arg_ids(bucket, arg);
 	for (i = 0; i < bucket->h.size; i++) {
                 dprintk("weight 0x%x item %d\n", weights[i], ids[i]);
-		if (weights[i]) {
+                int skip = 0;
+                for (int j = 0; j < position; j++) {
+                  if (bucket->h.items[i] == out[j]) {
+                    skip = 1;
+                    break;
+                  }
+                }
+		if (skip == 0 && weights[i]) {
 			u = crush_hash32_3(bucket->h.hash, x, ids[i], r);
 			u &= 0xffff;
 
@@ -372,7 +379,7 @@ static int crush_bucket_choose(const struct crush_bucket *in,
 			       struct crush_work_bucket *work,
 			       int x, int r,
                                const struct crush_choose_arg *arg,
-                               int position)
+                               int position, int *out)
 {
 	dprintk(" crush_bucket_choose %d x=%d r=%d\n", in->id, x, r);
 	BUG_ON(in->size == 0);
@@ -394,7 +401,7 @@ static int crush_bucket_choose(const struct crush_bucket *in,
 	case CRUSH_BUCKET_STRAW2:
 		return bucket_straw2_choose(
 			(const struct crush_bucket_straw2 *)in,
-			x, r, arg, position);
+			x, r, arg, position, out);
 	default:
 		dprintk("unknown bucket %d alg %d\n", in->id, in->alg);
 		return in->items[0];
@@ -511,7 +518,7 @@ parent_r %d stable %d\n",
 						in, work->work[-1-in->id],
 						x, r,
                                                 (choose_args ? &choose_args[-1-in->id] : 0),
-                                                outpos);
+                                                outpos, out);
 				if (item >= map->max_devices) {
 					dprintk("   bad item %d\n", item);
 					skip_rep = 1;
@@ -721,7 +728,7 @@ static void crush_choose_indep(const struct crush_map *map,
 					in, work->work[-1-in->id],
 					x, r,
                                         (choose_args ? &choose_args[-1-in->id] : 0),
-                                        outpos);
+                                        outpos, out);
 				if (item >= map->max_devices) {
 					dprintk("   bad item %d\n", item);
 					out[rep] = CRUSH_ITEM_NONE;


-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-07 18:45 Proposal for a CRUSH collision fallback Loic Dachary
  2017-05-07 21:34 ` Loic Dachary
@ 2017-05-08  3:09 ` Sage Weil
  2017-05-08  6:51   ` Loic Dachary
  1 sibling, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-05-08  3:09 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development


On Sun, 7 May 2017, Loic Dachary wrote:
> Hi Sage,
> 
> When choosing the second replica, crush_bucket_choose picks an item and 
> crush_choose_{indep,firstn} checks if it has already been chosen for the 
> first replica. If so, it is discarded as a collision[1], r is modified 
> and another attempt is made to get a different item. If no value is 
> found after choose_tries attempt, the mapping is incomplete.
> 
> Another way to do the same would be that crush_bucket_choose / 
> bucket_straw2_choose[2] ignores the items that have already been chosen. 
> It could be done by looping over the list but that would mean N (number 
> of replicas) lookups for each item.
> 
> The current implementation is optimized for the cases where collisions 
> are rare. However, when the weights of the items are two order of 
> magnitude appart or more, choose_tries has to be set to a very large 
> value (more than 1000) for the mapping to succeed. In practice that is 
> not a problem as it is unlikely that a host is 100 times bigger than 
> another host ;-)
> 
> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
> values with that kind of difference (1 against 100) to compensate for 
> the probability bias and/or a small number of samples. For instance when 
> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
> 2000. The value of choose_tries has to be increased a lot for a gain 
> that is smaller and smaller and the CPU usage goes up. In this 
> situation, an alternative implementation of the CRUSH collision seems a 
> good idea.
> 
> Instead of failing after choose_tries attempts, crush_bucket_choose 
> could be called with the list of already chosen items and loop over them 
> to ensure none of them are candidate. The result will always be correct 
> but the implementation more expensive. The default choose_tries could 
> even be set to a lower value (19 instead of 50 ?) since it is likely 
> more expensive to handle 30 more collisions rather than looping over 
> each item already selected.
> 
> What do you think ?

I think this direction is promising!  The problem is that I think it's not 
quite as simple as you suggest, since you may be choosing over multiple 
levels of a hierarchy.  If the weight tree is something like

      4
    /   \
   2     2
  / \   / \
 1   1 1   1
 a   b c   d

and you chose a, then yes, if you get back into the left branch you can 
filter it out of the straw2 selections.  And num_rep is usually small so 
that won't be expensive.  But you also need the first choice at the top 
level of the hierarchy to weight the left *tree* with 1 instead of 2.

I think this could be done by adjusting the hierarchical weights as you go 
(and I think one of Adam's early passes at the multipick problem did 
something similar), but it's a bit more complex.
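
Something along these lines, as a rough sketch only (the parent array and
the helper name are made up for illustration, not existing mapper.c
structures):

/* After 'leaf' has been chosen for one replica, subtract its weight from
 * every ancestor bucket in a scratch copy of the weights, so that the next
 * descent draws straw2 over the *remaining* capacity: in the tree above,
 * choosing 'a' drops the left subtree from 2 to 1. */
static void discount_chosen_leaf(int *weight, const int *parent,
                                 int leaf, int leaf_weight)
{
        int node = parent[leaf];
        while (node != -1) {            /* -1 marks the parent of the root */
                weight[node] -= leaf_weight;
                if (weight[node] < 0)
                        weight[node] = 0;
                node = parent[node];
        }
}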

It seems worth pursuing, though!

And dynamically doing this only after the first N 'normal' attempts fail 
seems like a good way to avoid slowing down the common path (no 
collisions).  I suspect the optimal N is probably closer to 5 than 19, 
though?

sage


> 
> Cheers
> 
> P.S. Note that even in this border case modifying the weights to 7.1, 
> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
> instead of ten times better). Only it cannot do better because it hits a 
> limitation of the current CRUSH implementation. But it looks like it is 
> not too difficult to fix.
> 
> 
> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> 
> 


* Re: Proposal for a CRUSH collision fallback
  2017-05-08  3:09 ` Sage Weil
@ 2017-05-08  6:51   ` Loic Dachary
  2017-05-08 13:42     ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2017-05-08  6:51 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development



On 05/08/2017 05:09 AM, Sage Weil wrote:
> On Sun, 7 May 2017, Loic Dachary wrote:
>> Hi Sage,
>>
>> When choosing the second replica, crush_bucket_choose picks an item and 
>> crush_choose_{indep,firstn} checks if it has already been chosen for the 
>> first replica. If so, it is discarded as a collision[1], r is modified 
>> and another attempt is made to get a different item. If no value is 
>> found after choose_tries attempt, the mapping is incomplete.
>>
>> Another way to do the same would be that crush_bucket_choose / 
>> bucket_straw2_choose[2] ignores the items that have already been chosen. 
>> It could be done by looping over the list but that would mean N (number 
>> of replicas) lookups for each item.
>>
>> The current implementation is optimized for the cases where collisions 
>> are rare. However, when the weights of the items are two order of 
>> magnitude appart or more, choose_tries has to be set to a very large 
>> value (more than 1000) for the mapping to succeed. In practice that is 
>> not a problem as it is unlikely that a host is 100 times bigger than 
>> another host ;-)
>>
>> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
>> values with that kind of difference (1 against 100) to compensate for 
>> the probability bias and/or a small number of samples. For instance when 
>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
>> 2000. The value of choose_tries has to be increased a lot for a gain 
>> that is smaller and smaller and the CPU usage goes up. In this 
>> situation, an alternative implementation of the CRUSH collision seems a 
>> good idea.
>>
>> Instead of failing after choose_tries attempts, crush_bucket_choose 
>> could be called with the list of already chosen items and loop over them 
>> to ensure none of them are candidate. The result will always be correct 
>> but the implementation more expensive. The default choose_tries could 
>> even be set to a lower value (19 instead of 50 ?) since it is likely 
>> more expensive to handle 30 more collisions rather than looping over 
>> each item already selected.
>>
>> What do you think ?
> 
> I think this direction is promising!  The problem is that I think it's not 
> quite as simple as you suggest, since you may be choosing over multiple 
> levels of a hierarchy.  If the weight tree is something like
> 
>       4
>     /   \
>    2     2
>   / \   / \
>  1   1 1   1
>  a   b c   d
> 
> and you chose a, then yes, if you get back into the left branch you can 
> filter it out of the straw2 selections.  And num_rep is usually small so 
> that won't be expensive.  But you also need the first choice at the top 
> level of the hierarchy to weight the left *tree* with 1 instead of 2.

I don't understand why this is necessary. Here are the scenarios I have in mind:


        /         \
       /           \
  r1  4        r2   4         rack (failure domain)
    /   \         /   \
   2     2       2     2      host
  / \   / \     / \   / \
 1   1 1   1   1   1 1   1    device
 a   b c   d   e   f g   h

Say value 10 ends up in a the first time: it first went through rack
r1, which is the failure domain. If value 10 also ends up in r1 the
second time, straw2 will skip/collide it at that level, because r1 is
stored in out while a is stored in out2.

The only case I can think of that requires the collision to be resolved
at a higher level of the hierarchy is when there is no alternative.



        /         \
       /           \
  r1  4        r2   4         rack
    /             /   \
h1 2             2     2      host     (failure domain)
  /             / \   / \
 1             1   1 1   1    device
 a             e   f g   h

If 10 ends up in h1 the first time and again the second time, it will
collide because there is no alternative within h1. It will then retry
the descent (retry_descent); ftotal increases, which feeds into r, and
it gets another chance to land on a host that is not h1.

I must be missing a use case :-)


> 
> I think this could be done by adjusting the hierarchical weights as you go 
> (and I think one of Adam's early passes at the multipick problem did 
> something similar), but it's a bit more complex.
> 
> It seems worth pursuing, though!
> 
> And dynamically doing this only after the first N 'normal' attempts fail 
> seems like a good way to avoid slowing down the common path (no 
> collisions).  I suspect the optimal N is probably closer to 5 than 19, 
> though?
> 
> sage
> 
> 
>>
>> Cheers
>>
>> P.S. Note that even in this border case modifying the weights to 7.1, 
>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
>> instead of ten times better). Only it cannot do better because it hits a 
>> limitation of the current CRUSH implementation. But it looks like it is 
>> not too difficult to fix.
>>
>>
>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-08  6:51   ` Loic Dachary
@ 2017-05-08 13:42     ` Sage Weil
  2017-05-08 14:45       ` Loic Dachary
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2017-05-08 13:42 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development


On Mon, 8 May 2017, Loic Dachary wrote:
> On 05/08/2017 05:09 AM, Sage Weil wrote:
> > On Sun, 7 May 2017, Loic Dachary wrote:
> >> Hi Sage,
> >>
> >> When choosing the second replica, crush_bucket_choose picks an item and 
> >> crush_choose_{indep,firstn} checks if it has already been chosen for the 
> >> first replica. If so, it is discarded as a collision[1], r is modified 
> >> and another attempt is made to get a different item. If no value is 
> >> found after choose_tries attempt, the mapping is incomplete.
> >>
> >> Another way to do the same would be that crush_bucket_choose / 
> >> bucket_straw2_choose[2] ignores the items that have already been chosen. 
> >> It could be done by looping over the list but that would mean N (number 
> >> of replicas) lookups for each item.
> >>
> >> The current implementation is optimized for the cases where collisions 
> >> are rare. However, when the weights of the items are two order of 
> >> magnitude appart or more, choose_tries has to be set to a very large 
> >> value (more than 1000) for the mapping to succeed. In practice that is 
> >> not a problem as it is unlikely that a host is 100 times bigger than 
> >> another host ;-)
> >>
> >> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
> >> values with that kind of difference (1 against 100) to compensate for 
> >> the probability bias and/or a small number of samples. For instance when 
> >> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
> >> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
> >> 2000. The value of choose_tries has to be increased a lot for a gain 
> >> that is smaller and smaller and the CPU usage goes up. In this 
> >> situation, an alternative implementation of the CRUSH collision seems a 
> >> good idea.
> >>
> >> Instead of failing after choose_tries attempts, crush_bucket_choose 
> >> could be called with the list of already chosen items and loop over them 
> >> to ensure none of them are candidate. The result will always be correct 
> >> but the implementation more expensive. The default choose_tries could 
> >> even be set to a lower value (19 instead of 50 ?) since it is likely 
> >> more expensive to handle 30 more collisions rather than looping over 
> >> each item already selected.
> >>
> >> What do you think ?
> > 
> > I think this direction is promising!  The problem is that I think it's not 
> > quite as simple as you suggest, since you may be choosing over multiple 
> > levels of a hierarchy.  If the weight tree is something like
> > 
> >       4
> >     /   \
> >    2     2
> >   / \   / \
> >  1   1 1   1
> >  a   b c   d
> > 
> > and you chose a, then yes, if you get back into the left branch you can 
> > filter it out of the straw2 selections.  And num_rep is usually small so 
> > that won't be expensive.  But you also need the first choice at the top 
> > level of the hierarchy to weight the left *tree* with 1 instead of 2.
> 
> I don't understand why this is necessary ? Here are the scenarios I have in mind:
> 
> 
>         /         \
>        /           \
>   r1  4        r2   4         rack (failure domain)
>     /   \         /   \
>    2     2       2     2      host
>   / \   / \     / \   / \
>  1   1 1   1   1   1 1   1    device
>  a   b c   d   e   f g   h
> 
> Say value 10 ends up in a the first time, it first went through rack
> r1 which is the failure domain. If value 10 also ends up in r1 the
> second time, straw2 will skip/collide it at that level because r1 is
> stored in out while a is stored in out2.
> 
> There only case I can think of that requires collision to be resolved
> in a higher hierarchical level is when there is no alternative.
> 
> 
> 
>         /         \
>        /           \
>   r1  4        r2   4         rack
>     /             /   \
> h1 2             2     2      host     (failure domain)
>   /             / \   / \
>  1             1   1 1   1    device
>  a             e   f g   h
> 
> If 10 ends up in h1 the first time and the second time, it will
> collide because there is no alternative. It will then retry_descent,
> ftotal increases which goes into r and it gets another chance at landing on a host
> that's not h1.

The problem isn't when choose[leaf] is specifying the intermediate 
level (rack or host in your examples); it's when there is an intervening 
level that a single choose is crossing.  In your first tree,

          root 6
>         /         \
>        /           \
>   r1  4        r2   4         rack
>     /   \         /   \
> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>   / \   / \     / \   / \
>  1   1 1   1   1   1 1   1    device
>  a   b c   d   e   f g   h

let's say the failure domain is the host instead of the rack.  If we
choose a (h1) the first time, then for subsequent descents from the root
we still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
host (h2).  This is normally fine because we reject the entire descent
for 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's
as if the other attempts never happened).  But the smarter straw2 would
choose h2 100% of the time for r1, and you'll end up with 2x more items
in that second position on h2 than you want.
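
In rough numbers: once a (h1) holds the first replica, the usable weight is
2 under r1 (h2 only) and 4 under r2, so the second descent should pick r1
about 2/6 = 1/3 of the time; with the unmodified rack weights it still
picks r1 4/8 = 1/2 of the time, and all of that share now goes to h2.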

Does that make sense?

sage


> 
> I must be missing a use case :-)
> 
> 
> > 
> > I think this could be done by adjusting the hierarchical weights as you go 
> > (and I think one of Adam's early passes at the multipick problem did 
> > something similar), but it's a bit more complex.
> > 
> > It seems worth pursuing, though!
> > 
> > And dynamically doing this only after the first N 'normal' attempts fail 
> > seems like a good way to avoid slowing down the common path (no 
> > collisions).  I suspect the optimal N is probably closer to 5 than 19, 
> > though?
> > 
> > sage
> > 
> > 
> >>
> >> Cheers
> >>
> >> P.S. Note that even in this border case modifying the weights to 7.1, 
> >> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
> >> instead of ten times better). Only it cannot do better because it hits a 
> >> limitation of the current CRUSH implementation. But it looks like it is 
> >> not too difficult to fix.
> >>
> >>
> >> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
> >> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
> >> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
> >> -- 
> >> Loïc Dachary, Artisan Logiciel Libre
> >>
> 
> -- 
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: Proposal for a CRUSH collision fallback
  2017-05-08 13:42     ` Sage Weil
@ 2017-05-08 14:45       ` Loic Dachary
  2017-05-18 16:48         ` Ning Yao
  0 siblings, 1 reply; 8+ messages in thread
From: Loic Dachary @ 2017-05-08 14:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development



On 05/08/2017 03:42 PM, Sage Weil wrote:
> On Mon, 8 May 2017, Loic Dachary wrote:
>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>> Hi Sage,
>>>>
>>>> When choosing the second replica, crush_bucket_choose picks an item and 
>>>> crush_choose_{indep,firstn} checks if it has already been chosen for the 
>>>> first replica. If so, it is discarded as a collision[1], r is modified 
>>>> and another attempt is made to get a different item. If no value is 
>>>> found after choose_tries attempt, the mapping is incomplete.
>>>>
>>>> Another way to do the same would be that crush_bucket_choose / 
>>>> bucket_straw2_choose[2] ignores the items that have already been chosen. 
>>>> It could be done by looping over the list but that would mean N (number 
>>>> of replicas) lookups for each item.
>>>>
>>>> The current implementation is optimized for the cases where collisions 
>>>> are rare. However, when the weights of the items are two order of 
>>>> magnitude appart or more, choose_tries has to be set to a very large 
>>>> value (more than 1000) for the mapping to succeed. In practice that is 
>>>> not a problem as it is unlikely that a host is 100 times bigger than 
>>>> another host ;-)
>>>>
>>>> When fixing an uneven distribution[3], CRUSH weights are sometime set to 
>>>> values with that kind of difference (1 against 100) to compensate for 
>>>> the probability bias and/or a small number of samples. For instance when 
>>>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set 
>>>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries 
>>>> 2000. The value of choose_tries has to be increased a lot for a gain 
>>>> that is smaller and smaller and the CPU usage goes up. In this 
>>>> situation, an alternative implementation of the CRUSH collision seems a 
>>>> good idea.
>>>>
>>>> Instead of failing after choose_tries attempts, crush_bucket_choose 
>>>> could be called with the list of already chosen items and loop over them 
>>>> to ensure none of them are candidate. The result will always be correct 
>>>> but the implementation more expensive. The default choose_tries could 
>>>> even be set to a lower value (19 instead of 50 ?) since it is likely 
>>>> more expensive to handle 30 more collisions rather than looping over 
>>>> each item already selected.
>>>>
>>>> What do you think ?
>>>
>>> I think this direction is promising!  The problem is that I think it's not 
>>> quite as simple as you suggest, since you may be choosing over multiple 
>>> levels of a hierarchy.  If the weight tree is something like
>>>
>>>       4
>>>     /   \
>>>    2     2
>>>   / \   / \
>>>  1   1 1   1
>>>  a   b c   d
>>>
>>> and you chose a, then yes, if you get back into the left branch you can 
>>> filter it out of the straw2 selections.  And num_rep is usually small so 
>>> that won't be expensive.  But you also need the first choice at the top 
>>> level of the hierarchy to weight the left *tree* with 1 instead of 2.
>>
>> I don't understand why this is necessary ? Here are the scenarios I have in mind:
>>
>>
>>         /         \
>>        /           \
>>   r1  4        r2   4         rack (failure domain)
>>     /   \         /   \
>>    2     2       2     2      host
>>   / \   / \     / \   / \
>>  1   1 1   1   1   1 1   1    device
>>  a   b c   d   e   f g   h
>>
>> Say value 10 ends up in a the first time, it first went through rack
>> r1 which is the failure domain. If value 10 also ends up in r1 the
>> second time, straw2 will skip/collide it at that level because r1 is
>> stored in out while a is stored in out2.
>>
>> There only case I can think of that requires collision to be resolved
>> in a higher hierarchical level is when there is no alternative.
>>
>>
>>
>>         /         \
>>        /           \
>>   r1  4        r2   4         rack
>>     /             /   \
>> h1 2             2     2      host     (failure domain)
>>   /             / \   / \
>>  1             1   1 1   1    device
>>  a             e   f g   h
>>
>> If 10 ends up in h1 the first time and the second time, it will
>> collide because there is no alternative. It will then retry_descent,
>> ftotal increases which goes into r and it gets another chance at landing on a host
>> that's not h1.
> 
> The problem isn't when choose[leaf] is specifying the intermediate 
> level (rack or host in your examples); it's when there is an intervening 
> level that a single choose is crossing.  In your first tree,
> 
>           root 6
>>         /         \
>>        /           \
>>   r1  4        r2   4         rack
>>     /   \         /   \
>> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>>   / \   / \     / \   / \
>>  1   1 1   1   1   1 1   1    device
>>  a   b c   d   e   f g   h
> 
> let's say the failure domain is the host instead of the rack.  If we 
> choose a (h1) the first time, for subsequent descents from the root we 
> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable 
> host (h2).  This normally is fine because we reject the entire descent for 
> 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's as if 
> the other attempts never happened).  But with the smarter straw2 choose h2 
> 100% for r1 and you'll end up with 2x more tiems for that second position 
> on h2 than you want.
> 
> Does that make sense?

Absolutely, thanks for explaining. As far as getting an even distribution is concerned, the easier route seems to be to find a way to calculate when a combination of weights (5 1 1 1 1 with 2 replicas, for instance) cannot be evenly distributed.
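
A simple necessary condition, assuming one replica per failure domain: a
domain with weight w_i can hold at most one of the k replicas of each
value, i.e. at most 1/k of all placements, so an even distribution requires

  w_i / (w_1 + ... + w_n) <= 1 / k    for every failure domain i

and 5 1 1 1 1 with k = 2 already violates it, since 5/9 > 1/2.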

Cheers

> 
> sage
> 
> 
>>
>> I must be missing a use case :-)
>>
>>
>>>
>>> I think this could be done by adjusting the hierarchical weights as you go 
>>> (and I think one of Adam's early passes at the multipick problem did 
>>> something similar), but it's a bit more complex.
>>>
>>> It seems worth pursuing, though!
>>>
>>> And dynamically doing this only after the first N 'normal' attempts fail 
>>> seems like a good way to avoid slowing down the common path (no 
>>> collisions).  I suspect the optimal N is probably closer to 5 than 19, 
>>> though?
>>>
>>> sage
>>>
>>>
>>>>
>>>> Cheers
>>>>
>>>> P.S. Note that even in this border case modifying the weights to 7.1, 
>>>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better 
>>>> instead of ten times better). Only it cannot do better because it hits a 
>>>> limitation of the current CRUSH implementation. But it looks like it is 
>>>> not too difficult to fix.
>>>>
>>>>
>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>> -- 
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: Proposal for a CRUSH collision fallback
  2017-05-08 14:45       ` Loic Dachary
@ 2017-05-18 16:48         ` Ning Yao
  2017-05-18 21:21           ` Loic Dachary
  0 siblings, 1 reply; 8+ messages in thread
From: Ning Yao @ 2017-05-18 16:48 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, Ceph Development

Hi, Loic

We noticed this problem before and found that it is not easy to
solve. The problem occurs not only in the collision case, but also in
the rejection case, when is_out is called:

static int is_out(const struct crush_map *map,
                  const __u32 *weight, int weight_max,
                  int item, int x)
{
        if (item >= weight_max)
                return 1;
        if (weight[item] >= 0x10000)    /* reweight 1.0: never rejected here */
                return 0;
        if (weight[item] == 0)          /* reweight 0 (out): always rejected */
                return 1;
        /* fractional reweight: accept the item for a matching fraction of the values x */
        if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
            < weight[item])
                return 0;
        return 1;
}

That means the same kind of retry happens whenever we use ceph osd out
[osd.id], when an osd is marked out automatically because it is down,
or when we use reweight-by-utilization to alter the reweight of an osd:
the item can then be rejected by is_out (always when its reweight is 0,
probabilistically when it is fractional), and the rejection is retried
much like a collision.
I think the approach Sage mentioned may be used to solve this problem as well.
Regards
Ning Yao


2017-05-08 22:45 GMT+08:00 Loic Dachary <loic@dachary.org>:
>
>
> On 05/08/2017 03:42 PM, Sage Weil wrote:
>> On Mon, 8 May 2017, Loic Dachary wrote:
>>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>>> Hi Sage,
>>>>>
>>>>> When choosing the second replica, crush_bucket_choose picks an item and
>>>>> crush_choose_{indep,firstn} checks if it has already been chosen for the
>>>>> first replica. If so, it is discarded as a collision[1], r is modified
>>>>> and another attempt is made to get a different item. If no value is
>>>>> found after choose_tries attempt, the mapping is incomplete.
>>>>>
>>>>> Another way to do the same would be that crush_bucket_choose /
>>>>> bucket_straw2_choose[2] ignores the items that have already been chosen.
>>>>> It could be done by looping over the list but that would mean N (number
>>>>> of replicas) lookups for each item.
>>>>>
>>>>> The current implementation is optimized for the cases where collisions
>>>>> are rare. However, when the weights of the items are two order of
>>>>> magnitude appart or more, choose_tries has to be set to a very large
>>>>> value (more than 1000) for the mapping to succeed. In practice that is
>>>>> not a problem as it is unlikely that a host is 100 times bigger than
>>>>> another host ;-)
>>>>>
>>>>> When fixing an uneven distribution[3], CRUSH weights are sometime set to
>>>>> values with that kind of difference (1 against 100) to compensate for
>>>>> the probability bias and/or a small number of samples. For instance when
>>>>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set
>>>>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries
>>>>> 2000. The value of choose_tries has to be increased a lot for a gain
>>>>> that is smaller and smaller and the CPU usage goes up. In this
>>>>> situation, an alternative implementation of the CRUSH collision seems a
>>>>> good idea.
>>>>>
>>>>> Instead of failing after choose_tries attempts, crush_bucket_choose
>>>>> could be called with the list of already chosen items and loop over them
>>>>> to ensure none of them are candidate. The result will always be correct
>>>>> but the implementation more expensive. The default choose_tries could
>>>>> even be set to a lower value (19 instead of 50 ?) since it is likely
>>>>> more expensive to handle 30 more collisions rather than looping over
>>>>> each item already selected.
>>>>>
>>>>> What do you think ?
>>>>
>>>> I think this direction is promising!  The problem is that I think it's not
>>>> quite as simple as you suggest, since you may be choosing over multiple
>>>> levels of a hierarchy.  If the weight tree is something like
>>>>
>>>>       4
>>>>     /   \
>>>>    2     2
>>>>   / \   / \
>>>>  1   1 1   1
>>>>  a   b c   d
>>>>
>>>> and you chose a, then yes, if you get back into the left branch you can
>>>> filter it out of the straw2 selections.  And num_rep is usually small so
>>>> that won't be expensive.  But you also need the first choice at the top
>>>> level of the hierarchy to weight the left *tree* with 1 instead of 2.
>>>
>>> I don't understand why this is necessary ? Here are the scenarios I have in mind:
>>>
>>>
>>>         /         \
>>>        /           \
>>>   r1  4        r2   4         rack (failure domain)
>>>     /   \         /   \
>>>    2     2       2     2      host
>>>   / \   / \     / \   / \
>>>  1   1 1   1   1   1 1   1    device
>>>  a   b c   d   e   f g   h
>>>
>>> Say value 10 ends up in a the first time, it first went through rack
>>> r1 which is the failure domain. If value 10 also ends up in r1 the
>>> second time, straw2 will skip/collide it at that level because r1 is
>>> stored in out while a is stored in out2.
>>>
>>> There only case I can think of that requires collision to be resolved
>>> in a higher hierarchical level is when there is no alternative.
>>>
>>>
>>>
>>>         /         \
>>>        /           \
>>>   r1  4        r2   4         rack
>>>     /             /   \
>>> h1 2             2     2      host     (failure domain)
>>>   /             / \   / \
>>>  1             1   1 1   1    device
>>>  a             e   f g   h
>>>
>>> If 10 ends up in h1 the first time and the second time, it will
>>> collide because there is no alternative. It will then retry_descent,
>>> ftotal increases which goes into r and it gets another chance at landing on a host
>>> that's not h1.
>>
>> The problem isn't when choose[leaf] is specifying the intermediate
>> level (rack or host in your examples); it's when there is an intervening
>> level that a single choose is crossing.  In your first tree,
>>
>>           root 6
>>>         /         \
>>>        /           \
>>>   r1  4        r2   4         rack
>>>     /   \         /   \
>>> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>>>   / \   / \     / \   / \
>>>  1   1 1   1   1   1 1   1    device
>>>  a   b c   d   e   f g   h
>>
>> let's say the failure domain is the host instead of the rack.  If we
>> choose a (h1) the first time, for subsequent descents from the root we
>> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
>> host (h2).  This normally is fine because we reject the entire descent for
>> 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's as if
>> the other attempts never happened).  But with the smarter straw2 choose h2
>> 100% for r1 and you'll end up with 2x more tiems for that second position
>> on h2 than you want.
>>
>> Does that make sense?
>
> Absolutely, thanks for explaining. The easier route as far as getting an even distribution is concerned seems to find a way to calculate when a combination of weights (5 1 1 1 1 with 2 replica for instance) cannot be evenly distributed.
>
> Cheers
>
>>
>> sage
>>
>>
>>>
>>> I must be missing a use case :-)
>>>
>>>
>>>>
>>>> I think this could be done by adjusting the hierarchical weights as you go
>>>> (and I think one of Adam's early passes at the multipick problem did
>>>> something similar), but it's a bit more complex.
>>>>
>>>> It seems worth pursuing, though!
>>>>
>>>> And dynamically doing this only after the first N 'normal' attempts fail
>>>> seems like a good way to avoid slowing down the common path (no
>>>> collisions).  I suspect the optimal N is probably closer to 5 than 19,
>>>> though?
>>>>
>>>> sage
>>>>
>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> P.S. Note that even in this border case modifying the weights to 7.1,
>>>>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better
>>>>> instead of ten times better). Only it cannot do better because it hits a
>>>>> limitation of the current CRUSH implementation. But it looks like it is
>>>>> not too difficult to fix.
>>>>>
>>>>>
>>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Proposal for a CRUSH collision fallback
  2017-05-18 16:48         ` Ning Yao
@ 2017-05-18 21:21           ` Loic Dachary
  0 siblings, 0 replies; 8+ messages in thread
From: Loic Dachary @ 2017-05-18 21:21 UTC (permalink / raw)
  To: Ning Yao; +Cc: Sage Weil, Ceph Development

Hi,

On 05/18/2017 07:48 PM, Ning Yao wrote:
> Hi, Loic
> 
> we already notice this problem before and find that it is not easy to
> solved.  The problem not only occurs in collision case, but also in
> rejection when is_out is called:
> 
> static int is_out(const struct crush_map *map,
>  const __u32 *weight, int weight_max,
>  int item, int x)
> {
> if (item >= weight_max)
> return 1;
> if (weight[item] >= 0x10000)
> return 0;
> if (weight[item] == 0)
> return 1;
> if ((crush_hash32_2(CRUSH_HASH_RJENKINS1, x, item) & 0xffff)
>    < weight[item])
> return 0;
> return 1;
> }
> 
> That means if we use ceph osd out [osd.id], or osd is marked out
> automatically when it is down, or we use reweight-by-utilization to
> alter the reweight for an osd.
> I think it may  use the same way as Sage mentioned to solve this problem.
> Regards
> Ning Yao

You're right, it's the same kind of problem, thanks for mentioning that.

Cheers

> 
> 
> 2017-05-08 22:45 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>
>>
>> On 05/08/2017 03:42 PM, Sage Weil wrote:
>>> On Mon, 8 May 2017, Loic Dachary wrote:
>>>> On 05/08/2017 05:09 AM, Sage Weil wrote:
>>>>> On Sun, 7 May 2017, Loic Dachary wrote:
>>>>>> Hi Sage,
>>>>>>
>>>>>> When choosing the second replica, crush_bucket_choose picks an item and
>>>>>> crush_choose_{indep,firstn} checks if it has already been chosen for the
>>>>>> first replica. If so, it is discarded as a collision[1], r is modified
>>>>>> and another attempt is made to get a different item. If no value is
>>>>>> found after choose_tries attempt, the mapping is incomplete.
>>>>>>
>>>>>> Another way to do the same would be that crush_bucket_choose /
>>>>>> bucket_straw2_choose[2] ignores the items that have already been chosen.
>>>>>> It could be done by looping over the list but that would mean N (number
>>>>>> of replicas) lookups for each item.
>>>>>>
>>>>>> The current implementation is optimized for the cases where collisions
>>>>>> are rare. However, when the weights of the items are two order of
>>>>>> magnitude appart or more, choose_tries has to be set to a very large
>>>>>> value (more than 1000) for the mapping to succeed. In practice that is
>>>>>> not a problem as it is unlikely that a host is 100 times bigger than
>>>>>> another host ;-)
>>>>>>
>>>>>> When fixing an uneven distribution[3], CRUSH weights are sometime set to
>>>>>> values with that kind of difference (1 against 100) to compensate for
>>>>>> the probability bias and/or a small number of samples. For instance when
>>>>>> there are 5 hosts with weights 5 1 1 1 1, modifying the weight set
>>>>>> fails. It goes as far as 8.9, 0.01, 0.01, 0.01, 0.01 with choose_tries
>>>>>> 2000. The value of choose_tries has to be increased a lot for a gain
>>>>>> that is smaller and smaller and the CPU usage goes up. In this
>>>>>> situation, an alternative implementation of the CRUSH collision seems a
>>>>>> good idea.
>>>>>>
>>>>>> Instead of failing after choose_tries attempts, crush_bucket_choose
>>>>>> could be called with the list of already chosen items and loop over them
>>>>>> to ensure none of them are candidate. The result will always be correct
>>>>>> but the implementation more expensive. The default choose_tries could
>>>>>> even be set to a lower value (19 instead of 50 ?) since it is likely
>>>>>> more expensive to handle 30 more collisions rather than looping over
>>>>>> each item already selected.
>>>>>>
>>>>>> What do you think ?
>>>>>
>>>>> I think this direction is promising!  The problem is that I think it's not
>>>>> quite as simple as you suggest, since you may be choosing over multiple
>>>>> levels of a hierarchy.  If the weight tree is something like
>>>>>
>>>>>       4
>>>>>     /   \
>>>>>    2     2
>>>>>   / \   / \
>>>>>  1   1 1   1
>>>>>  a   b c   d
>>>>>
>>>>> and you chose a, then yes, if you get back into the left branch you can
>>>>> filter it out of the straw2 selections.  And num_rep is usually small so
>>>>> that won't be expensive.  But you also need the first choice at the top
>>>>> level of the hierarchy to weight the left *tree* with 1 instead of 2.
>>>>
>>>> I don't understand why this is necessary ? Here are the scenarios I have in mind:
>>>>
>>>>
>>>>         /         \
>>>>        /           \
>>>>   r1  4        r2   4         rack (failure domain)
>>>>     /   \         /   \
>>>>    2     2       2     2      host
>>>>   / \   / \     / \   / \
>>>>  1   1 1   1   1   1 1   1    device
>>>>  a   b c   d   e   f g   h
>>>>
>>>> Say value 10 ends up in a the first time, it first went through rack
>>>> r1 which is the failure domain. If value 10 also ends up in r1 the
>>>> second time, straw2 will skip/collide it at that level because r1 is
>>>> stored in out while a is stored in out2.
>>>>
>>>> There only case I can think of that requires collision to be resolved
>>>> in a higher hierarchical level is when there is no alternative.
>>>>
>>>>
>>>>
>>>>         /         \
>>>>        /           \
>>>>   r1  4        r2   4         rack
>>>>     /             /   \
>>>> h1 2             2     2      host     (failure domain)
>>>>   /             / \   / \
>>>>  1             1   1 1   1    device
>>>>  a             e   f g   h
>>>>
>>>> If 10 ends up in h1 the first time and the second time, it will
>>>> collide because there is no alternative. It will then retry_descent,
>>>> ftotal increases which goes into r and it gets another chance at landing on a host
>>>> that's not h1.
>>>
>>> The problem isn't when choose[leaf] is specifying the intermediate
>>> level (rack or host in your examples); it's when there is an intervening
>>> level that a single choose is crossing.  In your first tree,
>>>
>>>           root 6
>>>>         /         \
>>>>        /           \
>>>>   r1  4        r2   4         rack
>>>>     /   \         /   \
>>>> h1 2  h2 2    h3 2  h4 2      host (failure domain)
>>>>   / \   / \     / \   / \
>>>>  1   1 1   1   1   1 1   1    device
>>>>  a   b c   d   e   f g   h
>>>
>>> let's say the failure domain is the host instead of the rack.  If we
>>> choose a (h1) the first time, for subsequent descents from the root we
>>> still pick r1 and r2 equally (4 vs 4) even though r1 only has 1 usable
>>> host (h2).  This normally is fine because we reject the entire descent for
>>> 50% of the r1 choices, so *effectively* r1's weight is only 2 (it's as if
>>> the other attempts never happened).  But with the smarter straw2 choose h2
>>> 100% for r1 and you'll end up with 2x more tiems for that second position
>>> on h2 than you want.
>>>
>>> Does that make sense?
>>
>> Absolutely, thanks for explaining. The easier route as far as getting an even distribution is concerned seems to find a way to calculate when a combination of weights (5 1 1 1 1 with 2 replica for instance) cannot be evenly distributed.
>>
>> Cheers
>>
>>>
>>> sage
>>>
>>>
>>>>
>>>> I must be missing a use case :-)
>>>>
>>>>
>>>>>
>>>>> I think this could be done by adjusting the hierarchical weights as you go
>>>>> (and I think one of Adam's early passes at the multipick problem did
>>>>> something similar), but it's a bit more complex.
>>>>>
>>>>> It seems worth pursuing, though!
>>>>>
>>>>> And dynamically doing this only after the first N 'normal' attempts fail
>>>>> seems like a good way to avoid slowing down the common path (no
>>>>> collisions).  I suspect the optimal N is probably closer to 5 than 19,
>>>>> though?
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> P.S. Note that even in this border case modifying the weights to 7.1,
>>>>>> 0.5, 0.5, 0.4, 0.4 significantly improves the distribution (twice better
>>>>>> instead of ten times better). Only it cannot do better because it hits a
>>>>>> limitation of the current CRUSH implementation. But it looks like it is
>>>>>> not too difficult to fix.
>>>>>>
>>>>>>
>>>>>> [1] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L541
>>>>>> [2] https://github.com/ceph/ceph/blob/master/src/crush/mapper.c#L332
>>>>>> [3] http://marc.info/?l=ceph-devel&m=149407691823750&w=2
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

