All of lore.kernel.org
 help / color / mirror / Atom feed
* crush multipick anomaly
@ 2017-01-26  3:05 Sage Weil
  2017-01-26 11:13 ` Loic Dachary
  2017-02-13 10:36 ` Loic Dachary
  0 siblings, 2 replies; 70+ messages in thread
From: Sage Weil @ 2017-01-26  3:05 UTC (permalink / raw)
  To: ceph-devel

This is a longstanding bug,

	http://tracker.ceph.com/issues/15653

that causes low-weighted devices to get more data than they should. Loic's 
recent activity resurrected discussion on the original PR

	https://github.com/ceph/ceph/pull/10218

but since it's closed and almost nobody will see it I'm moving the 
discussion here.

The main news is that I have a simple adjustment for the weights that 
works (almost perfectly) for the 2nd round of placements.  The solution is 
pretty simple, although as with most probabilities it tends to make my 
brain hurt.

The idea is that, on the second round, the original weight for the small 
OSD (call it P(pick small)) isn't what we should use.  Instead, we want 
P(pick small | first pick not small).  Since P(a|b) (the probability of a 
given b) is P(a && b) / P(b),

 P(pick small | first pick not small)
 = P(pick small && first pick not small) / P(first pick not small)

The last term is easy to calculate,

 P(first pick not small) = (total_weight - small_weight) / total_weight

and the && term is the distribution we're trying to produce.  For exmaple, 
if small has 1/10 the weight, then we should see 1/10th of the PGs have 
their second replica be the small OSD.  So

 P(pick small && first pick not small) = small_weight / total_weight

Putting those together,

 P(pick small | first pick not small)
 = P(pick small && first pick not small) / P(first pick not small)
 = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
 = small_weight / (total_weight - small_weight)

This is, on the second round, we should adjust the weights by the above so 
that we get the right distribution of second choices.  It turns out it 
works to adjust *all* weights like this to get hte conditional probability 
that they weren't already chosen.

I have a branch that hacks this into straw2 and it appears to work 
properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the 
current code, you get

$ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
rule 0 (data), x = 0..40000000, numrep = 2..2
rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
  device 0:             19765965        [9899364,9866601]
  device 1:             19768033        [9899444,9868589]
  device 2:             19769938        [9901770,9868168]
  device 3:             19766918        [9898851,9868067]
  device 6:             929148  [400572,528576]

which is very close for the first replica (primary), but way off for the 
second.  With my hacky change,

rule 0 (data), x = 0..40000000, numrep = 2..2
rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
  device 0:             19797315        [9899364,9897951]
  device 1:             19799199        [9899444,9899755]
  device 2:             19801016        [9901770,9899246]
  device 3:             19797906        [9898851,9899055]
  device 6:             804566  [400572,403994]

which is quite close, but still skewing slightly high (by a big less than 
1%).

Next steps:

1- generalize this for >2 replicas
2- figure out why it skews high
3- make this work for multi-level hierarchical descent

sage


^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2017-04-27 22:14 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-26  3:05 crush multipick anomaly Sage Weil
2017-01-26 11:13 ` Loic Dachary
2017-01-26 11:51   ` kefu chai
2017-02-03 14:37   ` Loic Dachary
2017-02-03 14:47     ` Sage Weil
2017-02-03 15:08       ` Loic Dachary
2017-02-03 18:54         ` Loic Dachary
2017-02-06  3:08           ` Jaze Lee
2017-02-06  8:18             ` Loic Dachary
2017-02-06 14:11               ` Jaze Lee
2017-02-06 17:07                 ` Loic Dachary
2017-02-03 15:26       ` Dan van der Ster
2017-02-03 17:37         ` Dan van der Ster
2017-02-06  8:31           ` Loic Dachary
2017-02-06  9:13             ` Dan van der Ster
2017-02-06 16:53               ` Dan van der Ster
2017-02-13 10:36 ` Loic Dachary
2017-02-13 14:21   ` Sage Weil
2017-02-13 18:50     ` Loic Dachary
2017-02-13 19:16       ` Sage Weil
2017-02-13 20:18         ` Loic Dachary
2017-02-13 21:01           ` Loic Dachary
2017-02-13 21:15             ` Sage Weil
2017-02-13 21:19               ` Gregory Farnum
2017-02-13 21:26                 ` Sage Weil
2017-02-13 21:43               ` Loic Dachary
2017-02-16 22:04     ` Pedro López-Adeva
2017-02-22  7:52       ` Loic Dachary
2017-02-22 11:26         ` Pedro López-Adeva
2017-02-22 11:38           ` Loic Dachary
2017-02-22 11:46             ` Pedro López-Adeva
2017-02-25  0:38               ` Loic Dachary
2017-02-25  8:41                 ` Pedro López-Adeva
2017-02-25  9:02                   ` Loic Dachary
2017-03-02  9:43                     ` Loic Dachary
2017-03-02  9:58                       ` Pedro López-Adeva
2017-03-02 10:31                         ` Loic Dachary
2017-03-07 23:06                         ` Sage Weil
2017-03-09  8:47                           ` Pedro López-Adeva
2017-03-18  9:21                             ` Loic Dachary
2017-03-19 22:31                               ` Loic Dachary
2017-03-20 10:49                                 ` Loic Dachary
2017-03-23 11:49                                   ` Pedro López-Adeva
2017-03-23 14:13                                     ` Loic Dachary
2017-03-23 15:32                                       ` Pedro López-Adeva
2017-03-23 16:18                                         ` Loic Dachary
2017-03-25 18:42                                         ` Sage Weil
     [not found]                                           ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
2017-03-27  2:33                                             ` Sage Weil
2017-03-27  6:45                                               ` Loic Dachary
     [not found]                                                 ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
2017-03-27  9:27                                                   ` Adam Kupczyk
2017-03-27 10:29                                                     ` Loic Dachary
2017-03-27 10:37                                                     ` Pedro López-Adeva
2017-03-27 13:39                                                     ` Sage Weil
2017-03-28  6:52                                                       ` Adam Kupczyk
2017-03-28  9:49                                                         ` Spandan Kumar Sahu
2017-03-28 13:35                                                         ` Sage Weil
2017-03-27 13:24                                                 ` Sage Weil
2017-04-11 15:22                                         ` Loic Dachary
2017-04-22 16:51                                         ` Loic Dachary
2017-04-25 15:04                                           ` Pedro López-Adeva
2017-04-25 17:46                                             ` Loic Dachary
2017-04-26 21:08                                             ` Loic Dachary
2017-04-26 22:25                                               ` Loic Dachary
2017-04-27  6:12                                                 ` Loic Dachary
2017-04-27 16:47                                                   ` Loic Dachary
2017-04-27 22:14                                                     ` Loic Dachary
2017-02-13 14:53   ` Gregory Farnum
2017-02-20  8:47     ` Loic Dachary
2017-02-20 17:32       ` Gregory Farnum
2017-02-20 19:31         ` Loic Dachary

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.