All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jaze Lee <jazeltq@gmail.com>
To: Loic Dachary <loic@dachary.org>
Cc: ceph-devel@vger.kernel.org
Subject: Re: crush multipick anomaly
Date: Mon, 6 Feb 2017 22:11:37 +0800	[thread overview]
Message-ID: <CAKZcxgeo3Y=uCZ5rFmG5ULcBw1D+V2=ar1hqHYHCt_E_HcPgTA@mail.gmail.com> (raw)
In-Reply-To: <4c8bdc58-b227-2eaa-fa61-e5c26e1e5f0d@dachary.org>

2017-02-06 16:18 GMT+08:00 Loic Dachary <loic@dachary.org>:
> Hi,
>
> On 02/06/2017 04:08 AM, Jaze Lee wrote:
>> It is more complicated than i have expected.....
>> I viewed http://tracker.ceph.com/issues/15653, and know that if the
>> replica number is
>> bigger than the host we choose, we may meet the problem.
>>
>> That is
>> if we have
>> host: a b c d
>> host: e f  g h
>> host: i  j  k  l
>>
>> we only choose one from each host for replica three, and the distribution
>> is as we expected?    Right ?
>>
>>
>> The problem described in http://tracker.ceph.com/issues/15653, may happen
>> when
>> 1)
>>   host: a b c d e f g
>>
>> and we choose all three replica from this host. But this is few happen
>> in production. Right?
>>
>>
>> May be i do not understand the problem correctly ?
>
> The problem also happens with host: a b c d e f g when you try to get three replicas that are not on the same disk. You can experiment with Dan's script

Yes, I mean why we choose three from one host ? In production the host
number is ALWAYS
more than replica number.....

root
   rack-0
      host A
      host B
   rack-1
      host C
      host D
   rack -2
       host E
       host F

when choose pg 1.1 for osd, it will always choose one from rack-0, one
from rack-1, one from rack-2.
any pg will cause one be choosed from rack-0, rack-1, rack-2.

The problem is happened when we want to choose more than one osd from
a bucket for a pg, right ?



>
> https://gist.github.com/anonymous/929d799d5f80794b293783acb9108992
>
> Cheers
>
>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> 2017-02-04 2:54 GMT+08:00 Loic Dachary <loic@dachary.org>:
>>>
>>>
>>> On 02/03/2017 04:08 PM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 02/03/2017 03:47 PM, Sage Weil wrote:
>>>>> On Fri, 3 Feb 2017, Loic Dachary wrote:
>>>>>> On 01/26/2017 12:13 PM, Loic Dachary wrote:
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> Still trying to understand what you did :-) I have one question below.
>>>>>>>
>>>>>>> On 01/26/2017 04:05 AM, Sage Weil wrote:
>>>>>>>> This is a longstanding bug,
>>>>>>>>
>>>>>>>>   http://tracker.ceph.com/issues/15653
>>>>>>>>
>>>>>>>> that causes low-weighted devices to get more data than they should. Loic's
>>>>>>>> recent activity resurrected discussion on the original PR
>>>>>>>>
>>>>>>>>   https://github.com/ceph/ceph/pull/10218
>>>>>>>>
>>>>>>>> but since it's closed and almost nobody will see it I'm moving the
>>>>>>>> discussion here.
>>>>>>>>
>>>>>>>> The main news is that I have a simple adjustment for the weights that
>>>>>>>> works (almost perfectly) for the 2nd round of placements.  The solution is
>>>>>>>> pretty simple, although as with most probabilities it tends to make my
>>>>>>>> brain hurt.
>>>>>>>>
>>>>>>>> The idea is that, on the second round, the original weight for the small
>>>>>>>> OSD (call it P(pick small)) isn't what we should use.  Instead, we want
>>>>>>>> P(pick small | first pick not small).  Since P(a|b) (the probability of a
>>>>>>>> given b) is P(a && b) / P(b),
>>>>>>>
>>>>>>> >From the record this is explained at https://en.wikipedia.org/wiki/Conditional_probability#Kolmogorov_definition
>>>>>>>
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>
>>>>>>>> The last term is easy to calculate,
>>>>>>>>
>>>>>>>>  P(first pick not small) = (total_weight - small_weight) / total_weight
>>>>>>>>
>>>>>>>> and the && term is the distribution we're trying to produce.
>>>>>>>
>>>>>>> https://en.wikipedia.org/wiki/Conditional_probability describs A && B (using a non ascii symbol...) as the "probability of the joint of events A and B". I don't understand what that means. Is there a definition somewhere ?
>>>>>>>
>>>>>>>> For exmaple,
>>>>>>>> if small has 1/10 the weight, then we should see 1/10th of the PGs have
>>>>>>>> their second replica be the small OSD.  So
>>>>>>>>
>>>>>>>>  P(pick small && first pick not small) = small_weight / total_weight
>>>>>>>>
>>>>>>>> Putting those together,
>>>>>>>>
>>>>>>>>  P(pick small | first pick not small)
>>>>>>>>  = P(pick small && first pick not small) / P(first pick not small)
>>>>>>>>  = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
>>>>>>>>  = small_weight / (total_weight - small_weight)
>>>>>>>>
>>>>>>>> This is, on the second round, we should adjust the weights by the above so
>>>>>>>> that we get the right distribution of second choices.  It turns out it
>>>>>>>> works to adjust *all* weights like this to get hte conditional probability
>>>>>>>> that they weren't already chosen.
>>>>>>>>
>>>>>>>> I have a branch that hacks this into straw2 and it appears to work
>>>>>>>
>>>>>>> This is https://github.com/liewegas/ceph/commit/wip-crush-multipick
>>>>>>
>>>>>> In
>>>>>>
>>>>>> https://github.com/liewegas/ceph/commit/wip-crush-multipick#diff-0df13ad294f6585c322588cfe026d701R316
>>>>>>
>>>>>> double neww = oldw / (bucketw - oldw) * bucketw;
>>>>>>
>>>>>> I don't get why we need  "* bucketw" at the end ?
>>>>>
>>>>> It's just to keep the values within a reasonable range so that we don't
>>>>> lose precision by dropping down into small integers.
>>>>>
>>>>> I futzed around with this some more last week trying to get the third
>>>>> replica to work and ended up doubting that this piece is correct.  The
>>>>> ratio between the big and small OSDs in my [99 99 99 99 4] example varies
>>>>> slightly from what I would expect from first principles and what I get out
>>>>> of this derivation by about 1%.. which would explain the bias I as seeing.
>>>>>
>>>>> I'm hoping we can find someone with a strong stats/probability background
>>>>> and loads of free time who can tackle this...
>>>>>
>>>>
>>>> It would help to formulate the problem into a self contained puzzle to present a mathematician. I tried to do it last week but failed. I'll give it another shot and submit a draft, hoping something bad could be the start of something better ;-)
>>>
>>> Here is what I have. I realize this is not good but I'm hoping someone more knowledgeable will pity me and provide something sensible. Otherwise I'm happy to keep making a fool of myself :-) In the following a bin is the device, the ball is a replica and the color is the object id.
>>>
>>> We have D bins and each bin can hold D(B) balls. All balls have the
>>> same size. There is exactly X balls of the same color. Each ball must
>>> be placed in a bin that does not already contain a ball of the same
>>> color.
>>>
>>> What distribution guarantees that, for all X, the bins are filled in
>>> the same proportion ?
>>>
>>> Details
>>> =======
>>>
>>> * One placement: all balls are the same color and we place each of them
>>>   in a bin with a probability of:
>>>
>>>     P(BIN) = BIN(B) / SUM(BINi(B) for i in [1..D])
>>>
>>>   so that bins are equally filled regardless of their capacity.
>>>
>>> * Two placements: for each ball there is exactly one other ball of the
>>>   same color.  A ball is placed as in experience 1 and the chosen bin
>>>   is set aside. The other ball of the same color is placed as in
>>>   experience 1 with the remaining bins. The probability for a ball
>>>   to be placed in a given BIN is:
>>>
>>>     P(BIN) + P(all bins but BIN | BIN)
>>>
>>> Examples
>>> ========
>>>
>>> For instance we have 5 bins, a, b, c, d, e and they can hold:
>>>
>>> a = 10 million balls
>>> b = 10 million balls
>>> c = 10 million balls
>>> d = 10 million balls
>>> e =  1 million balls
>>>
>>> In the first experience with place each ball in
>>>
>>> a with a probability of 10 / ( 10 + 10 + 10 + 10 + 1 ) = 10 / 41
>>> same for b, c, d
>>> e with a probability of 1 / 41
>>>
>>> after 100,000 placements, the bins have
>>>
>>> a = 243456
>>> b = 243624
>>> c = 244486
>>> d = 243881
>>> e = 24553
>>>
>>> they are
>>>
>>> a = 2.43 % full
>>> b = 2.43 % full
>>> c = 2.44 % full
>>> d = 2.43 % full
>>> e = 0.24 % full
>>>
>>> In the second experience
>>>
>>>
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the
>>>>>>>> current code, you get
>>>>>>>>
>>>>>>>> $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>   device 0:             19765965        [9899364,9866601]
>>>>>>>>   device 1:             19768033        [9899444,9868589]
>>>>>>>>   device 2:             19769938        [9901770,9868168]
>>>>>>>>   device 3:             19766918        [9898851,9868067]
>>>>>>>>   device 6:             929148  [400572,528576]
>>>>>>>>
>>>>>>>> which is very close for the first replica (primary), but way off for the
>>>>>>>> second.  With my hacky change,
>>>>>>>>
>>>>>>>> rule 0 (data), x = 0..40000000, numrep = 2..2
>>>>>>>> rule 0 (data) num_rep 2 result size == 2:       40000001/40000001
>>>>>>>>   device 0:             19797315        [9899364,9897951]
>>>>>>>>   device 1:             19799199        [9899444,9899755]
>>>>>>>>   device 2:             19801016        [9901770,9899246]
>>>>>>>>   device 3:             19797906        [9898851,9899055]
>>>>>>>>   device 6:             804566  [400572,403994]
>>>>>>>>
>>>>>>>> which is quite close, but still skewing slightly high (by a big less than
>>>>>>>> 1%).
>>>>>>>>
>>>>>>>> Next steps:
>>>>>>>>
>>>>>>>> 1- generalize this for >2 replicas
>>>>>>>> 2- figure out why it skews high
>>>>>>>> 3- make this work for multi-level hierarchical descent
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre



-- 
谦谦君子

  reply	other threads:[~2017-02-06 14:11 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-26  3:05 crush multipick anomaly Sage Weil
2017-01-26 11:13 ` Loic Dachary
2017-01-26 11:51   ` kefu chai
2017-02-03 14:37   ` Loic Dachary
2017-02-03 14:47     ` Sage Weil
2017-02-03 15:08       ` Loic Dachary
2017-02-03 18:54         ` Loic Dachary
2017-02-06  3:08           ` Jaze Lee
2017-02-06  8:18             ` Loic Dachary
2017-02-06 14:11               ` Jaze Lee [this message]
2017-02-06 17:07                 ` Loic Dachary
2017-02-03 15:26       ` Dan van der Ster
2017-02-03 17:37         ` Dan van der Ster
2017-02-06  8:31           ` Loic Dachary
2017-02-06  9:13             ` Dan van der Ster
2017-02-06 16:53               ` Dan van der Ster
2017-02-13 10:36 ` Loic Dachary
2017-02-13 14:21   ` Sage Weil
2017-02-13 18:50     ` Loic Dachary
2017-02-13 19:16       ` Sage Weil
2017-02-13 20:18         ` Loic Dachary
2017-02-13 21:01           ` Loic Dachary
2017-02-13 21:15             ` Sage Weil
2017-02-13 21:19               ` Gregory Farnum
2017-02-13 21:26                 ` Sage Weil
2017-02-13 21:43               ` Loic Dachary
2017-02-16 22:04     ` Pedro López-Adeva
2017-02-22  7:52       ` Loic Dachary
2017-02-22 11:26         ` Pedro López-Adeva
2017-02-22 11:38           ` Loic Dachary
2017-02-22 11:46             ` Pedro López-Adeva
2017-02-25  0:38               ` Loic Dachary
2017-02-25  8:41                 ` Pedro López-Adeva
2017-02-25  9:02                   ` Loic Dachary
2017-03-02  9:43                     ` Loic Dachary
2017-03-02  9:58                       ` Pedro López-Adeva
2017-03-02 10:31                         ` Loic Dachary
2017-03-07 23:06                         ` Sage Weil
2017-03-09  8:47                           ` Pedro López-Adeva
2017-03-18  9:21                             ` Loic Dachary
2017-03-19 22:31                               ` Loic Dachary
2017-03-20 10:49                                 ` Loic Dachary
2017-03-23 11:49                                   ` Pedro López-Adeva
2017-03-23 14:13                                     ` Loic Dachary
2017-03-23 15:32                                       ` Pedro López-Adeva
2017-03-23 16:18                                         ` Loic Dachary
2017-03-25 18:42                                         ` Sage Weil
     [not found]                                           ` <CAHMeWhHV=5u=QFggXFNMn2MzGLgQJ6nMnae+ZgK=MB5yYr1p9g@mail.gmail.com>
2017-03-27  2:33                                             ` Sage Weil
2017-03-27  6:45                                               ` Loic Dachary
     [not found]                                                 ` <CAHMeWhGuJnu2664VTxomQ-wJewBEPjRT_VGWH+g-v5k3ka6X5Q@mail.gmail.com>
2017-03-27  9:27                                                   ` Adam Kupczyk
2017-03-27 10:29                                                     ` Loic Dachary
2017-03-27 10:37                                                     ` Pedro López-Adeva
2017-03-27 13:39                                                     ` Sage Weil
2017-03-28  6:52                                                       ` Adam Kupczyk
2017-03-28  9:49                                                         ` Spandan Kumar Sahu
2017-03-28 13:35                                                         ` Sage Weil
2017-03-27 13:24                                                 ` Sage Weil
2017-04-11 15:22                                         ` Loic Dachary
2017-04-22 16:51                                         ` Loic Dachary
2017-04-25 15:04                                           ` Pedro López-Adeva
2017-04-25 17:46                                             ` Loic Dachary
2017-04-26 21:08                                             ` Loic Dachary
2017-04-26 22:25                                               ` Loic Dachary
2017-04-27  6:12                                                 ` Loic Dachary
2017-04-27 16:47                                                   ` Loic Dachary
2017-04-27 22:14                                                     ` Loic Dachary
2017-02-13 14:53   ` Gregory Farnum
2017-02-20  8:47     ` Loic Dachary
2017-02-20 17:32       ` Gregory Farnum
2017-02-20 19:31         ` Loic Dachary

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKZcxgeo3Y=uCZ5rFmG5ULcBw1D+V2=ar1hqHYHCt_E_HcPgTA@mail.gmail.com' \
    --to=jazeltq@gmail.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=loic@dachary.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.