* revisiting uneven CRUSH distributions
@ 2017-04-30 14:15 Loic Dachary
  2017-05-01 17:15 ` Stefan Priebe - Profihost AG
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Loic Dachary @ 2017-04-30 14:15 UTC (permalink / raw)
  To: Ceph Development

Hi,

Ideally CRUSH distributes PGs evenly across OSDs so that they all fill
up in the same proportion. If one OSD is 75% full, all other OSDs are
expected to be 75% full as well.

In reality the distribution is even only when more than 100,000 PGs
are distributed in a pool of size 1 (i.e. no replication).

In small clusters there are only a few thousand PGs, which is not
enough to get an even distribution. Running the following with
python-crush[1] shows a 15% difference when distributing 1,000 PGs on
6 devices. Only with 1,000,000 PGs does the difference drop under 1%.

  for PGs in 1000 10000 100000 1000000 ; do
    crush analyze --replication-count 1 \
                  --type device \
                  --values-count $PGs \
                  --rule data \
                  --crushmap tests/sample-crushmap.json
  done
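
For intuition (this is just a back-of-the-envelope model, not part of
python-crush), the order of magnitude of these differences can be
estimated by treating each PG placement as an independent uniform draw
over the 6 equally weighted devices; the per-device count is then
binomial and its relative spread shrinks with the square root of the
number of PGs:

    # Back-of-the-envelope check (not python-crush): per-device PG
    # counts modeled as binomial draws over 6 equally weighted devices.
    import math

    for pgs in (1000, 10000, 100000, 1000000):
        p = 1.0 / 6
        mean = pgs * p
        sigma = math.sqrt(pgs * p * (1 - p))
        print("%8d PGs: ~%.1f%% spread (2 sigma)" % (pgs, 200 * sigma / mean))

In this simplified model the 2-sigma spread is about 14% at 1,000 PGs
and under 0.5% at 1,000,000, consistent with the figures above.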

In larger clusters, even though a greater number of PGs is
distributed, there are at most a few dozen devices per host and the
problem remains. On a machine with 24 OSDs, each expected to handle a
few hundred PGs, only a few thousand PGs in total are distributed,
which is not enough to get an even distribution.

There is a secondary reason for the distribution to be uneven when
there is more than one replica: the second replica must be placed on a
different device than the first. This conditional probability is not
taken into account by CRUSH and would create an uneven distribution if
more than 10,000 PGs were distributed per OSD[2]. But a given OSD can
only handle a few hundred PGs, so this conditional probability bias is
dominated by the unevenness caused by the low number of PGs.
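
For intuition only (this is not how CRUSH itself draws replicas), a
toy simulation that picks two distinct devices per PG by weighted
sampling without replacement shows the bias: the heaviest device ends
up with a smaller share of the placements than its weight would
suggest.

    # Toy illustration of the second-replica bias (not CRUSH itself).
    import random

    weights = [3, 1, 1, 1, 1]       # one device three times bigger
    counts = [0] * len(weights)
    trials = 100000
    for _ in range(trials):
        first = random.choices(range(len(weights)), weights=weights)[0]
        rest = [w if i != first else 0 for i, w in enumerate(weights)]
        second = random.choices(range(len(weights)), weights=rest)[0]
        counts[first] += 1
        counts[second] += 1
    print([round(c / (2.0 * trials), 3) for c in counts])
    # device 0 gets ~36% of the placements instead of 3/7 ~ 43%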

The uneven CRUSH distributions are always caused by a low number of
samples, even in large clusters. Since this noise (i.e. the difference
between the desired distribution and the actual distribution) is
random, it cannot be fixed by optimization methods. The
Nelder-Mead[3] simplex converges to a local minimum that is far from
the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
fails to find a gradient that would allow it to converge faster. And
even if it did, the local minimum found would be as often wrong as
with Nelder-Mead, only it would get there faster. A least mean squares
filter[5] is equally unable to suppress the noise created by the
uneven distribution because no set of coefficients can model random
noise.

With that in mind, I implemented a simple optimization algorithm[6],
first suggested by Thierry Delamare a few weeks ago. It goes like this
(a rough sketch in code follows below):

    - Distribute the desired number of PGs[7]
    - Subtract 1% of the weight of the OSD that is the most over used
    - Add the subtracted weight to the OSD that is the most under used
    - Repeat until the Kullback–Leibler divergence[8] is small enough

Quoting Adam Kupczyk, this works because:

  "...CRUSH is not random proces at all, it behaves in numerically
   stable way.  Specifically, if we increase weight on one node, we
   will get more PGs on this node and less on every other node:
   CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
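
A rough, self-contained sketch of that loop (the real implementation
lives in [6] and uses python-crush to compute the mapping; the
crush_simulate() stand-in below only exists to make the example
runnable on its own, and the parameters are illustrative):

    # Rough sketch of the loop above (illustrative, not the code in [6]).
    import hashlib
    import math

    def crush_simulate(weights, num_pgs):
        # stand-in for the CRUSH mapping: weighted rendezvous hashing,
        # one deterministic "straw" per (pg, device) pair, lowest wins
        counts = [0] * len(weights)
        for pg in range(num_pgs):
            straws = []
            for i, w in enumerate(weights):
                key = ("%d-%d" % (pg, i)).encode()
                h = int(hashlib.md5(key).hexdigest(), 16) % 2**32
                u = (h + 1) / 2.0**32            # uniform in (0, 1]
                straws.append(-math.log(u) / w)
            counts[straws.index(min(straws))] += 1
        return counts

    def kl_divergence(counts, target_weights):
        total, wsum = float(sum(counts)), float(sum(target_weights))
        return sum((c / total) * math.log((c / total) / (w / wsum))
                   for c, w in zip(counts, target_weights) if c > 0)

    def optimize(target_weights, num_pgs, step=0.01, epsilon=1e-3,
                 max_iter=300):
        weights = list(target_weights)
        for _ in range(max_iter):
            counts = crush_simulate(weights, num_pgs)
            if kl_divergence(counts, target_weights) < epsilon:
                break
            # over/under use is measured against the *desired* weights
            usage = [c / float(w) for c, w in zip(counts, target_weights)]
            over = usage.index(max(usage))
            under = usage.index(min(usage))
            delta = step * weights[over]   # take 1% from the most over used
            weights[over] -= delta
            weights[under] += delta        # give it to the most under used
        return weights

    print(optimize([10.1, 10, 10, 5, 5], 600))

On a real crushmap the counts would come from mapping the pool's PGs
through the rule being optimized[7], not from this stand-in.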

A nice side effect of this optimization algorithm is that it does not
change the weight of the bucket containing the items being
optimized. It is local to a bucket with no influence on the other
parts of the crushmap (modulo the conditional probability bias).

In all tests the situation improves by at least an order of
magnitude. For instance, when there is a 30% difference between two
OSDs, it drops to less than 3% after optimization.

The tests for the optimization method can be run with

   git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
   tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py

If anyone can think of a reason why this algorithm won't work in some
cases, please speak up :-)

Cheers

[1] python-crush http://crush.readthedocs.io/
[2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
[3] Nelder-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
[4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
[5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
[6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
[7] Predicting Ceph PG placement http://dachary.org/?p=4020
[8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: revisiting uneven CRUSH distributions
  2017-04-30 14:15 revisiting uneven CRUSH distributions Loic Dachary
@ 2017-05-01 17:15 ` Stefan Priebe - Profihost AG
  2017-05-01 17:47   ` Loic Dachary
       [not found] ` <CABZ+qqnqiUFbz=6CegW_o=2goOThpmoskDQ0oOUfE27jW0D17A@mail.gmail.com>
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-01 17:15 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

That sounds amazing! Is there any chance this will be backported to jewel?

Greets,
Stefan


* Re: revisiting uneven CRUSH distributions
  2017-05-01 17:15 ` Stefan Priebe - Profihost AG
@ 2017-05-01 17:47   ` Loic Dachary
  2017-05-01 18:06     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-01 17:47 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development

Hi Stefan,

On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
> That sounds amazing! Is there any chance this will be backported to jewel?

There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: revisiting uneven CRUSH distributions
  2017-05-01 17:47   ` Loic Dachary
@ 2017-05-01 18:06     ` Stefan Priebe - Profihost AG
  2017-05-01 23:12       ` Loic Dachary
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-01 18:06 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

On 01.05.2017 at 19:47, Loic Dachary wrote:
> Hi Stefan,
> 
> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>> That sounds amazing! Is there any chance this will be backported to jewel?
> 
> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.

I've lots of them ;-)

Will send you one via private e-mail in a few minutes.

Greets,
Stefan


* Re: revisiting uneven CRUSH distributions
  2017-05-01 18:06     ` Stefan Priebe - Profihost AG
@ 2017-05-01 23:12       ` Loic Dachary
  2017-05-02  5:43         ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-01 23:12 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development

It is working, with straw2 (your cluster is still using straw).

For instance for one host it goes from:

        ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
~name~
osd.24         149        159                 6.65     10.0      6.71
osd.29         149        159                 6.65     10.0      6.71
osd.0           69         77                11.04      8.0     11.59
osd.2           69         69                -0.50      0.0      0.00
osd.42         149        148                -0.73     -1.0     -0.67
osd.1           69         62               -10.59     -7.0    -10.14
osd.23          69         62               -10.59     -7.0    -10.14
osd.36         149        132               -11.46    -17.0    -11.41

to

        ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
~name~
osd.0           69         69                -0.50      0.0      0.00
osd.23          69         69                -0.50      0.0      0.00
osd.24         149        149                -0.06      0.0      0.00
osd.29         149        149                -0.06      0.0      0.00
osd.36         149        149                -0.06      0.0      0.00
osd.1           69         68                -1.94     -1.0     -1.45
osd.2           69         68                -1.94     -1.0     -1.45
osd.42         149        147                -1.40     -2.0     -1.34

By changing the weights to

[0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]

You could set these weights in the crushmap directly, so there would be no need for backporting.
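
For example, a small helper could turn the list above into the
corresponding reweight commands (the pairing of weights with OSD names
below is illustrative, not the actual order of the list):

    # Illustrative only: emit "ceph osd crush reweight" commands for an
    # optimized weight list. The osd names and their pairing with the
    # list are assumptions for the example; in practice both come from
    # the analyzed crushmap.
    optimized = [0.6609248140022604, 0.9148542821020436,
                 0.8174711575190294, 0.8870680217468655,
                 1.6031393139865695, 1.5871079208467038,
                 1.8784764188501162, 1.7308530904776616]
    osds = ["osd.0", "osd.1", "osd.2", "osd.23",
            "osd.24", "osd.29", "osd.36", "osd.42"]

    for name, weight in zip(osds, optimized):
        print("ceph osd crush reweight %s %.4f" % (name, weight))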



-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: revisiting uneven CRUSH distributions
  2017-05-01 23:12       ` Loic Dachary
@ 2017-05-02  5:43         ` Stefan Priebe - Profihost AG
  2017-05-02  5:48           ` Stefan Priebe - Profihost AG
  2017-05-02  7:32           ` Loic Dachary
  0 siblings, 2 replies; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-02  5:43 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

Hi Loic,

yes, I didn't change them to straw2 because I didn't see any
difference. I switched to straw2 now but it didn't change anything at
all.

If I use those weights manually, I'll have to adjust them on every
crush change on the cluster? That's something I don't really like to
do.

Greets,
Stefan


* Re: revisiting uneven CRUSH distributions
  2017-05-02  5:43         ` Stefan Priebe - Profihost AG
@ 2017-05-02  5:48           ` Stefan Priebe - Profihost AG
  2017-05-02  6:29             ` Alexandre DERUMIER
  2017-05-02  7:32           ` Loic Dachary
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-02  5:48 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

I created a new cluster under jewel but straw1 still seems to be the
default?

Greets,
Stefan


* Re: revisiting uneven CRUSH distributions
  2017-05-02  5:48           ` Stefan Priebe - Profihost AG
@ 2017-05-02  6:29             ` Alexandre DERUMIER
  2017-05-02  6:31               ` Stefan Priebe - Profihost AG
  2017-05-02  6:43               ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 37+ messages in thread
From: Alexandre DERUMIER @ 2017-05-02  6:29 UTC (permalink / raw)
  To: Stefan Priebe, Profihost AG; +Cc: Loic Dachary, ceph-devel

>>I created a new cluster under jewel but straw1 still seems to be the
>>default?

Hi Stefan, 

you need to upgrade the ceph tunables:

http://docs.ceph.com/docs/master/rados/operations/crush-map/

I think straw2 has been available since the hammer tunables (CRUSH_V4).


----- Mail original -----
De: "Stefan Priebe, Profihost AG" <s.priebe@profihost.ag>
À: "Loic Dachary" <loic@dachary.org>, "ceph-devel" <ceph-devel@vger.kernel.org>
Envoyé: Mardi 2 Mai 2017 07:48:26
Objet: Re: revisiting uneven CRUSH distributions

I created a new cluster under jewel but straw1 still seems to be the 
default? 

Greets, 
Stefan 

Am 02.05.2017 um 07:43 schrieb Stefan Priebe - Profihost AG: 
> Hi Loic, 
> 
> yes i didn't changed them to straw2 as i didn't saw any difference. I 
> switched to straw2 now but it didn't change anything at all. 
> 
> If i use those weights manuall i've to adjust them on every crush change 
> on the cluster? That's something i don't really like to do. 
> 
> Greets, 
> Stefan 
> 
> Am 02.05.2017 um 01:12 schrieb Loic Dachary: 
>> It is working, with straw2 (your cluster still is using straw). 
>> 
>> For instance for one host it goes from: 
>> 
>> ~expected~ ~objects~ ~over/under used %~ ~delta~ ~delta%~ 
>> ~name~ 
>> osd.24 149 159 6.65 10.0 6.71 
>> osd.29 149 159 6.65 10.0 6.71 
>> osd.0 69 77 11.04 8.0 11.59 
>> osd.2 69 69 -0.50 0.0 0.00 
>> osd.42 149 148 -0.73 -1.0 -0.67 
>> osd.1 69 62 -10.59 -7.0 -10.14 
>> osd.23 69 62 -10.59 -7.0 -10.14 
>> osd.36 149 132 -11.46 -17.0 -11.41 
>> 
>> to 
>> 
>> ~expected~ ~objects~ ~over/under used %~ ~delta~ ~delta%~ 
>> ~name~ 
>> osd.0 69 69 -0.50 0.0 0.00 
>> osd.23 69 69 -0.50 0.0 0.00 
>> osd.24 149 149 -0.06 0.0 0.00 
>> osd.29 149 149 -0.06 0.0 0.00 
>> osd.36 149 149 -0.06 0.0 0.00 
>> osd.1 69 68 -1.94 -1.0 -1.45 
>> osd.2 69 68 -1.94 -1.0 -1.45 
>> osd.42 149 147 -1.40 -2.0 -1.34 
>> 
>> By changing the weights to 
>> 
>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616] 
>> 
>> And you could set these weights on the crushmap, there would be no need for backporting. 
>> 
>> 
>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote: 
>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary: 
>>>> Hi Stefan, 
>>>> 
>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote: 
>>>>> That sounds amazing! Is there any chance this will be backported to jewel? 
>>>> 
>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that. 
>>> 
>>> I've lots of them ;-) 
>>> 
>>> Will sent you one via private e-mail in some minutes. 
>>> 
>>> Greets, 
>>> Stefan 
>>> 
>>>> Cheers 
>>>> 
>>>>> 
>>>>> Greets, 
>>>>> Stefan 
>>>>> 
>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary: 
>>>>>> Hi, 
>>>>>> 
>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in 
>>>>>> the same proportion. If an OSD is 75% full, it is expected that all 
>>>>>> other OSDs are also 75% full. 
>>>>>> 
>>>>>> In reality the distribution is even only when more than 100,000 PGs 
>>>>>> are distributed in a pool of size 1 (i.e. no replication). 
>>>>>> 
>>>>>> In small clusters there are a few thousands PGs and it is not enough 
>>>>>> to get an even distribution. Running the following with 
>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on 
>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%. 
>>>>>> 
>>>>>> for PGs in 1000 10000 100000 1000000 ; do 
>>>>>> crush analyze --replication-count 1 \ 
>>>>>> --type device \ 
>>>>>> --values-count $PGs \ 
>>>>>> --rule data \ 
>>>>>> --crushmap tests/sample-crushmap.json 
>>>>>> done 
>>>>>> 
>>>>>> In larger clusters, even though a greater number of PGs are 
>>>>>> distributed, there are at most a few dozens devices per host and the 
>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a 
>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which 
>>>>>> is not enough to get an even distribution. 
>>>>>> 
>>>>>> There is a secondary reason for the distribution to be uneven, when 
>>>>>> there is more than one replica. The second replica must be on a 
>>>>>> different device than the first replica. This conditional probability 
>>>>>> is not taken into account by CRUSH and would create an uneven 
>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But 
>>>>>> a given OSD can only handle a few hundred PGs and this conditional 
>>>>>> probability bias is dominated by the uneven distribution caused by the 
>>>>>> low number of PGs. 
>>>>>> 
>>>>>> The uneven CRUSH distributions are always caused by a low number of 
>>>>>> samples, even in large clusters. Since this noise (i.e. the difference 
>>>>>> between the desired distribution and the actual distribution) is 
>>>>>> random, it cannot be fixed by optimizations methods. The 
>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from 
>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4] 
>>>>>> fails to find a gradient that would allow it to converge faster. And 
>>>>>> even if it did, the local minimum found would be as often wrong as 
>>>>>> with Nedler-Mead, only it would go faster. A least mean squares 
>>>>>> filter[5] is equally unable to suppress the noise created by the 
>>>>>> uneven distribution because no coefficients can model a random noise. 
>>>>>> 
>>>>>> With that in mind, I implemented a simple optimization algorithm[6] 
>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes 
>>>>>> like this: 
>>>>>> 
>>>>>> - Distribute the desired number of PGs[7] 
>>>>>> - Subtract 1% of the weight of the OSD that is the most over used 
>>>>>> - Add the subtracted weight to the OSD that is the most under used 
>>>>>> - Repeat until the Kullback–Leibler divergence[8] is small enough 
>>>>>> 
>>>>>> Quoting Adam Kupczyk, this works because: 
>>>>>> 
>>>>>> "...CRUSH is not random proces at all, it behaves in numerically 
>>>>>> stable way. Specifically, if we increase weight on one node, we 
>>>>>> will get more PGs on this node and less on every other node: 
>>>>>> CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]" 
>>>>>> 
>>>>>> A nice side effect of this optimization algorithm is that it does not 
>>>>>> change the weight of the bucket containing the items being 
>>>>>> optimized. It is local to a bucket with no influence on the other 
>>>>>> parts of the crushmap (modulo the conditional probability bias). 
>>>>>> 
>>>>>> In all tests the situation improves at least by an order of 
>>>>>> magnitude. For instance when there is a 30% difference between two 
>>>>>> OSDs, it is down to less than 3% after optimization. 
>>>>>> 
>>>>>> The tests for the optimization method can be run with 
>>>>>> 
>>>>>> git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git 
>>>>>> tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py 
>>>>>> 
>>>>>> If anyone think of a reason why this algorithm won't work in some 
>>>>>> cases, please speak up :-) 
>>>>>> 
>>>>>> Cheers 
>>>>>> 
>>>>>> [1] python-crush http://crush.readthedocs.io/ 
>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 
>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method 
>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb 
>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter 
>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 
>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020 
>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence 
>>>>>> 
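To make the rebalancing loop quoted above concrete, here is a minimal
self-contained sketch. It does not use python-crush or the real CRUSH code:
placement is simulated with a straw2-like draw (pick the device maximizing
log(u)/weight for a hash-derived u), and all helper names (straw2_pick,
distribute, rebalance) are illustrative only, not part of any Ceph or
python-crush API. It only demonstrates the idea of moving 1% of weight from
the most over-used device to the most under-used one until the
Kullback-Leibler divergence gets small (or an iteration cap is hit):

# Self-contained toy illustration of the rebalancing loop quoted above.
# It does NOT use python-crush or the real CRUSH code: placement is
# simulated with a straw2-like draw, and every helper name here is
# illustrative only.
import hashlib
import math

def straw2_pick(pg, devices, weights):
    """Pick one device for a PG, straw2 style: argmax of log(u)/weight."""
    best, best_draw = None, None
    for dev, weight in zip(devices, weights):
        h = hashlib.sha256(("%s-%s" % (pg, dev)).encode()).hexdigest()
        u = (int(h, 16) % 10**8 + 1) / (10**8 + 1.0)  # uniform in (0, 1)
        draw = math.log(u) / weight                   # bigger weight => bigger draw
        if best_draw is None or draw > best_draw:
            best, best_draw = dev, draw
    return best

def distribute(pgs, devices, weights):
    counts = dict.fromkeys(devices, 0)
    for pg in range(pgs):
        counts[straw2_pick(pg, devices, weights)] += 1
    return counts

def kl_divergence(counts, targets, pgs):
    """Kullback-Leibler divergence of the actual vs the target distribution."""
    total = float(sum(targets.values()))
    return sum((counts[d] / float(pgs)) *
               math.log((counts[d] / float(pgs)) / (targets[d] / total))
               for d in counts if counts[d] > 0)

def rebalance(pgs, devices, target_weights, step=0.01, threshold=1e-4):
    targets = dict(zip(devices, target_weights))
    total = float(sum(target_weights))
    weights = list(target_weights)            # the weights we will adjust
    for _ in range(100):                      # bounded number of iterations
        counts = distribute(pgs, devices, weights)
        if kl_divergence(counts, targets, pgs) < threshold:
            break
        # over/under use is measured against the *target* weights
        usage = {d: counts[d] / (targets[d] / total * pgs) for d in devices}
        most_over = max(devices, key=lambda d: usage[d])
        most_under = min(devices, key=lambda d: usage[d])
        delta = step * weights[devices.index(most_over)]
        weights[devices.index(most_over)] -= delta   # subtract 1% ...
        weights[devices.index(most_under)] += delta  # ... and hand it over
    return weights, distribute(pgs, devices, weights)

if __name__ == "__main__":
    devices = ["osd.%d" % i for i in range(6)]
    weights, counts = rebalance(2048, devices, [1.0] * 6)
    print(counts)
    print(["%.3f" % w for w in weights])

Running it on a handful of equally weighted devices shows the per-device
counts tightening around the expected value while the adjusted weights drift
away from the nominal ones, which is also a direct way to observe the
numerical stability Adam describes: adding weight to one item reliably pulls
PGs onto it and off the others.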


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02  6:29             ` Alexandre DERUMIER
@ 2017-05-02  6:31               ` Stefan Priebe - Profihost AG
  2017-05-02  6:43               ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-02  6:31 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Loic Dachary, ceph-devel

Hi Alexandre,

I'm talking about a newly created cluster under jewel. Should straw2 be
the default? I'm always using ceph osd crush tunables optimal.

Greets,
Stefan

On 02.05.2017 at 08:29, Alexandre DERUMIER wrote:
>>> I created a new cluster under jewel but straw1 still seems to be the
>>> default?
> 
> Hi Stefan, 
> 
> you need to upgrade ceph tunables
> 
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
> 
> 
> I think straw2 has been available since the hammer tunables (CRUSH_V4 tunables)
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02  6:29             ` Alexandre DERUMIER
  2017-05-02  6:31               ` Stefan Priebe - Profihost AG
@ 2017-05-02  6:43               ` Stefan Priebe - Profihost AG
  2017-05-02  7:52                 ` Alexandre DERUMIER
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-02  6:43 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Loic Dachary, ceph-devel

Hi Alexandre,

I'm still missing something. I'm talking about a NEWLY created cluster, which
never ran under hammer. It has always been running jewel.

Stefan

On 02.05.2017 at 08:29, Alexandre DERUMIER wrote:
>>> I created a new cluster under jewel but straw1 still seems to be the
>>> default?
> 
> Hi Stefan, 
> 
> you need to upgrade ceph tunables
> 
> http://docs.ceph.com/docs/master/rados/operations/crush-map/
> 
> 
> I think straw2 has been available since the hammer tunables (CRUSH_V4 tunables)
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02  5:43         ` Stefan Priebe - Profihost AG
  2017-05-02  5:48           ` Stefan Priebe - Profihost AG
@ 2017-05-02  7:32           ` Loic Dachary
  2017-05-14 17:46             ` Loic Dachary
  1 sibling, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-02  7:32 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development



On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
> Hi Loic,
> 
> Yes, I didn't change them to straw2 as I didn't see any difference. I
> switched to straw2 now but it didn't change anything at all.

straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say that the optimization only works on straw2 buckets; it is not implemented for straw buckets.

> If I use those weights manually, I have to adjust them on every crush change
> on the cluster? That's something I don't really like to do.

This is not practical indeed :-) I'm hoping python-crush can automate that.

Cheers

> Greets,
> Stefan
> 
> On 02.05.2017 at 01:12, Loic Dachary wrote:
>> It is working, with straw2 (your cluster is still using straw).
>>
>> For instance for one host it goes from:
>>
>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>> ~name~
>> osd.24         149        159                 6.65     10.0      6.71
>> osd.29         149        159                 6.65     10.0      6.71
>> osd.0           69         77                11.04      8.0     11.59
>> osd.2           69         69                -0.50      0.0      0.00
>> osd.42         149        148                -0.73     -1.0     -0.67
>> osd.1           69         62               -10.59     -7.0    -10.14
>> osd.23          69         62               -10.59     -7.0    -10.14
>> osd.36         149        132               -11.46    -17.0    -11.41
>>
>> to
>>
>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>> ~name~
>> osd.0           69         69                -0.50      0.0      0.00
>> osd.23          69         69                -0.50      0.0      0.00
>> osd.24         149        149                -0.06      0.0      0.00
>> osd.29         149        149                -0.06      0.0      0.00
>> osd.36         149        149                -0.06      0.0      0.00
>> osd.1           69         68                -1.94     -1.0     -1.45
>> osd.2           69         68                -1.94     -1.0     -1.45
>> osd.42         149        147                -1.40     -2.0     -1.34
>>
>> By changing the weights to
>>
>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>
>> And you could set these weights in the crushmap; there would be no need for backporting.
>>
>>
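For reference, applying such a set of weights by hand boils down to one
"ceph osd crush reweight" call per OSD. A tiny sketch of that step follows;
the pairing of the weights quoted above with specific OSD ids is not spelled
out in the message, so the entries below are placeholders to be replaced with
the real mapping:

# Placeholder mapping: replace with the actual OSD -> optimized weight pairs.
optimized = {
    "osd.0": 0.6609248140022604,   # illustrative pairing only
    "osd.1": 0.9148542821020436,   # illustrative pairing only
}

for osd, weight in sorted(optimized.items()):
    print("ceph osd crush reweight %s %.4f" % (osd, weight))

This is exactly the manual step discussed above: every crushmap change would
invalidate the weights and the commands would have to be regenerated.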
>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>> On 01.05.2017 at 19:47, Loic Dachary wrote:
>>>> Hi Stefan,
>>>>
>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>
>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>
>>> I've lots of them ;-)
>>>
>>> Will send you one via private e-mail in a few minutes. 
>>>
>>> Greets,
>>> Stefan
>>>
>>>> Cheers
>>>>
>>>>>
>>>>> Greets,
>>>>> Stefan
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02  6:43               ` Stefan Priebe - Profihost AG
@ 2017-05-02  7:52                 ` Alexandre DERUMIER
  0 siblings, 0 replies; 37+ messages in thread
From: Alexandre DERUMIER @ 2017-05-02  7:52 UTC (permalink / raw)
  To: Stefan Priebe, Profihost AG; +Cc: Loic Dachary, ceph-devel

>>I'm still missing something. I'm talking about a NEWLY created cluster, which
>>never ran under hammer. It has always been running jewel.

Yes, new cluster installs don't have the latest tunables enabled by default.

I think default tunables are still firefly, even on luminous.

(I think this is for old client compatibility)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
       [not found] ` <CABZ+qqnqiUFbz=6CegW_o=2goOThpmoskDQ0oOUfE27jW0D17A@mail.gmail.com>
@ 2017-05-02 10:21   ` Loic Dachary
  2017-05-02 10:39     ` Dan van der Ster
  0 siblings, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-02 10:21 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Development



On 05/02/2017 11:35 AM, Dan van der Ster wrote:
> Hi Loic,
> 
> I'm not managing to compile this on my CentOS 7 dev box. 

What error do you get? With pip 8.1+ you should not need to compile; there are binary wheels available.

> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1

Could you also tell me the pool numbers, their pg_num and size, and the rule they use?

> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.

Thanks, I will :-)

> Cheers, Dan
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02 10:21   ` Loic Dachary
@ 2017-05-02 10:39     ` Dan van der Ster
  2017-05-06 13:21       ` Loic Dachary
  0 siblings, 1 reply; 37+ messages in thread
From: Dan van der Ster @ 2017-05-02 10:39 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
> > Hi Loic,
> >
> > I'm not managing to compile this on my CentOS 7 dev box.
>
> What error do you get? With pip 8.1+ you should not need to compile; there are binary wheels available.
>

Double requirement given: appdirs==1.4.3 (from -r
/root/git/python-crush/requirements-dev.txt (line 8)) (already in
appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
10)), name='appdirs')

[root@dvanders-work python-crush]# grep appdirs *.txt
requirements-dev.txt:appdirs==1.4.3
requirements.txt:appdirs==1.4.3



> > Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>
> Could you also tell me the pool numbers, their pg_num and size, and the rule they use?

pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
stripe_width 0
pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
hashpspool,nodelete,nopgchange,nosizechange
min_read_recency_for_promote 1 stripe_width 0
pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
hashpspool,nodelete,nopgchange,nosizechange
min_read_recency_for_promote 1 stripe_width 0
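For reference, the number of placement values each rule has to spread follows
directly from these pool parameters (pg_num x size). A quick back-of-the-envelope,
assuming nothing beyond the dump above:

# Per-rule PG replica counts derived from the pool dump above.
pools = {
    "volumes":         {"ruleset": 0, "pg_num": 4096, "size": 3},
    "images":          {"ruleset": 0, "pg_num": 2048, "size": 3},
    "cinder-critical": {"ruleset": 4, "pg_num": 8192, "size": 3},
}

per_rule = {}
for name, p in pools.items():
    per_rule.setdefault(p["ruleset"], 0)
    per_rule[p["ruleset"]] += p["pg_num"] * p["size"]
    print("%-16s %5d PGs x %d replicas = %5d" %
          (name, p["pg_num"], p["size"], p["pg_num"] * p["size"]))

for ruleset, total in sorted(per_rule.items()):
    print("crush_ruleset %d places %d PG replicas in total" % (ruleset, total))

That is, crush_ruleset 0 places 4096 + 2048 = 6144 PGs with 3 replicas each
and crush_ruleset 4 places 8192 PGs with 3 replicas each, presumably the
"data" and "critical" rules mentioned earlier; these per-pool pg_num and size
values are the --values-count and --replication-count inputs the analysis
needs.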


>
> > The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>
> Thanks, I will :-)
>

Cool, thanks!

-- Dan

> > Cheers, Dan
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-04-30 14:15 revisiting uneven CRUSH distributions Loic Dachary
  2017-05-01 17:15 ` Stefan Priebe - Profihost AG
       [not found] ` <CABZ+qqnqiUFbz=6CegW_o=2goOThpmoskDQ0oOUfE27jW0D17A@mail.gmail.com>
@ 2017-05-02 16:16 ` Loic Dachary
  2017-05-03  9:35   ` Dan van der Ster
  2017-05-05 14:49 ` Loic Dachary
  3 siblings, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-02 16:16 UTC (permalink / raw)
  To: Ceph Development

Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full? For instance, if you have racks 1, 2 and 3 with "effective" weights 0.8, 1.1 and 1, and you lose half of rack 3, then rack 2 is going to get a lot more of the data than rack 1 is.

In other words, getting an even distribution must not be done at the expense of the ability of the cluster to sustain the failure of at least one bucket in the failure domain. It is necessary to evaluate that before and after the optimization.
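A sketch of the kind of before/after check this implies, in plain Python (not
python-crush). The per-rack raw capacities, the 70% used ratio and the helper
names are made-up assumptions; only the 0.8/1.1/1.0 effective weights and the
"lose half of rack 3" scenario come from the example above:

# Toy check of the failure scenario described above: with "effective"
# weights, racks no longer take data in proportion to their raw capacity,
# so losing part of one rack can overfill another.  All numbers below are
# illustrative assumptions, not measurements.
def shares(weights):
    total = float(sum(weights.values()))
    return {rack: w / total for rack, w in weights.items()}

def fill_after_failure(effective, capacity, used_fraction,
                       failed_rack, lost_fraction):
    """Projected fill ratio of each rack after `failed_rack` loses
    `lost_fraction` of its weight and of its raw capacity."""
    data = used_fraction * sum(capacity.values())      # total data stored
    weights = dict(effective)
    weights[failed_rack] *= (1.0 - lost_fraction)
    caps = dict(capacity)
    caps[failed_rack] *= (1.0 - lost_fraction)
    return {rack: share * data / caps[rack]
            for rack, share in shares(weights).items()}

effective = {"rack1": 0.8, "rack2": 1.1, "rack3": 1.0}  # optimized weights
capacity = {"rack1": 1.0, "rack2": 1.0, "rack3": 1.0}   # assumed raw capacity
projected = fill_after_failure(effective, capacity, used_fraction=0.7,
                               failed_rack="rack3", lost_fraction=0.5)
for rack, fill in sorted(projected.items()):
    print("%s would be %.0f%% full" % (rack, fill * 100))

With those assumptions rack 2 ends up much fuller than rack 1 and close to
full, which is the asymmetry Greg is worried about; running the same
computation with the original, un-optimized weights gives the baseline to
compare against, before and after the optimization.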

On 04/30/2017 04:15 PM, Loic Dachary wrote:
> Hi,
> 
> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
> the same proportion. If an OSD is 75% full, it is expected that all
> other OSDs are also 75% full.
> 
> In reality the distribution is even only when more than 100,000 PGs
> are distributed in a pool of size 1 (i.e. no replication).
> 
> In small clusters there are a few thousand PGs and it is not enough
> to get an even distribution. Running the following with
> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
> 
>   for PGs in 1000 10000 100000 1000000 ; do
>     crush analyze --replication-count 1 \
>                   --type device \
>                   --values-count $PGs \
>                   --rule data \
>                   --crushmap tests/sample-crushmap.json
>   done
> 
> In larger clusters, even though a greater number of PGs are
> distributed, there are at most a few dozens devices per host and the
> problem remains. On a machine with 24 OSDs each expected to handle a
> few hundred PGs, a total of a few thousands PGs are distributed which
> is not enough to get an even distribution.
> 
> There is a secondary reason for the distribution to be uneven, when
> there is more than one replica. The second replica must be on a
> different device than the first replica. This conditional probability
> is not taken into account by CRUSH and would create an uneven
> distribution if more than 10,000 PGs were distributed per OSD[2]. But
> a given OSD can only handle a few hundred PGs and this conditional
> probability bias is dominated by the uneven distribution caused by the
> low number of PGs.
> 
> The uneven CRUSH distributions are always caused by a low number of
> samples, even in large clusters. Since this noise (i.e. the difference
> between the desired distribution and the actual distribution) is
> random, it cannot be fixed by optimizations methods.  The
> Nedler-Mead[3] simplex converges to a local minimum that is far from
> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
> fails to find a gradient that would allow it to converge faster. And
> even if it did, the local minimum found would be as often wrong as
> with Nedler-Mead, only it would go faster. A least mean squares
> filter[5] is equally unable to suppress the noise created by the
> uneven distribution because no coefficients can model a random noise.
> 
> With that in mind, I implemented a simple optimization algorithm[6]
> which was first suggested by Thierry Delamare a few weeks ago. It goes
> like this:
> 
>     - Distribute the desired number of PGs[7]
>     - Subtract 1% of the weight of the OSD that is the most over used
>     - Add the subtracted weight to the OSD that is the most under used
>     - Repeat until the Kullback–Leibler divergence[8] is small enough
> 
> Quoting Adam Kupczyk, this works because:
> 
>   "...CRUSH is not random proces at all, it behaves in numerically
>    stable way.  Specifically, if we increase weight on one node, we
>    will get more PGs on this node and less on every other node:
>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
> 
> A nice side effect of this optimization algorithm is that it does not
> change the weight of the bucket containing the items being
> optimized. It is local to a bucket with no influence on the other
> parts of the crushmap (modulo the conditional probability bias).
> 
> In all tests the situation improves at least by an order of
> magnitude. For instance when there is a 30% difference between two
> OSDs, it is down to less than 3% after optimization.
> 
> The tests for the optimization method can be run with
> 
>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
> 
> If anyone think of a reason why this algorithm won't work in some
> cases, please speak up :-)
> 
> Cheers
> 
> [1] python-crush http://crush.readthedocs.io/
> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02 16:16 ` Loic Dachary
@ 2017-05-03  9:35   ` Dan van der Ster
  2017-05-03 16:50     ` Loic Dachary
  2017-05-04  1:14     ` Gregory Farnum
  0 siblings, 2 replies; 37+ messages in thread
From: Dan van der Ster @ 2017-05-03  9:35 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Tue, May 2, 2017 at 6:16 PM, Loic Dachary <loic@dachary.org> wrote:
> Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full ? For instance if you have racks 1, 2, 3 with "effective" weights .8, 1.1, 1 and you lose half of rack 3 then rack 2 is going to get a lot more of the data than rack 1 is.
>

Is this really a problem? In your example, the rack weights are
tweaked to correct the "rate" at which CRUSH is assigning PGs to each
rack. If you fail half of rack 3, then your effective weights will
continue to ensure that the moved PGs get equally assigned to racks 1
and 2.


On the other hand, one problem I see with your new approach is that it
does not address the secondary multi-pick problem, which is that the
ratio of 1st, 2nd, 3rd, etc... replicas/stripes is not equal for the
lower weighted OSDs.

-- Dan


> In other words, getting an even distribution must not be done at the expense of the ability of the cluster to sustain the failure of at least one bucket in the failure domain. It is necessary to evaluate that before and after the optimization.
>
> On 04/30/2017 04:15 PM, Loic Dachary wrote:
>> Hi,
>>
>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>> the same proportion. If an OSD is 75% full, it is expected that all
>> other OSDs are also 75% full.
>>
>> In reality the distribution is even only when more than 100,000 PGs
>> are distributed in a pool of size 1 (i.e. no replication).
>>
>> In small clusters there are a few thousands PGs and it is not enough
>> to get an even distribution. Running the following with
>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>
>>   for PGs in 1000 10000 100000 1000000 ; do
>>     crush analyze --replication-count 1 \
>>                   --type device \
>>                   --values-count $PGs \
>>                   --rule data \
>>                   --crushmap tests/sample-crushmap.json
>>   done
>>
>> In larger clusters, even though a greater number of PGs are
>> distributed, there are at most a few dozens devices per host and the
>> problem remains. On a machine with 24 OSDs each expected to handle a
>> few hundred PGs, a total of a few thousands PGs are distributed which
>> is not enough to get an even distribution.
>>
>> There is a secondary reason for the distribution to be uneven, when
>> there is more than one replica. The second replica must be on a
>> different device than the first replica. This conditional probability
>> is not taken into account by CRUSH and would create an uneven
>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>> a given OSD can only handle a few hundred PGs and this conditional
>> probability bias is dominated by the uneven distribution caused by the
>> low number of PGs.
>>
>> The uneven CRUSH distributions are always caused by a low number of
>> samples, even in large clusters. Since this noise (i.e. the difference
>> between the desired distribution and the actual distribution) is
>> random, it cannot be fixed by optimizations methods.  The
>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>> fails to find a gradient that would allow it to converge faster. And
>> even if it did, the local minimum found would be as often wrong as
>> with Nedler-Mead, only it would go faster. A least mean squares
>> filter[5] is equally unable to suppress the noise created by the
>> uneven distribution because no coefficients can model a random noise.
>>
>> With that in mind, I implemented a simple optimization algorithm[6]
>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>> like this:
>>
>>     - Distribute the desired number of PGs[7]
>>     - Subtract 1% of the weight of the OSD that is the most over used
>>     - Add the subtracted weight to the OSD that is the most under used
>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>
>> Quoting Adam Kupczyk, this works because:
>>
>>   "...CRUSH is not random proces at all, it behaves in numerically
>>    stable way.  Specifically, if we increase weight on one node, we
>>    will get more PGs on this node and less on every other node:
>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>
>> A nice side effect of this optimization algorithm is that it does not
>> change the weight of the bucket containing the items being
>> optimized. It is local to a bucket with no influence on the other
>> parts of the crushmap (modulo the conditional probability bias).
>>
>> In all tests the situation improves at least by an order of
>> magnitude. For instance when there is a 30% difference between two
>> OSDs, it is down to less than 3% after optimization.
>>
>> The tests for the optimization method can be run with
>>
>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>
>> If anyone think of a reason why this algorithm won't work in some
>> cases, please speak up :-)
>>
>> Cheers
>>
>> [1] python-crush http://crush.readthedocs.io/
>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-03  9:35   ` Dan van der Ster
@ 2017-05-03 16:50     ` Loic Dachary
  2017-05-03 17:59       ` Dan van der Ster
  2017-05-04  1:14     ` Gregory Farnum
  1 sibling, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-03 16:50 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Development



On 05/03/2017 11:35 AM, Dan van der Ster wrote:
> On Tue, May 2, 2017 at 6:16 PM, Loic Dachary <loic@dachary.org> wrote:
>> Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full ? For instance if you have racks 1, 2, 3 with "effective" weights .8, 1.1, 1 and you lose half of rack 3 then rack 2 is going to get a lot more of the data than rack 1 is.
>>
> 
> Is this really a problem? In your example, the rack weights are
> tweaked to correct the "rate" at which CRUSH is assigning PGs to each
> rack. If you fail half of rack 3, then your effective weights will
> continue to ensure that the moved PGs get equally assigned to racks 1
> and 2.

It should be possible to verify if the optimization makes things worse in case of a failure, just by running a simulation with every failure scenario. If the worst scenario (i.e. the one with the highest overfull OSD) before optimization is better than the worst scenario after optimization, the optimization can be discarded.
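
To make that concrete, here is a rough sketch of such a check (my own illustration: simulate() is a hypothetical callable, not part of the python-crush API, and it assumes equal-weight OSDs for simplicity):

    # Compare the worst-case overfull OSD across every single-failure-domain
    # scenario. simulate(failed) is assumed to return {osd: pg_count} for the
    # surviving OSDs when the buckets in `failed` are marked out.
    def worst_overfull(simulate, failure_domains):
        worst = 0.0
        for bucket in failure_domains:
            counts = simulate({bucket})
            expected = sum(counts.values()) / float(len(counts))
            worst = max(worst, max(counts.values()) / expected - 1.0)
        return worst

    # keep the optimized weights only if the worst case does not get worse:
    # keep = worst_overfull(sim_after, racks) <= worst_overfull(sim_before, racks)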

> On the other hand, one problem I see with your new approach is that it
> does not address the secondary multi-pick problem, which is that the
> ratio of 1st, 2nd, 3rd, etc... replicas/stripes is not equal for the
> lower weighted OSDs.

Note that in pools with less than 10,000 PGs the multi-pick problem is not visible: there are too few samples and the uneven distribution is dominated by the low-sample noise.

However, I think the proposed algorithm could also work by tweaking the weights of each replica (but I only thought about it right now so...):

  first pick uses the target weights, say 1 1 1 1 10, always
  second pick uses the target weights the first time
  run a simulation and lower the weight of the item that is the most over full and increase the weight of the item that is the most under full
  repeat until the distribution is even
  do the same for the third pick etc.

If we do that we have the desired property of a distribution that is stable when we change the size of the pool. The key difference with the previous approaches is that the weights are adjusted based on repeated simulations instead of maths. For every pool we know exactly where each PG is placed by Ceph using CRUSH.

Does that make sense or am I missing something ?
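
A rough sketch of that per-pick loop, to make sure we are talking about the same thing (simulate_pick() is a hypothetical helper, not a python-crush call, and the step/tolerance values are arbitrary; in practice a cap on iterations would also be needed):

    # Adjust a per-pick weight set by repeated simulation: take a little weight
    # away from the most over-full item and give it to the most under-full one,
    # until the distribution for that pick is even enough.
    def optimize_pick(pick, target_weights, simulate_pick,
                      step=0.01, tolerance=0.01):
        weights = dict(target_weights)          # start from the target weights
        total_w = float(sum(target_weights.values()))
        while True:
            counts = simulate_pick(pick, weights)   # {item: count} for this pick
            total_c = float(sum(counts.values()))
            def overuse(item):
                # deviation from the share the item should get per its target weight
                return counts[item] / (total_c * target_weights[item] / total_w) - 1.0
            over = max(counts, key=overuse)         # most over-full item
            under = min(counts, key=overuse)        # most under-full item
            if overuse(over) < tolerance:
                return weights
            delta = weights[over] * step
            weights[over] -= delta
            weights[under] += delta

    # the first pick always uses the target weights; each later pick gets its
    # own adjusted set, e.g. weights_2nd = optimize_pick(2, target, simulate_pick)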

> 
> -- Dan
> 
> 
>> In other words, getting an even distribution must not be done at the expense of the ability of the cluster to sustain the failure of at least one bucket in the failure domain. It is necessary to evaluate that before and after the optimization.
>>
>> On 04/30/2017 04:15 PM, Loic Dachary wrote:
>>> Hi,
>>>
>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>> the same proportion. If an OSD is 75% full, it is expected that all
>>> other OSDs are also 75% full.
>>>
>>> In reality the distribution is even only when more than 100,000 PGs
>>> are distributed in a pool of size 1 (i.e. no replication).
>>>
>>> In small clusters there are a few thousands PGs and it is not enough
>>> to get an even distribution. Running the following with
>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>
>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>     crush analyze --replication-count 1 \
>>>                   --type device \
>>>                   --values-count $PGs \
>>>                   --rule data \
>>>                   --crushmap tests/sample-crushmap.json
>>>   done
>>>
>>> In larger clusters, even though a greater number of PGs are
>>> distributed, there are at most a few dozens devices per host and the
>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>> is not enough to get an even distribution.
>>>
>>> There is a secondary reason for the distribution to be uneven, when
>>> there is more than one replica. The second replica must be on a
>>> different device than the first replica. This conditional probability
>>> is not taken into account by CRUSH and would create an uneven
>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>> a given OSD can only handle a few hundred PGs and this conditional
>>> probability bias is dominated by the uneven distribution caused by the
>>> low number of PGs.
>>>
>>> The uneven CRUSH distributions are always caused by a low number of
>>> samples, even in large clusters. Since this noise (i.e. the difference
>>> between the desired distribution and the actual distribution) is
>>> random, it cannot be fixed by optimizations methods.  The
>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>> fails to find a gradient that would allow it to converge faster. And
>>> even if it did, the local minimum found would be as often wrong as
>>> with Nedler-Mead, only it would go faster. A least mean squares
>>> filter[5] is equally unable to suppress the noise created by the
>>> uneven distribution because no coefficients can model a random noise.
>>>
>>> With that in mind, I implemented a simple optimization algorithm[6]
>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>> like this:
>>>
>>>     - Distribute the desired number of PGs[7]
>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>     - Add the subtracted weight to the OSD that is the most under used
>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>
>>> Quoting Adam Kupczyk, this works because:
>>>
>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>    stable way.  Specifically, if we increase weight on one node, we
>>>    will get more PGs on this node and less on every other node:
>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>
>>> A nice side effect of this optimization algorithm is that it does not
>>> change the weight of the bucket containing the items being
>>> optimized. It is local to a bucket with no influence on the other
>>> parts of the crushmap (modulo the conditional probability bias).
>>>
>>> In all tests the situation improves at least by an order of
>>> magnitude. For instance when there is a 30% difference between two
>>> OSDs, it is down to less than 3% after optimization.
>>>
>>> The tests for the optimization method can be run with
>>>
>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>
>>> If anyone think of a reason why this algorithm won't work in some
>>> cases, please speak up :-)
>>>
>>> Cheers
>>>
>>> [1] python-crush http://crush.readthedocs.io/
>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-03 16:50     ` Loic Dachary
@ 2017-05-03 17:59       ` Dan van der Ster
  2017-05-03 18:41         ` Loic Dachary
  0 siblings, 1 reply; 37+ messages in thread
From: Dan van der Ster @ 2017-05-03 17:59 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Wed, May 3, 2017 at 6:50 PM, Loic Dachary <loic@dachary.org> wrote:
>
>
> On 05/03/2017 11:35 AM, Dan van der Ster wrote:
>> On Tue, May 2, 2017 at 6:16 PM, Loic Dachary <loic@dachary.org> wrote:
>>> Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full ? For instance if you have racks 1, 2, 3 with "effective" weights .8, 1.1, 1 and you lose half of rack 3 then rack 2 is going to get a lot more of the data than rack 1 is.
>>>
>>
>> Is this really a problem? In your example, the rack weights are
>> tweaked to correct the "rate" at which CRUSH is assigning PGs to each
>> rack. If you fail half of rack 3, then your effective weights will
>> continue to ensure that the moved PGs get equally assigned to racks 1
>> and 2.
>
> It should be possible to verify if the opimitization makes things worse in case of a failure, just by running a simulation with every failure scenario. If the worst scenario (i.e. the one with the highest overfull OSD) before optimization is better than the worst scenario after optimization, the opimization can be discarded.
>

OK, worth verifying like you said.

>> On the other hand, one problem I see with your new approach is that it
>> does not address the secondary multi-pick problem, which is that the
>> ratio of 1st, 2nd, 3rd, etc... replicas/stripes is not equal for the
>> lower weighted OSDs.
>
> Note that in pools with less than 10,000 PGs the multi-pick problem does not happen: there are too few samples and the uneven distribution is dominated by that problem.

OK perhaps you're right. The scenario I try to consider is for very
wide erasure coding -- say 8+4 or wider -- which has the same effect
as a size=12 replication pool.
I should use your simulator to provide real examples, I know.

>
> However, I think the proposed algorithm could also work by tweaking the weights of each replica (but I only thought about it right now so...):
>
>   first pick uses the target weights, say 1 1 1 1 10, always
>   second pick uses the target weights the first time
>   run a simulation and lower the weight of the item that is the most over full and increase the weight of the item that is the most under full
>   repeat until the distribution is even
>   do the same for the third pick etc.
>
> If we do that we have the desired property of a distribution that is stable when we change the size of the pool. The key difference with the previous approaches is that the weights are adjusted based on repeated simulations instead of maths. For every pool know the exact value of each PG placed by Ceph using CRUSH.
>
> Does that make sense or am I missing something ?

Sounds worth a try.

Thinking (on my toes) about this a bit more, assuming that this
iterative algorithm will bear fruit, I could imagine an interface
like:

ceph osd crush reweight-by-pg <bucket> <num iterations>

Each iteration does what you described: subtract a small amount of
(crush) weight from the fullest OSD, add that (crush) weight back to
the emptiest. [1]
On a production cluster with lots of data, the operator could minimize
data movement by invoking just a small number of iterations at once.
New clusters could run, say, a million iterations to quickly find the
optimal weights.
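
In pseudo-Python the primitive would be something like this (again a sketch with a hypothetical simulate() helper, not real ceph or python-crush calls):

    # One bounded run: exactly num_iterations steps of "move a little weight
    # from the fullest OSD to the emptiest", so the operator controls how much
    # data can move at once. For OSDs of different sizes, "fullest" should be
    # measured relative to each OSD's target share rather than raw PG counts.
    def reweight_by_pg(weights, simulate, num_iterations, step=0.01):
        for _ in range(num_iterations):
            counts = simulate(weights)              # {osd: pg_count}
            fullest = max(counts, key=counts.get)
            emptiest = min(counts, key=counts.get)
            delta = weights[fullest] * step
            weights[fullest] -= delta
            weights[emptiest] += delta
        return weights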

ceph-mgr could play a role by periodically invoking "ceph osd crush
reweight-by-pg" -- a cron of sorts -- or it could invoke that based on
some conditions related to cluster IO activity.

Other random question [2].

Cheers, Dan

[1] implementation question: do you plan to continue storing and
displaying the original crush weight (based on disk size), or would
this optimization algorithm overwrite that with the tuned value? IOW,
will we store/display "crush weight" and "effective crush weight"
separately?

[2] implementation question: could we store and use a unique
"effective crush weight" set for each pool, rather than just once for
the whole cluster? This way, newly created pools could be balanced
perfectly using this algorithm (involving zero data movement), and
legacy pools could be left imbalanced (to be slowly optimized over
several days/weeks).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-03 17:59       ` Dan van der Ster
@ 2017-05-03 18:41         ` Loic Dachary
  0 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-03 18:41 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Development



On 05/03/2017 07:59 PM, Dan van der Ster wrote:
> On Wed, May 3, 2017 at 6:50 PM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>> On 05/03/2017 11:35 AM, Dan van der Ster wrote:
>>> On Tue, May 2, 2017 at 6:16 PM, Loic Dachary <loic@dachary.org> wrote:
>>>> Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full ? For instance if you have racks 1, 2, 3 with "effective" weights .8, 1.1, 1 and you lose half of rack 3 then rack 2 is going to get a lot more of the data than rack 1 is.
>>>>
>>>
>>> Is this really a problem? In your example, the rack weights are
>>> tweaked to correct the "rate" at which CRUSH is assigning PGs to each
>>> rack. If you fail half of rack 3, then your effective weights will
>>> continue to ensure that the moved PGs get equally assigned to racks 1
>>> and 2.
>>
>> It should be possible to verify if the opimitization makes things worse in case of a failure, just by running a simulation with every failure scenario. If the worst scenario (i.e. the one with the highest overfull OSD) before optimization is better than the worst scenario after optimization, the opimization can be discarded.
>>
> 
> OK, worth verifying like you said.
> 
>>> On the other hand, one problem I see with your new approach is that it
>>> does not address the secondary multi-pick problem, which is that the
>>> ratio of 1st, 2nd, 3rd, etc... replicas/stripes is not equal for the
>>> lower weighted OSDs.
>>
>> Note that in pools with less than 10,000 PGs the multi-pick problem does not happen: there are too few samples and the uneven distribution is dominated by that problem.
> 
> OK perhaps you're right. The scenario I try to consider is for very
> wide erasure coding -- say 8+4 or wider -- which has the same effect
> as a size=12 replication pool.
> I should use your simulator to provide real examples, I know.
> 
>>
>> However, I think the proposed algorithm could also work by tweaking the weights of each replica (but I only thought about it right now so...):
>>
>>   first pick uses the target weights, say 1 1 1 1 10, always
>>   second pick uses the target weights the first time
>>   run a simulation and lower the weight of the item that is the most over full and increase the weight of the item that is the most under full
>>   repeat until the distribution is even
>>   do the same for the third pick etc.
>>
>> If we do that we have the desired property of a distribution that is stable when we change the size of the pool. The key difference with the previous approaches is that the weights are adjusted based on repeated simulations instead of maths. For every pool know the exact value of each PG placed by Ceph using CRUSH.
>>
>> Does that make sense or am I missing something ?
> 
> Sounds worth a try.

Ok. I'll work on that since there does not seem to be any stupid / obvious blocker. 

> 
> Thinking (on my toes) about this a bit more, assuming that this
> iterative algorithm will bear fruit, I could imagine an interface
> like:
> 
> ceph osd crush reweight-by-pg <bucket> <num iterations>
> 
> Each iteration does what you described: subtract a small amount of
> (crush) weight from the fullest OSD, add that (crush) weight back to
> the emptiest. [1]
> On a production cluster with lots of data, the operator could minimize
> data movement by invoking just a small number of iterations at once.
> New clusters could run, say, a million iterations to quickly find the
> optimal weights.
> 
> ceph-mgr could play a role by periodically invoking "ceph osd crush
> reweight-by-pg" -- a cron of sorts -- or it could invoke that based on
> some conditions related to cluster IO activity.
> 
> Other random question [2].
> 
> Cheers, Dan
> 
> [1] implementation question: do you plan to continue storing and
> displaying the original crush weight (based on disk size), or would
> this optimization algorithm overwrite that with the tuned value? IOW,
> will we store/display "crush weight" and "effective crush weight"
> separately?

The original weights stay as they are, yes. The optimization algorithm will modify weights that are hidden and can only be seen in the decompiled crushmap.

> 
> [2] implementation question: could we store and use a unique
> "effective crush weight" set for each pool, rather than just once for
> the whole cluster? This way, newly created pools could be balanced
> perfectly using this algorithm (involving zero data movement), and
> legacy pools could be left imbalanced (to be slowly optimized over
> several days/weeks).

Yes, there is a separate set of effective crush weights per pool (we say "weight set" instead of "effective crush weight"). This is already in Luminous. See https://github.com/ceph/ceph/pull/14486/files#diff-0057f181edd3554c94feabcb1586cbd7R98 for an example of a weight set applied to pool 6.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-03  9:35   ` Dan van der Ster
  2017-05-03 16:50     ` Loic Dachary
@ 2017-05-04  1:14     ` Gregory Farnum
  1 sibling, 0 replies; 37+ messages in thread
From: Gregory Farnum @ 2017-05-04  1:14 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Loic Dachary, Ceph Development

On Wed, May 3, 2017 at 2:35 AM, Dan van der Ster <dan@vanderster.com> wrote:
> On Tue, May 2, 2017 at 6:16 PM, Loic Dachary <loic@dachary.org> wrote:
>> Greg raised the following problem today: what if, as a consequence of changing the weights, the failure of a host/rack (whatever the failure domain is) makes the cluster full ? For instance if you have racks 1, 2, 3 with "effective" weights .8, 1.1, 1 and you lose half of rack 3 then rack 2 is going to get a lot more of the data than rack 1 is.
>>
>
> Is this really a problem? In your example, the rack weights are
> tweaked to correct the "rate" at which CRUSH is assigning PGs to each
> rack. If you fail half of rack 3, then your effective weights will
> continue to ensure that the moved PGs get equally assigned to racks 1
> and 2.

This intuition is *probably* wrong. In general, we expect data to be
redistributed according to the CRUSH weights in use. In specific
instances that turns out not to be the case — and because it's
pseudo-random there's a statistical variance from perfect balance
anyway — but there is no reason to expect that the "ideal" weights
under CRUSH map {X} will lead to anything like an ideal balance under
map {Y}.

So the problem isn't just that if you have set effective rack weights
of .8, 1.1, and 1 and then reduce the third rack to .5, you might fill
up a particular OSD. It's that the statistical guarantees CRUSH
provides about how much data moves to where are broken (due to us
changing the weights) — we expect rack 2 to get (1.1/0.8 =) 37.5% more
data than rack 1, despite them really being the same size.

As I said to Loïc, there may be space to play around with this in a
practical sense. But we would want to somehow avoid moving data from
1->2 while re-replicating data into the correct locations, and without
putting data into the wrong place to start. I don't know if there's a
way to resolve that.


Do those issues make sense?
-Greg

> On the other hand, one problem I see with your new approach is that it
> does not address the secondary multi-pick problem, which is that the
> ratio of 1st, 2nd, 3rd, etc... replicas/stripes is not equal for the
> lower weighted OSDs.
>
> -- Dan
>
>
>> In other words, getting an even distribution must not be done at the expense of the ability of the cluster to sustain the failure of at least one bucket in the failure domain. It is necessary to evaluate that before and after the optimization.
>>
>> On 04/30/2017 04:15 PM, Loic Dachary wrote:
>>> Hi,
>>>
>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>> the same proportion. If an OSD is 75% full, it is expected that all
>>> other OSDs are also 75% full.
>>>
>>> In reality the distribution is even only when more than 100,000 PGs
>>> are distributed in a pool of size 1 (i.e. no replication).
>>>
>>> In small clusters there are a few thousands PGs and it is not enough
>>> to get an even distribution. Running the following with
>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>
>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>     crush analyze --replication-count 1 \
>>>                   --type device \
>>>                   --values-count $PGs \
>>>                   --rule data \
>>>                   --crushmap tests/sample-crushmap.json
>>>   done
>>>
>>> In larger clusters, even though a greater number of PGs are
>>> distributed, there are at most a few dozens devices per host and the
>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>> is not enough to get an even distribution.
>>>
>>> There is a secondary reason for the distribution to be uneven, when
>>> there is more than one replica. The second replica must be on a
>>> different device than the first replica. This conditional probability
>>> is not taken into account by CRUSH and would create an uneven
>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>> a given OSD can only handle a few hundred PGs and this conditional
>>> probability bias is dominated by the uneven distribution caused by the
>>> low number of PGs.
>>>
>>> The uneven CRUSH distributions are always caused by a low number of
>>> samples, even in large clusters. Since this noise (i.e. the difference
>>> between the desired distribution and the actual distribution) is
>>> random, it cannot be fixed by optimizations methods.  The
>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>> fails to find a gradient that would allow it to converge faster. And
>>> even if it did, the local minimum found would be as often wrong as
>>> with Nedler-Mead, only it would go faster. A least mean squares
>>> filter[5] is equally unable to suppress the noise created by the
>>> uneven distribution because no coefficients can model a random noise.
>>>
>>> With that in mind, I implemented a simple optimization algorithm[6]
>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>> like this:
>>>
>>>     - Distribute the desired number of PGs[7]
>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>     - Add the subtracted weight to the OSD that is the most under used
>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>
>>> Quoting Adam Kupczyk, this works because:
>>>
>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>    stable way.  Specifically, if we increase weight on one node, we
>>>    will get more PGs on this node and less on every other node:
>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>
>>> A nice side effect of this optimization algorithm is that it does not
>>> change the weight of the bucket containing the items being
>>> optimized. It is local to a bucket with no influence on the other
>>> parts of the crushmap (modulo the conditional probability bias).
>>>
>>> In all tests the situation improves at least by an order of
>>> magnitude. For instance when there is a 30% difference between two
>>> OSDs, it is down to less than 3% after optimization.
>>>
>>> The tests for the optimization method can be run with
>>>
>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>
>>> If anyone think of a reason why this algorithm won't work in some
>>> cases, please speak up :-)
>>>
>>> Cheers
>>>
>>> [1] python-crush http://crush.readthedocs.io/
>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-04-30 14:15 revisiting uneven CRUSH distributions Loic Dachary
                   ` (2 preceding siblings ...)
  2017-05-02 16:16 ` Loic Dachary
@ 2017-05-05 14:49 ` Loic Dachary
  3 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-05 14:49 UTC (permalink / raw)
  To: Ceph Development

Hi,

The proposed algorithm fixes the probability bias when we have 7 rooms defined as the failure domain and one of them is bigger than the others (six with weight ~1000, one with weight ~2000), each hosting about 50,000 PGs. Before optimization we have:

             ~id~     ~weight~  ~objects~  ~over/under used %~
~name~                                                        
cloud6-1434    -7  1000.239929      39529             5.395913
cloud6-1463    -8  1000.239929      39466             5.227936
cloud6-1432    -5  1000.079895      39378             5.010104
cloud6-1430    -3  1000.079895      39232             4.620763
cloud6-1431    -4  1000.079895      39207             4.554095
cloud6-1433    -6  1000.079895      39112             4.300756
cloud6-1429    -2  2000.000000      64076           -14.556796
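
For reference, the ~over/under used %~ column is the deviation of each room from the share its weight entitles it to. A quick way to recompute it (my own sketch, not python-crush code):

    # over/under used % = actual / (total_objects * weight / total_weight) - 1
    def over_under_percent(weights, objects):
        total_w = float(sum(weights.values()))
        total_o = float(sum(objects.values()))
        return {name: (objects[name] / (total_o * w / total_w) - 1.0) * 100.0
                for name, w in weights.items()}

    # e.g. cloud6-1429 holds 64076 of the 300000 objects but carries
    # 2000/8000.8 of the weight, hence about -14.6% (under used).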

Worst case when a failure happens:

        ~over used %~
~type~               
device      10.120912
room         7.507467
root         0.000000

after optimization we have:

             ~id~  ~weight~  ~objects~  ~over/under used %~
~name~                                                     
cloud6-1463    -8   1000.24      37648                 0.38
cloud6-1434    -7   1000.24      37560                 0.15
cloud6-1430    -3   1000.08      37531                 0.08
cloud6-1429    -2   2000.00      75028                 0.05
cloud6-1432    -5   1000.08      37506                 0.02
cloud6-1433    -6   1000.08      37463                -0.10
cloud6-1431    -4   1000.08      37264                -0.63

Worst case when a failure happens:

        ~over used %~
~type~               
device           4.31
room             2.66
root             0.00

This is encouraging and I'll add more test cases including the crush maps I have. If anyone has an atypical crushmap to share, it would be great to add it to the list ;-)

Cheers

P.S. The corresponding code to run these tests is at http://libcrush.org/dachary/python-crush/commits/wip-fix-2

On 04/30/2017 04:15 PM, Loic Dachary wrote:
> Hi,
> 
> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
> the same proportion. If an OSD is 75% full, it is expected that all
> other OSDs are also 75% full.
> 
> In reality the distribution is even only when more than 100,000 PGs
> are distributed in a pool of size 1 (i.e. no replication).
> 
> In small clusters there are a few thousands PGs and it is not enough
> to get an even distribution. Running the following with
> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
> 
>   for PGs in 1000 10000 100000 1000000 ; do
>     crush analyze --replication-count 1 \
>                   --type device \
>                   --values-count $PGs \
>                   --rule data \
>                   --crushmap tests/sample-crushmap.json
>   done
> 
> In larger clusters, even though a greater number of PGs are
> distributed, there are at most a few dozens devices per host and the
> problem remains. On a machine with 24 OSDs each expected to handle a
> few hundred PGs, a total of a few thousands PGs are distributed which
> is not enough to get an even distribution.
> 
> There is a secondary reason for the distribution to be uneven, when
> there is more than one replica. The second replica must be on a
> different device than the first replica. This conditional probability
> is not taken into account by CRUSH and would create an uneven
> distribution if more than 10,000 PGs were distributed per OSD[2]. But
> a given OSD can only handle a few hundred PGs and this conditional
> probability bias is dominated by the uneven distribution caused by the
> low number of PGs.
> 
> The uneven CRUSH distributions are always caused by a low number of
> samples, even in large clusters. Since this noise (i.e. the difference
> between the desired distribution and the actual distribution) is
> random, it cannot be fixed by optimizations methods.  The
> Nedler-Mead[3] simplex converges to a local minimum that is far from
> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
> fails to find a gradient that would allow it to converge faster. And
> even if it did, the local minimum found would be as often wrong as
> with Nedler-Mead, only it would go faster. A least mean squares
> filter[5] is equally unable to suppress the noise created by the
> uneven distribution because no coefficients can model a random noise.
> 
> With that in mind, I implemented a simple optimization algorithm[6]
> which was first suggested by Thierry Delamare a few weeks ago. It goes
> like this:
> 
>     - Distribute the desired number of PGs[7]
>     - Subtract 1% of the weight of the OSD that is the most over used
>     - Add the subtracted weight to the OSD that is the most under used
>     - Repeat until the Kullback–Leibler divergence[8] is small enough
> 
> Quoting Adam Kupczyk, this works because:
> 
>   "...CRUSH is not random proces at all, it behaves in numerically
>    stable way.  Specifically, if we increase weight on one node, we
>    will get more PGs on this node and less on every other node:
>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
> 
> A nice side effect of this optimization algorithm is that it does not
> change the weight of the bucket containing the items being
> optimized. It is local to a bucket with no influence on the other
> parts of the crushmap (modulo the conditional probability bias).
> 
> In all tests the situation improves at least by an order of
> magnitude. For instance when there is a 30% difference between two
> OSDs, it is down to less than 3% after optimization.
> 
> The tests for the optimization method can be run with
> 
>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
> 
> If anyone think of a reason why this algorithm won't work in some
> cases, please speak up :-)
> 
> Cheers
> 
> [1] python-crush http://crush.readthedocs.io/
> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02 10:39     ` Dan van der Ster
@ 2017-05-06 13:21       ` Loic Dachary
       [not found]         ` <CAAXqJ+oTkwT4AP6U5BUBVLbkTPwcwo8rnK1ng-p3UroEHBDV2A@mail.gmail.com>
                           ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-06 13:21 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Development

Hi Dan,

The optimization works for pool 5 which is using the "data" rule. It's an extreme case because there are very few PGs per OSD (about 6) and, as expected, the distribution is very uneven; some OSDs even have no PG at all (the list is abbreviated but you can find it in full at https://paste2.org/3j1hbd3d):

          ~id~  ~weight~      ~PGs~  ~over/under used %~
~name~                                                  
osd.104    104  5.459991         17           124.674479
osd.704    704  5.459991         16           111.458333
osd.75      75  5.459991         16           111.458333
...
osd.25      25  5.459991          2           -73.567708
osd.336    336  5.459991          1           -86.783854
osd.673    673  5.459991          1           -86.783854
osd.496    496  5.459991          0          -100.000000
osd.646    646  5.459991          0          -100.000000

The failure domain (rack) is only ~5% over / under full. But this still has an impact on the uneven distribution of the OSDs within each rack.

        ~id~    ~weight~      ~PGs~  ~over/under used %~
~name~                                                  
RA13      -9  786.238770       1150             5.545609
RA01     -72  911.818573       1270             0.506019
RA09      -6  917.278564       1274             0.222439
RA17     -14  900.898590       1238            -0.838857
RA05      -4  917.278564       1212            -4.654948

After optimization the distribution of the OSDs is still uneven, even though it improved significantly:

          ~id~  ~weight~      ~PGs~  ~over/under used %~
~name~                                                  
osd.252    252  5.459991         10            32.161458
osd.330    330  5.459991         10            32.161458
osd.571    571  5.459991          9            18.945312
...
osd.261    261  5.459991          5           -33.919271
osd.210    210  5.459991          5           -33.919271
osd.1269  1269  5.459991          4           -47.135417

and the uneven distribution of the racks dropped under 0.5%, which is better:

        ~id~    ~weight~      ~PGs~  ~over/under used %~
~name~                                                  
RA17     -14  900.898590       1252             0.282513
RA01     -72  911.818573       1267             0.268603
RA13      -9  786.238770       1089            -0.052897
RA05      -4  917.278564       1269            -0.170898
RA09      -6  917.278564       1267            -0.328234

I'll keep working on optimizing the two other pools. Don't hesitate to tell me if I'm going in the wrong direction.

Cheers


On 05/02/2017 12:39 PM, Dan van der Ster wrote:
> On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
>> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>>> Hi Loic,
>>>
>>> I'm not managing to compile this on my CentOS 7 dev box.
>>
>> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>>
> 
> Double requirement given: appdirs==1.4.3 (from -r
> /root/git/python-crush/requirements-dev.txt (line 8)) (already in
> appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
> 10)), name='appdirs')
> 
> [root@dvanders-work python-crush]# grep appdirs *.txt
> requirements-dev.txt:appdirs==1.4.3
> requirements.txt:appdirs==1.4.3
> 
> 
> 
>>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>>
>> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
> 
> pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
> nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
> stripe_width 0
> pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
> hashpspool,nodelete,nopgchange,nosizechange
> min_read_recency_for_promote 1 stripe_width 0
> pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
> hashpspool,nodelete,nopgchange,nosizechange
> min_read_recency_for_promote 1 stripe_width 0
> 
> 
>>
>>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>>
>> Thanks, I will :-)
>>
> 
> Cool, thanks!
> 
> -- Dan
> 
>>> Cheers, Dan
>>>
>>>
>>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>     Hi,
>>>
>>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>     the same proportion. If an OSD is 75% full, it is expected that all
>>>     other OSDs are also 75% full.
>>>
>>>     In reality the distribution is even only when more than 100,000 PGs
>>>     are distributed in a pool of size 1 (i.e. no replication).
>>>
>>>     In small clusters there are a few thousands PGs and it is not enough
>>>     to get an even distribution. Running the following with
>>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>
>>>       for PGs in 1000 10000 100000 1000000 ; do
>>>         crush analyze --replication-count 1 \
>>>                       --type device \
>>>                       --values-count $PGs \
>>>                       --rule data \
>>>                       --crushmap tests/sample-crushmap.json
>>>       done
>>>
>>>     In larger clusters, even though a greater number of PGs are
>>>     distributed, there are at most a few dozens devices per host and the
>>>     problem remains. On a machine with 24 OSDs each expected to handle a
>>>     few hundred PGs, a total of a few thousands PGs are distributed which
>>>     is not enough to get an even distribution.
>>>
>>>     There is a secondary reason for the distribution to be uneven, when
>>>     there is more than one replica. The second replica must be on a
>>>     different device than the first replica. This conditional probability
>>>     is not taken into account by CRUSH and would create an uneven
>>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>     a given OSD can only handle a few hundred PGs and this conditional
>>>     probability bias is dominated by the uneven distribution caused by the
>>>     low number of PGs.
>>>
>>>     The uneven CRUSH distributions are always caused by a low number of
>>>     samples, even in large clusters. Since this noise (i.e. the difference
>>>     between the desired distribution and the actual distribution) is
>>>     random, it cannot be fixed by optimizations methods.  The
>>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>     fails to find a gradient that would allow it to converge faster. And
>>>     even if it did, the local minimum found would be as often wrong as
>>>     with Nedler-Mead, only it would go faster. A least mean squares
>>>     filter[5] is equally unable to suppress the noise created by the
>>>     uneven distribution because no coefficients can model a random noise.
>>>
>>>     With that in mind, I implemented a simple optimization algorithm[6]
>>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>     like this:
>>>
>>>         - Distribute the desired number of PGs[7]
>>>         - Subtract 1% of the weight of the OSD that is the most over used
>>>         - Add the subtracted weight to the OSD that is the most under used
>>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>
>>>     Quoting Adam Kupczyk, this works because:
>>>
>>>       "...CRUSH is not random proces at all, it behaves in numerically
>>>        stable way.  Specifically, if we increase weight on one node, we
>>>        will get more PGs on this node and less on every other node:
>>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>
>>>     A nice side effect of this optimization algorithm is that it does not
>>>     change the weight of the bucket containing the items being
>>>     optimized. It is local to a bucket with no influence on the other
>>>     parts of the crushmap (modulo the conditional probability bias).
>>>
>>>     In all tests the situation improves at least by an order of
>>>     magnitude. For instance when there is a 30% difference between two
>>>     OSDs, it is down to less than 3% after optimization.
>>>
>>>     The tests for the optimization method can be run with
>>>
>>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>
>>>     If anyone think of a reason why this algorithm won't work in some
>>>     cases, please speak up :-)
>>>
>>>     Cheers
>>>
>>>     [1] python-crush http://crush.readthedocs.io/
>>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>
>>>     --
>>>     Loïc Dachary, Artisan Logiciel Libre
>>>     --
>>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>     the body of a message to majordomo@vger.kernel.org
>>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: SMARTER REWEIGHT-BY-UTILIZATION
       [not found]         ` <CAAXqJ+oTkwT4AP6U5BUBVLbkTPwcwo8rnK1ng-p3UroEHBDV2A@mail.gmail.com>
@ 2017-05-07 19:31           ` Loic Dachary
  0 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-07 19:31 UTC (permalink / raw)
  To: Spandan Kumar Sahu; +Cc: Ceph Development

Hi,

On 05/07/2017 08:45 PM, Spandan Kumar Sahu wrote:
> Hello
> 
> I have been selected under the Google Summer of Code program to work on the "Smarter Reweight by Utilisation" project, and I believe this is very similar to what Loic is working on.

Congrats and welcome :-) This is an interesting project.

> I would really appreciate it if anyone could go through the proposed solution and give feedback. In short, it is somewhat similar to Loic's initial proposal. In more general terms, instead of simply subtracting and adding 1% from the most and least used OSDs, it tries to distribute the difference between the set value and the actual value among all the OSDs, in proportion to their weights. It uses some other tricks, which I have explained here [1] and, for just the algorithm part, here [2].

It looks like you've already done a lot of work and covered significant ground; this is good. You may want to look into the work of Xavier Villaneau at http://libcrush.org/main/python-crush/issues/14 which is described in detail at http://libcrush.org/xvillaneau/crush-docs/blob/master/converted/Ceph%20pool%20capacity%20analysis.pdf

Assuming the distribution of PGs within a pool is perfect (either via upmap or by optimizing the weights), the OSDs could still be over full because:

a) the object sizes have a significant variance and the sizes of the PGs vary
b) multiple pools overlap on some OSDs but not on all of them

In the simplest case (a single pool with objects of various sizes), the problem can be fixed by reweighting the OSDs. The current implementation does that, but it fails to take into account that modifying the weight of a single OSD has an impact on the distribution of PGs on all other OSDs (see https://github.com/plafl/notebooks/blob/master/converted/replication.pdf for a good explanation of that). In all other cases, modifying the weight of a single OSD is not going to work in general. The implementation will likely need to:

a) have a separate CRUSH hierarchy for each pool so that weights can be adjusted independently, because we cannot assume the pools have exactly the same workload
b) figure out which target weights make sense for the OSDs that are shared by multiple pools
c) figure out which weights should be adjusted if some PGs in a pool are larger than others

I think a good first step would be to address the simplest case and modify the implementation so that it changes the OSD weight in the crushmap rather than in the OSD map. The result would be the same, but it would be a step toward implementing a more sophisticated algorithm.
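
To make the point about a single OSD weight affecting all the others concrete, here is a toy sketch (plain Python, a simple proportional model, not a real CRUSH simulation and not the python-crush API) of why adjusting one OSD weight changes the expected PG count of every other OSD in the same bucket:

    def expected_pgs(weights, total_pgs):
        # expected PGs per OSD when PGs follow the weights proportionally
        total_weight = sum(weights)
        return [total_pgs * w / total_weight for w in weights]

    weights = [5.46, 5.46, 5.46, 5.46]  # four identical OSDs in one host
    before = expected_pgs(weights, 400)
    weights[0] *= 1.10                  # reweight a single OSD by +10%
    after = expected_pgs(weights, 400)
    for i, (b, a) in enumerate(zip(before, after)):
        print("osd.%d: %.1f -> %.1f PGs" % (i, b, a))
    # osd.0 gains PGs while osd.1, osd.2 and osd.3 all lose some, even
    # though only the weight of osd.0 changed.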

What do you think?

Cheers

> In [1] <https://docs.google.com/document/d/1RFvHEJiSXtTTjEX0MDWfaRkWwYjnQfV18EuXGsw2hm0/edit?usp=sharing> I have also given justification as to why this will work more efficiently.
> 
> [1] : https://docs.google.com/document/d/1RFvHEJiSXtTTjEX0MDWfaRkWwYjnQfV18EuXGsw2hm0/edit?usp=sharing
> [2] : https://github.com/SpandanKumarSahu/Ceph_Proposal
> 
> 
> On Sat, May 6, 2017 at 6:51 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
> 
>     Hi Dan,
> 
>     The optimization works for pool 5 which is using the "data" rule. It's an extreme case because there are very few PG per OSD (about 6) and, as expected, very uneven and some of them even have no PG at all (the list is abreviated but you can find it in full at https://paste2.org/3j1hbd3d):
> 
>               ~id~  ~weight~      ~PGs~  ~over/under used %~
>     ~name~
>     osd.104    104  5.459991         17           124.674479
>     osd.704    704  5.459991         16           111.458333
>     osd.75      75  5.459991         16           111.458333
>     ...
>     osd.25      25  5.459991          2           -73.567708
>     osd.336    336  5.459991          1           -86.783854
>     osd.673    673  5.459991          1           -86.783854
>     osd.496    496  5.459991          0          -100.000000
>     osd.646    646  5.459991          0          -100.000000
> 
>     The failure domain (rack) only has ~5% over / under full racks. But, this has an impact on the uneven distribution within each rack.
> 
>             ~id~    ~weight~      ~PGs~  ~over/under used %~
>     ~name~
>     RA13      -9  786.238770       1150             5.545609
>     RA01     -72  911.818573       1270             0.506019
>     RA09      -6  917.278564       1274             0.222439
>     RA17     -14  900.898590       1238            -0.838857
>     RA05      -4  917.278564       1212            -4.654948
> 
>     After optimization the distribution of the OSDs is still uneven, even though it improved significantly:
> 
>               ~id~  ~weight~      ~PGs~  ~over/under used %~
>     ~name~
>     osd.252    252  5.459991         10            32.161458
>     osd.330    330  5.459991         10            32.161458
>     osd.571    571  5.459991          9            18.945312
>     ...
>     osd.261    261  5.459991          5           -33.919271
>     osd.210    210  5.459991          5           -33.919271
>     osd.1269  1269  5.459991          4           -47.135417
> 
>     and the racks uneven distribution dropped under 0.5% which is better:
> 
>             ~id~    ~weight~      ~PGs~  ~over/under used %~
>     ~name~
>     RA17     -14  900.898590       1252             0.282513
>     RA01     -72  911.818573       1267             0.268603
>     RA13      -9  786.238770       1089            -0.052897
>     RA05      -4  917.278564       1269            -0.170898
>     RA09      -6  917.278564       1267            -0.328234
> 
>     I'll keep working on optimizing the two other pools. Don't hesistate to tell me if I'm going in the wrong direction.
> 
>     Cheers
> 
> 
>     On 05/02/2017 12:39 PM, Dan van der Ster wrote:
>     > On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>     >> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>     >>> Hi Loic,
>     >>>
>     >>> I'm not managing to compile this on my CentOS 7 dev box.
>     >>
>     >> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>     >>
>     >
>     > Double requirement given: appdirs==1.4.3 (from -r
>     > /root/git/python-crush/requirements-dev.txt (line 8)) (already in
>     > appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
>     > 10)), name='appdirs')
>     >
>     > [root@dvanders-work python-crush]# grep appdirs *.txt
>     > requirements-dev.txt:appdirs==1.4.3
>     > requirements.txt:appdirs==1.4.3
>     >
>     >
>     >
>     >>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1 <https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1>
>     >>
>     >> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
>     >
>     > pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
>     > object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
>     > nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
>     > stripe_width 0
>     > pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
>     > object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
>     > hashpspool,nodelete,nopgchange,nosizechange
>     > min_read_recency_for_promote 1 stripe_width 0
>     > pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
>     > object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
>     > hashpspool,nodelete,nopgchange,nosizechange
>     > min_read_recency_for_promote 1 stripe_width 0
>     >
>     >
>     >>
>     >>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>     >>
>     >> Thanks, I will :-)
>     >>
>     >
>     > Cool, thanks!
>     >
>     > -- Dan
>     >
>     >>> Cheers, Dan
>     >>>
>     >>>
>     >>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org> <mailto:loic@dachary.org <mailto:loic@dachary.org>>> wrote:
>     >>>
>     >>>     Hi,
>     >>>
>     >>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>     >>>     the same proportion. If an OSD is 75% full, it is expected that all
>     >>>     other OSDs are also 75% full.
>     >>>
>     >>>     In reality the distribution is even only when more than 100,000 PGs
>     >>>     are distributed in a pool of size 1 (i.e. no replication).
>     >>>
>     >>>     In small clusters there are a few thousands PGs and it is not enough
>     >>>     to get an even distribution. Running the following with
>     >>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>     >>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>     >>>
>     >>>       for PGs in 1000 10000 100000 1000000 ; do
>     >>>         crush analyze --replication-count 1 \
>     >>>                       --type device \
>     >>>                       --values-count $PGs \
>     >>>                       --rule data \
>     >>>                       --crushmap tests/sample-crushmap.json
>     >>>       done
>     >>>
>     >>>     In larger clusters, even though a greater number of PGs are
>     >>>     distributed, there are at most a few dozens devices per host and the
>     >>>     problem remains. On a machine with 24 OSDs each expected to handle a
>     >>>     few hundred PGs, a total of a few thousands PGs are distributed which
>     >>>     is not enough to get an even distribution.
>     >>>
>     >>>     There is a secondary reason for the distribution to be uneven, when
>     >>>     there is more than one replica. The second replica must be on a
>     >>>     different device than the first replica. This conditional probability
>     >>>     is not taken into account by CRUSH and would create an uneven
>     >>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>     >>>     a given OSD can only handle a few hundred PGs and this conditional
>     >>>     probability bias is dominated by the uneven distribution caused by the
>     >>>     low number of PGs.
>     >>>
>     >>>     The uneven CRUSH distributions are always caused by a low number of
>     >>>     samples, even in large clusters. Since this noise (i.e. the difference
>     >>>     between the desired distribution and the actual distribution) is
>     >>>     random, it cannot be fixed by optimizations methods.  The
>     >>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>     >>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>     >>>     fails to find a gradient that would allow it to converge faster. And
>     >>>     even if it did, the local minimum found would be as often wrong as
>     >>>     with Nedler-Mead, only it would go faster. A least mean squares
>     >>>     filter[5] is equally unable to suppress the noise created by the
>     >>>     uneven distribution because no coefficients can model a random noise.
>     >>>
>     >>>     With that in mind, I implemented a simple optimization algorithm[6]
>     >>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>     >>>     like this:
>     >>>
>     >>>         - Distribute the desired number of PGs[7]
>     >>>         - Subtract 1% of the weight of the OSD that is the most over used
>     >>>         - Add the subtracted weight to the OSD that is the most under used
>     >>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>     >>>
>     >>>     Quoting Adam Kupczyk, this works because:
>     >>>
>     >>>       "...CRUSH is not random proces at all, it behaves in numerically
>     >>>        stable way.  Specifically, if we increase weight on one node, we
>     >>>        will get more PGs on this node and less on every other node:
>     >>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>     >>>
>     >>>     A nice side effect of this optimization algorithm is that it does not
>     >>>     change the weight of the bucket containing the items being
>     >>>     optimized. It is local to a bucket with no influence on the other
>     >>>     parts of the crushmap (modulo the conditional probability bias).
>     >>>
>     >>>     In all tests the situation improves at least by an order of
>     >>>     magnitude. For instance when there is a 30% difference between two
>     >>>     OSDs, it is down to less than 3% after optimization.
>     >>>
>     >>>     The tests for the optimization method can be run with
>     >>>
>     >>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git> <http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git>>
>     >>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>     >>>
>     >>>     If anyone think of a reason why this algorithm won't work in some
>     >>>     cases, please speak up :-)
>     >>>
>     >>>     Cheers
>     >>>
>     >>>     [1] python-crush http://crush.readthedocs.io/
>     >>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2> <http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2>>
>     >>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method> <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method>>
>     >>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb> <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb>>
>     >>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter> <https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter>>
>     >>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39> <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39>>
>     >>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>     >>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence> <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>>
>     >>>
>     >>>     --
>     >>>     Loïc Dachary, Artisan Logiciel Libre
>     >>>     --
>     >>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     >>>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org> <mailto:majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>>
>     >>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html> <http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>>
>     >>>
>     >>>
>     >>
>     >> --
>     >> Loïc Dachary, Artisan Logiciel Libre
>     >
> 
>     --
>     Loïc Dachary, Artisan Logiciel Libre
>     --
>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
> 
> 
> 
> 
> -- 
> Spandan Kumar Sahu
> IIT Kharagpur

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-06 13:21       ` Loic Dachary
       [not found]         ` <CAAXqJ+oTkwT4AP6U5BUBVLbkTPwcwo8rnK1ng-p3UroEHBDV2A@mail.gmail.com>
@ 2017-05-08  3:34         ` Spandan Kumar Sahu
  2017-05-08  9:59           ` Spandan Kumar Sahu
  2017-05-08 11:36         ` Dan van der Ster
  2 siblings, 1 reply; 37+ messages in thread
From: Spandan Kumar Sahu @ 2017-05-08  3:34 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Dan van der Ster, Ceph Development

Hello

I have been selected under the Google Summer of Code program to work
on the "Smarter Reweight by Utilisation" project, and I believe this
is very similar to what Loic is working on.

I would really appreciate it if anyone could go through the proposed
solution and give feedback. In short, it is somewhat similar to
Loic's initial proposal. In more general terms, instead of simply
subtracting 1% from the most used OSD and adding it to the least used
one, it tries to distribute the difference between the target value
and the actual value among all the OSDs, in proportion to their
weights. It uses some other tricks, which I have explained here [1];
just the algorithm part is described here [2].

In [1] I have also given a justification of why this will work more
efficiently.

[1] : https://docs.google.com/document/d/1RFvHEJiSXtTTjEX0MDWfaRkWwYjnQfV18EuXGsw2hm0/edit?usp=sharing
[2] : https://github.com/SpandanKumarSahu/Ceph_Proposal
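
To make the redistribution step more concrete, here is a minimal
sketch (plain Python, hypothetical names, not the implementation
proposed in [1]): every OSD is adjusted at once, in proportion to how
far it is from its desired utilisation, instead of only touching the
two extreme OSDs:

    KP = 0.05  # tuning constant: fraction of the error applied per step

    def reweight_step(weights, used_percent, desired_percent):
        # one proportional adjustment step over all OSDs at once
        new_weights = []
        for w, used, desired in zip(weights, used_percent, desired_percent):
            error = desired - used  # positive if the OSD is under-used
            new_weights.append(w * (1.0 + KP * error / 100.0))
        return new_weights

    # Three equal OSDs; osd.0 is over-used and osd.2 is under-used.
    weights = [5.46, 5.46, 5.46]
    used = [40.0, 33.0, 27.0]        # percent of PGs on each OSD
    desired = [33.3, 33.3, 33.3]
    print(reweight_step(weights, used, desired))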

-- 
Spandan Kumar Sahu
IIT Kharagpur

On Sat, May 6, 2017 at 6:51 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi Dan,
>
> The optimization works for pool 5 which is using the "data" rule. It's an extreme case because there are very few PG per OSD (about 6) and, as expected, very uneven and some of them even have no PG at all (the list is abreviated but you can find it in full at https://paste2.org/3j1hbd3d):
>
>           ~id~  ~weight~      ~PGs~  ~over/under used %~
> ~name~
> osd.104    104  5.459991         17           124.674479
> osd.704    704  5.459991         16           111.458333
> osd.75      75  5.459991         16           111.458333
> ...
> osd.25      25  5.459991          2           -73.567708
> osd.336    336  5.459991          1           -86.783854
> osd.673    673  5.459991          1           -86.783854
> osd.496    496  5.459991          0          -100.000000
> osd.646    646  5.459991          0          -100.000000
>
> The failure domain (rack) only has ~5% over / under full racks. But, this has an impact on the uneven distribution within each rack.
>
>         ~id~    ~weight~      ~PGs~  ~over/under used %~
> ~name~
> RA13      -9  786.238770       1150             5.545609
> RA01     -72  911.818573       1270             0.506019
> RA09      -6  917.278564       1274             0.222439
> RA17     -14  900.898590       1238            -0.838857
> RA05      -4  917.278564       1212            -4.654948
>
> After optimization the distribution of the OSDs is still uneven, even though it improved significantly:
>
>           ~id~  ~weight~      ~PGs~  ~over/under used %~
> ~name~
> osd.252    252  5.459991         10            32.161458
> osd.330    330  5.459991         10            32.161458
> osd.571    571  5.459991          9            18.945312
> ...
> osd.261    261  5.459991          5           -33.919271
> osd.210    210  5.459991          5           -33.919271
> osd.1269  1269  5.459991          4           -47.135417
>
> and the racks uneven distribution dropped under 0.5% which is better:
>
>         ~id~    ~weight~      ~PGs~  ~over/under used %~
> ~name~
> RA17     -14  900.898590       1252             0.282513
> RA01     -72  911.818573       1267             0.268603
> RA13      -9  786.238770       1089            -0.052897
> RA05      -4  917.278564       1269            -0.170898
> RA09      -6  917.278564       1267            -0.328234
>
> I'll keep working on optimizing the two other pools. Don't hesistate to tell me if I'm going in the wrong direction.
>
> Cheers
>
>
> On 05/02/2017 12:39 PM, Dan van der Ster wrote:
>> On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>>>> Hi Loic,
>>>>
>>>> I'm not managing to compile this on my CentOS 7 dev box.
>>>
>>> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>>>
>>
>> Double requirement given: appdirs==1.4.3 (from -r
>> /root/git/python-crush/requirements-dev.txt (line 8)) (already in
>> appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
>> 10)), name='appdirs')
>>
>> [root@dvanders-work python-crush]# grep appdirs *.txt
>> requirements-dev.txt:appdirs==1.4.3
>> requirements.txt:appdirs==1.4.3
>>
>>
>>
>>>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>>>
>>> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
>>
>> pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
>> nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
>> stripe_width 0
>> pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
>> hashpspool,nodelete,nopgchange,nosizechange
>> min_read_recency_for_promote 1 stripe_width 0
>> pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
>> hashpspool,nodelete,nopgchange,nosizechange
>> min_read_recency_for_promote 1 stripe_width 0
>>
>>
>>>
>>>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>>>
>>> Thanks, I will :-)
>>>
>>
>> Cool, thanks!
>>
>> -- Dan
>>
>>>> Cheers, Dan
>>>>
>>>>
>>>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>     the same proportion. If an OSD is 75% full, it is expected that all
>>>>     other OSDs are also 75% full.
>>>>
>>>>     In reality the distribution is even only when more than 100,000 PGs
>>>>     are distributed in a pool of size 1 (i.e. no replication).
>>>>
>>>>     In small clusters there are a few thousands PGs and it is not enough
>>>>     to get an even distribution. Running the following with
>>>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>
>>>>       for PGs in 1000 10000 100000 1000000 ; do
>>>>         crush analyze --replication-count 1 \
>>>>                       --type device \
>>>>                       --values-count $PGs \
>>>>                       --rule data \
>>>>                       --crushmap tests/sample-crushmap.json
>>>>       done
>>>>
>>>>     In larger clusters, even though a greater number of PGs are
>>>>     distributed, there are at most a few dozens devices per host and the
>>>>     problem remains. On a machine with 24 OSDs each expected to handle a
>>>>     few hundred PGs, a total of a few thousands PGs are distributed which
>>>>     is not enough to get an even distribution.
>>>>
>>>>     There is a secondary reason for the distribution to be uneven, when
>>>>     there is more than one replica. The second replica must be on a
>>>>     different device than the first replica. This conditional probability
>>>>     is not taken into account by CRUSH and would create an uneven
>>>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>     a given OSD can only handle a few hundred PGs and this conditional
>>>>     probability bias is dominated by the uneven distribution caused by the
>>>>     low number of PGs.
>>>>
>>>>     The uneven CRUSH distributions are always caused by a low number of
>>>>     samples, even in large clusters. Since this noise (i.e. the difference
>>>>     between the desired distribution and the actual distribution) is
>>>>     random, it cannot be fixed by optimizations methods.  The
>>>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>     fails to find a gradient that would allow it to converge faster. And
>>>>     even if it did, the local minimum found would be as often wrong as
>>>>     with Nedler-Mead, only it would go faster. A least mean squares
>>>>     filter[5] is equally unable to suppress the noise created by the
>>>>     uneven distribution because no coefficients can model a random noise.
>>>>
>>>>     With that in mind, I implemented a simple optimization algorithm[6]
>>>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>     like this:
>>>>
>>>>         - Distribute the desired number of PGs[7]
>>>>         - Subtract 1% of the weight of the OSD that is the most over used
>>>>         - Add the subtracted weight to the OSD that is the most under used
>>>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>
>>>>     Quoting Adam Kupczyk, this works because:
>>>>
>>>>       "...CRUSH is not random proces at all, it behaves in numerically
>>>>        stable way.  Specifically, if we increase weight on one node, we
>>>>        will get more PGs on this node and less on every other node:
>>>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>
>>>>     A nice side effect of this optimization algorithm is that it does not
>>>>     change the weight of the bucket containing the items being
>>>>     optimized. It is local to a bucket with no influence on the other
>>>>     parts of the crushmap (modulo the conditional probability bias).
>>>>
>>>>     In all tests the situation improves at least by an order of
>>>>     magnitude. For instance when there is a 30% difference between two
>>>>     OSDs, it is down to less than 3% after optimization.
>>>>
>>>>     The tests for the optimization method can be run with
>>>>
>>>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git>
>>>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>
>>>>     If anyone think of a reason why this algorithm won't work in some
>>>>     cases, please speak up :-)
>>>>
>>>>     Cheers
>>>>
>>>>     [1] python-crush http://crush.readthedocs.io/
>>>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2>
>>>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method>
>>>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb>
>>>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter>
>>>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39>
>>>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>
>>>>
>>>>     --
>>>>     Loïc Dachary, Artisan Logiciel Libre
>>>>     --
>>>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>>>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Spandan Kumar Sahu
IIT Kharagpur

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-08  3:34         ` revisiting uneven CRUSH distributions Spandan Kumar Sahu
@ 2017-05-08  9:59           ` Spandan Kumar Sahu
  2017-05-08 10:27             ` Loic Dachary
  0 siblings, 1 reply; 37+ messages in thread
From: Spandan Kumar Sahu @ 2017-05-08  9:59 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Dan van der Ster, Ceph Development

I am sorry for hijacking the thread. I was unaware of it until my
mentor pointed it out.

So here are my observations regarding the current algorithm.

* Subtracting 1% from the most over-used OSD and adding it to the
least-used one doesn't change the weights of the other OSDs, so it
takes longer to reach the required distribution. Instead, I suggest
that, for each OSD, we calculate the difference between the desired
utilisation percentage and its current utilisation percentage,
multiply it by a constant and add the result to the existing weight.
In this way, at each iteration, the weight of every OSD moves towards
the desired distribution, so convergence is faster.

* Also, we need to consider the weights of the OSDs while making a
change. For example, a 1% over-use on a highly weighted device might
mean that more data is unevenly distributed (since the OSD weights
are in general a measure of their storage capacity). So every change
we make to the weight of an OSD needs to be scaled according to the
weight of that OSD.

* The case where several OSDs are tied as the most over-used and
several OSDs are tied as the most under-used has not been handled. If
I make reasonable assumptions, the weight of each most under-used OSD
increases by the same factor, thereby introducing a possibility of
uneven distribution.

* The noise, as explained, is random. So, in order to cancel it, we
don't just need to add or subtract the current difference between the
desired and the actual utilisation percentage, but also take into
account the noise observed in the past. This can be done by
introducing two additional factors, an "integral factor" and a
"derivative factor". There are well-known algorithms to handle such
noise; one of them is a PID-based feedback mechanism. The current
implementation is a weak form of PID-based feedback. I have explained
a modified one with an example here [1], and a rough sketch follows
below.
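
A rough sketch of the PID idea (made-up gains, a hypothetical class,
plain Python, not the proposed implementation itself); with KI = KD =
0 it reduces to the purely proportional adjustment from the first
point above:

    # PID-style weight update: combine the current error, its
    # accumulated sum (integral) and its rate of change (derivative).
    KP, KI, KD = 0.05, 0.01, 0.02

    class OsdWeightPID(object):
        def __init__(self):
            self.integral = 0.0    # accumulated past error
            self.last_error = 0.0  # previous error, for the derivative

        def adjust(self, weight, used_percent, desired_percent):
            error = desired_percent - used_percent
            self.integral += error
            derivative = error - self.last_error
            self.last_error = error
            correction = KP * error + KI * self.integral + KD * derivative
            return weight * (1.0 + correction / 100.0)

    # One controller per OSD; feed it the utilisation observed after
    # each remap. This OSD is over-used, so its weight is lowered a
    # bit at every step.
    pid = OsdWeightPID()
    weight = 5.46
    for used in (40.0, 38.5, 36.0, 34.5):
        weight = pid.adjust(weight, used, 33.3)
        print(round(weight, 4))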

Thanks

Spandan Kumar Sahu

[1] : https://github.com/SpandanKumarSahu/Ceph_Proposal/blob/master/Readme

On Mon, May 8, 2017 at 9:04 AM, Spandan Kumar Sahu
<spandankumarsahu@gmail.com> wrote:
> Hello
>
> I have been selected under Google Summer of Code program to work under
> the "Smarter Reweight by Utilisation" project, and I believe this is
> very similar to what Loic is working on.
>
> I would really appreciate if anyone can go through the proposed
> solution, and give feedback. In short, it is somewhat similar, to
> Loic's initial proposal. In more general terms, instead of simply
> subtracting and adding 1% from the most and least used OSDs, it tries
> to distribute the difference between the set value and the actual
> value, among all the OSDs, in proportion to their weights. It uses
> some other tricks, which I have explained over here [1] and just the
> algorithm part over here [2].
>
> In [1] I have also given justification as to why this will work more
> efficiently.
>
> [1] : https://docs.google.com/document/d/1RFvHEJiSXtTTjEX0MDWfaRkWwYjnQfV18EuXGsw2hm0/edit?usp=sharing
> [2] : https://github.com/SpandanKumarSahu/Ceph_Proposal
>
> --
> Spandan Kumar Sahu
> IIT Kharagpur
>
> On Sat, May 6, 2017 at 6:51 PM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Dan,
>>
>> The optimization works for pool 5 which is using the "data" rule. It's an extreme case because there are very few PG per OSD (about 6) and, as expected, very uneven and some of them even have no PG at all (the list is abreviated but you can find it in full at https://paste2.org/3j1hbd3d):
>>
>>           ~id~  ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> osd.104    104  5.459991         17           124.674479
>> osd.704    704  5.459991         16           111.458333
>> osd.75      75  5.459991         16           111.458333
>> ...
>> osd.25      25  5.459991          2           -73.567708
>> osd.336    336  5.459991          1           -86.783854
>> osd.673    673  5.459991          1           -86.783854
>> osd.496    496  5.459991          0          -100.000000
>> osd.646    646  5.459991          0          -100.000000
>>
>> The failure domain (rack) only has ~5% over / under full racks. But, this has an impact on the uneven distribution within each rack.
>>
>>         ~id~    ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> RA13      -9  786.238770       1150             5.545609
>> RA01     -72  911.818573       1270             0.506019
>> RA09      -6  917.278564       1274             0.222439
>> RA17     -14  900.898590       1238            -0.838857
>> RA05      -4  917.278564       1212            -4.654948
>>
>> After optimization the distribution of the OSDs is still uneven, even though it improved significantly:
>>
>>           ~id~  ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> osd.252    252  5.459991         10            32.161458
>> osd.330    330  5.459991         10            32.161458
>> osd.571    571  5.459991          9            18.945312
>> ...
>> osd.261    261  5.459991          5           -33.919271
>> osd.210    210  5.459991          5           -33.919271
>> osd.1269  1269  5.459991          4           -47.135417
>>
>> and the racks uneven distribution dropped under 0.5% which is better:
>>
>>         ~id~    ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> RA17     -14  900.898590       1252             0.282513
>> RA01     -72  911.818573       1267             0.268603
>> RA13      -9  786.238770       1089            -0.052897
>> RA05      -4  917.278564       1269            -0.170898
>> RA09      -6  917.278564       1267            -0.328234
>>
>> I'll keep working on optimizing the two other pools. Don't hesistate to tell me if I'm going in the wrong direction.
>>
>> Cheers
>>
>>
>> On 05/02/2017 12:39 PM, Dan van der Ster wrote:
>>> On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>>> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>>>>> Hi Loic,
>>>>>
>>>>> I'm not managing to compile this on my CentOS 7 dev box.
>>>>
>>>> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>>>>
>>>
>>> Double requirement given: appdirs==1.4.3 (from -r
>>> /root/git/python-crush/requirements-dev.txt (line 8)) (already in
>>> appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
>>> 10)), name='appdirs')
>>>
>>> [root@dvanders-work python-crush]# grep appdirs *.txt
>>> requirements-dev.txt:appdirs==1.4.3
>>> requirements.txt:appdirs==1.4.3
>>>
>>>
>>>
>>>>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>>>>
>>>> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
>>>
>>> pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
>>> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
>>> nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
>>> stripe_width 0
>>> pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
>>> hashpspool,nodelete,nopgchange,nosizechange
>>> min_read_recency_for_promote 1 stripe_width 0
>>> pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
>>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
>>> hashpspool,nodelete,nopgchange,nosizechange
>>> min_read_recency_for_promote 1 stripe_width 0
>>>
>>>
>>>>
>>>>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>>>>
>>>> Thanks, I will :-)
>>>>
>>>
>>> Cool, thanks!
>>>
>>> -- Dan
>>>
>>>>> Cheers, Dan
>>>>>
>>>>>
>>>>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>>>
>>>>>     Hi,
>>>>>
>>>>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>     the same proportion. If an OSD is 75% full, it is expected that all
>>>>>     other OSDs are also 75% full.
>>>>>
>>>>>     In reality the distribution is even only when more than 100,000 PGs
>>>>>     are distributed in a pool of size 1 (i.e. no replication).
>>>>>
>>>>>     In small clusters there are a few thousands PGs and it is not enough
>>>>>     to get an even distribution. Running the following with
>>>>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>
>>>>>       for PGs in 1000 10000 100000 1000000 ; do
>>>>>         crush analyze --replication-count 1 \
>>>>>                       --type device \
>>>>>                       --values-count $PGs \
>>>>>                       --rule data \
>>>>>                       --crushmap tests/sample-crushmap.json
>>>>>       done
>>>>>
>>>>>     In larger clusters, even though a greater number of PGs are
>>>>>     distributed, there are at most a few dozens devices per host and the
>>>>>     problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>     few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>     is not enough to get an even distribution.
>>>>>
>>>>>     There is a secondary reason for the distribution to be uneven, when
>>>>>     there is more than one replica. The second replica must be on a
>>>>>     different device than the first replica. This conditional probability
>>>>>     is not taken into account by CRUSH and would create an uneven
>>>>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>     a given OSD can only handle a few hundred PGs and this conditional
>>>>>     probability bias is dominated by the uneven distribution caused by the
>>>>>     low number of PGs.
>>>>>
>>>>>     The uneven CRUSH distributions are always caused by a low number of
>>>>>     samples, even in large clusters. Since this noise (i.e. the difference
>>>>>     between the desired distribution and the actual distribution) is
>>>>>     random, it cannot be fixed by optimizations methods.  The
>>>>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>     fails to find a gradient that would allow it to converge faster. And
>>>>>     even if it did, the local minimum found would be as often wrong as
>>>>>     with Nedler-Mead, only it would go faster. A least mean squares
>>>>>     filter[5] is equally unable to suppress the noise created by the
>>>>>     uneven distribution because no coefficients can model a random noise.
>>>>>
>>>>>     With that in mind, I implemented a simple optimization algorithm[6]
>>>>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>     like this:
>>>>>
>>>>>         - Distribute the desired number of PGs[7]
>>>>>         - Subtract 1% of the weight of the OSD that is the most over used
>>>>>         - Add the subtracted weight to the OSD that is the most under used
>>>>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>
>>>>>     Quoting Adam Kupczyk, this works because:
>>>>>
>>>>>       "...CRUSH is not random proces at all, it behaves in numerically
>>>>>        stable way.  Specifically, if we increase weight on one node, we
>>>>>        will get more PGs on this node and less on every other node:
>>>>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>
>>>>>     A nice side effect of this optimization algorithm is that it does not
>>>>>     change the weight of the bucket containing the items being
>>>>>     optimized. It is local to a bucket with no influence on the other
>>>>>     parts of the crushmap (modulo the conditional probability bias).
>>>>>
>>>>>     In all tests the situation improves at least by an order of
>>>>>     magnitude. For instance when there is a 30% difference between two
>>>>>     OSDs, it is down to less than 3% after optimization.
>>>>>
>>>>>     The tests for the optimization method can be run with
>>>>>
>>>>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git>
>>>>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>
>>>>>     If anyone think of a reason why this algorithm won't work in some
>>>>>     cases, please speak up :-)
>>>>>
>>>>>     Cheers
>>>>>
>>>>>     [1] python-crush http://crush.readthedocs.io/
>>>>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2>
>>>>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method>
>>>>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb>
>>>>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter>
>>>>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39>
>>>>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>
>>>>>
>>>>>     --
>>>>>     Loïc Dachary, Artisan Logiciel Libre
>>>>>     --
>>>>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>>>>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Spandan Kumar Sahu
> IIT Kharagpur



-- 
Spandan Kumar Sahu
IIT Kharagpur

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-08  9:59           ` Spandan Kumar Sahu
@ 2017-05-08 10:27             ` Loic Dachary
  0 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-08 10:27 UTC (permalink / raw)
  To: Spandan Kumar Sahu; +Cc: Ceph Development

Hi Spandan,

On 05/08/2017 11:59 AM, Spandan Kumar Sahu wrote:
> I am sorry for hijacking the thread. I was unaware of it, until my
> mentor pointed this out.
> 
> So I will write in, my observations regarding the current algorithm.
> 
> * Subtracting 1% from the most over-used and adding it to the
> least-used one doesn't change the weights of the other OSDs. So, the
> time taken to achieve the required distribution will be more. Instead,
> I suggest to, for each OSD, calculate the difference between the
> desired percentage used and its current percentage used, multiply with
> a constant and that to the existing weights. In this way, at each
> iteration, the weight of each OSD changes towards the desired weight
> distribution. Hence it will be faster.
> 
> * Also, we need to consider the weights of the OSDs while making a
> change. For example, a 1% over-use in a highly weighted device, might
> mean, more data is unevenly distributed ( since, the OSDs weights are
> in general a measure of their storing capacity). So, all the changes
> that we make to the weights of the OSDs, we need to scale or factor
> according to the weight of the OSD.

These two comments make a lot of sense. While testing the algorithm at

http://libcrush.org/dachary/python-crush/blob/79042a415590157c6cf93fa844abcea9158bc327/tests/test_analyze.py#L81

the issues you describe did not cause problems. But you're right, it would converge faster and get closer to the optimum if it were implemented with the suggested improvements.

> * The case when there are multiple OSDs are most over-used and
> multiple OSDs are most-underused, has not been handled. If I make
> reasonable assumptions, then, the weight of the most-underused
> increase by the same factor, thereby, introducing a possibility of
> uneven distribution.

The loop stops when things get worse instead of getting better. But again, you're right, handling this better would get us closer to the expected distribution. In all the tests I conducted, things improved by at least an order of magnitude, and I did not spend time refining the details further.

I initially thought it would be a good idea to have a loss function calculating the Kullback-Leibler divergence between the expected and the actual distribution. It turns out that the sum of the absolute values of the per-item differences between the expected and the actual distribution is just as good.
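
For reference, the two loss functions can be sketched like this (plain Python, not the code in the repository):

    import math

    def kl_divergence(expected, actual):
        # Kullback-Leibler divergence between two PG distributions
        e_total, a_total = float(sum(expected)), float(sum(actual))
        return sum((a / a_total) * math.log((a / a_total) / (e / e_total))
                   for e, a in zip(expected, actual) if a > 0)

    def absolute_loss(expected, actual):
        # sum of the absolute per-item differences
        return sum(abs(e - a) for e, a in zip(expected, actual))

    expected = [100, 100, 100, 100]
    actual = [124, 96, 92, 88]
    print("KL divergence: %.4f" % kl_divergence(expected, actual))
    print("absolute loss: %d" % absolute_loss(expected, actual))
    # Both decrease as the actual distribution gets closer to the
    # expected one, which is all the optimization loop needs.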

> * The noises, as explained, are random. So, in order to cancel them,
> we don't just need to add or subtract the current difference between
> the desired and current weight percentage, but also take care of
> noises in the past. This can be done by introducing another, factor
> called "integral-factor" and "derivative-factor".
> There are algorithms to handle noises. One of such algorithm is
> "PID-based" feedback mechanism. The current implementation is a weak
> form of PID-based feedback. I have explained a modified one with an
> example here[1].

I think CRUSH is a stochastic function that cannot be optimized as if it were a linear function.

Cheers

> 
> Thanks
> 
> Spandan Kumar Sahu
> 
> [1] : https://github.com/SpandanKumarSahu/Ceph_Proposal/blob/master/Readme
> 
> On Mon, May 8, 2017 at 9:04 AM, Spandan Kumar Sahu
> <spandankumarsahu@gmail.com> wrote:
>> Hello
>>
>> I have been selected under Google Summer of Code program to work under
>> the "Smarter Reweight by Utilisation" project, and I believe this is
>> very similar to what Loic is working on.
>>
>> I would really appreciate if anyone can go through the proposed
>> solution, and give feedback. In short, it is somewhat similar, to
>> Loic's initial proposal. In more general terms, instead of simply
>> subtracting and adding 1% from the most and least used OSDs, it tries
>> to distribute the difference between the set value and the actual
>> value, among all the OSDs, in proportion to their weights. It uses
>> some other tricks, which I have explained over here [1] and just the
>> algorithm part over here [2].
>>
>> In [1] I have also given justification as to why this will work more
>> efficiently.
>>
>> [1] : https://docs.google.com/document/d/1RFvHEJiSXtTTjEX0MDWfaRkWwYjnQfV18EuXGsw2hm0/edit?usp=sharing
>> [2] : https://github.com/SpandanKumarSahu/Ceph_Proposal
>>
>> --
>> Spandan Kumar Sahu
>> IIT Kharagpur
>>
>> On Sat, May 6, 2017 at 6:51 PM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi Dan,
>>>
>>> The optimization works for pool 5 which is using the "data" rule. It's an extreme case because there are very few PG per OSD (about 6) and, as expected, very uneven and some of them even have no PG at all (the list is abreviated but you can find it in full at https://paste2.org/3j1hbd3d):
>>>
>>>           ~id~  ~weight~      ~PGs~  ~over/under used %~
>>> ~name~
>>> osd.104    104  5.459991         17           124.674479
>>> osd.704    704  5.459991         16           111.458333
>>> osd.75      75  5.459991         16           111.458333
>>> ...
>>> osd.25      25  5.459991          2           -73.567708
>>> osd.336    336  5.459991          1           -86.783854
>>> osd.673    673  5.459991          1           -86.783854
>>> osd.496    496  5.459991          0          -100.000000
>>> osd.646    646  5.459991          0          -100.000000
>>>
>>> The failure domain (rack) only has ~5% over / under full racks. But, this has an impact on the uneven distribution within each rack.
>>>
>>>         ~id~    ~weight~      ~PGs~  ~over/under used %~
>>> ~name~
>>> RA13      -9  786.238770       1150             5.545609
>>> RA01     -72  911.818573       1270             0.506019
>>> RA09      -6  917.278564       1274             0.222439
>>> RA17     -14  900.898590       1238            -0.838857
>>> RA05      -4  917.278564       1212            -4.654948
>>>
>>> After optimization the distribution of the OSDs is still uneven, even though it improved significantly:
>>>
>>>           ~id~  ~weight~      ~PGs~  ~over/under used %~
>>> ~name~
>>> osd.252    252  5.459991         10            32.161458
>>> osd.330    330  5.459991         10            32.161458
>>> osd.571    571  5.459991          9            18.945312
>>> ...
>>> osd.261    261  5.459991          5           -33.919271
>>> osd.210    210  5.459991          5           -33.919271
>>> osd.1269  1269  5.459991          4           -47.135417
>>>
>>> and the racks uneven distribution dropped under 0.5% which is better:
>>>
>>>         ~id~    ~weight~      ~PGs~  ~over/under used %~
>>> ~name~
>>> RA17     -14  900.898590       1252             0.282513
>>> RA01     -72  911.818573       1267             0.268603
>>> RA13      -9  786.238770       1089            -0.052897
>>> RA05      -4  917.278564       1269            -0.170898
>>> RA09      -6  917.278564       1267            -0.328234
>>>
>>> I'll keep working on optimizing the two other pools. Don't hesistate to tell me if I'm going in the wrong direction.
>>>
>>> Cheers
>>>
>>>
>>> On 05/02/2017 12:39 PM, Dan van der Ster wrote:
>>>> On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>>>>>> Hi Loic,
>>>>>>
>>>>>> I'm not managing to compile this on my CentOS 7 dev box.
>>>>>
>>>>> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>>>>>
>>>>
>>>> Double requirement given: appdirs==1.4.3 (from -r
>>>> /root/git/python-crush/requirements-dev.txt (line 8)) (already in
>>>> appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
>>>> 10)), name='appdirs')
>>>>
>>>> [root@dvanders-work python-crush]# grep appdirs *.txt
>>>> requirements-dev.txt:appdirs==1.4.3
>>>> requirements.txt:appdirs==1.4.3
>>>>
>>>>
>>>>
>>>>>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>>>>>
>>>>> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
>>>>
>>>> pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
>>>> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
>>>> nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
>>>> stripe_width 0
>>>> pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
>>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
>>>> hashpspool,nodelete,nopgchange,nosizechange
>>>> min_read_recency_for_promote 1 stripe_width 0
>>>> pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
>>>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
>>>> hashpspool,nodelete,nopgchange,nosizechange
>>>> min_read_recency_for_promote 1 stripe_width 0
>>>>
>>>>
>>>>>
>>>>>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>>>>>
>>>>> Thanks, I will :-)
>>>>>
>>>>
>>>> Cool, thanks!
>>>>
>>>> -- Dan
>>>>
>>>>>> Cheers, Dan
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>>>>
>>>>>>     Hi,
>>>>>>
>>>>>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>     the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>     other OSDs are also 75% full.
>>>>>>
>>>>>>     In reality the distribution is even only when more than 100,000 PGs
>>>>>>     are distributed in a pool of size 1 (i.e. no replication).
>>>>>>
>>>>>>     In small clusters there are a few thousands PGs and it is not enough
>>>>>>     to get an even distribution. Running the following with
>>>>>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>
>>>>>>       for PGs in 1000 10000 100000 1000000 ; do
>>>>>>         crush analyze --replication-count 1 \
>>>>>>                       --type device \
>>>>>>                       --values-count $PGs \
>>>>>>                       --rule data \
>>>>>>                       --crushmap tests/sample-crushmap.json
>>>>>>       done
>>>>>>
>>>>>>     In larger clusters, even though a greater number of PGs are
>>>>>>     distributed, there are at most a few dozens devices per host and the
>>>>>>     problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>     few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>     is not enough to get an even distribution.
>>>>>>
>>>>>>     There is a secondary reason for the distribution to be uneven, when
>>>>>>     there is more than one replica. The second replica must be on a
>>>>>>     different device than the first replica. This conditional probability
>>>>>>     is not taken into account by CRUSH and would create an uneven
>>>>>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>     a given OSD can only handle a few hundred PGs and this conditional
>>>>>>     probability bias is dominated by the uneven distribution caused by the
>>>>>>     low number of PGs.
>>>>>>
>>>>>>     The uneven CRUSH distributions are always caused by a low number of
>>>>>>     samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>     between the desired distribution and the actual distribution) is
>>>>>>     random, it cannot be fixed by optimizations methods.  The
>>>>>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>     fails to find a gradient that would allow it to converge faster. And
>>>>>>     even if it did, the local minimum found would be as often wrong as
>>>>>>     with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>     filter[5] is equally unable to suppress the noise created by the
>>>>>>     uneven distribution because no coefficients can model a random noise.
>>>>>>
>>>>>>     With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>     like this:
>>>>>>
>>>>>>         - Distribute the desired number of PGs[7]
>>>>>>         - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>         - Add the subtracted weight to the OSD that is the most under used
>>>>>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>
>>>>>>     Quoting Adam Kupczyk, this works because:
>>>>>>
>>>>>>       "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>        stable way.  Specifically, if we increase weight on one node, we
>>>>>>        will get more PGs on this node and less on every other node:
>>>>>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>
>>>>>>     A nice side effect of this optimization algorithm is that it does not
>>>>>>     change the weight of the bucket containing the items being
>>>>>>     optimized. It is local to a bucket with no influence on the other
>>>>>>     parts of the crushmap (modulo the conditional probability bias).
>>>>>>
>>>>>>     In all tests the situation improves at least by an order of
>>>>>>     magnitude. For instance when there is a 30% difference between two
>>>>>>     OSDs, it is down to less than 3% after optimization.
>>>>>>
>>>>>>     The tests for the optimization method can be run with
>>>>>>
>>>>>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git>
>>>>>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>
>>>>>>     If anyone think of a reason why this algorithm won't work in some
>>>>>>     cases, please speak up :-)
>>>>>>
>>>>>>     Cheers
>>>>>>
>>>>>>     [1] python-crush http://crush.readthedocs.io/
>>>>>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2>
>>>>>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method>
>>>>>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb>
>>>>>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter>
>>>>>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39>
>>>>>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>
>>>>>>
>>>>>>     --
>>>>>>     Loïc Dachary, Artisan Logiciel Libre
>>>>>>     --
>>>>>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>>>>>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Spandan Kumar Sahu
>> IIT Kharagpur
> 
> 
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-06 13:21       ` Loic Dachary
       [not found]         ` <CAAXqJ+oTkwT4AP6U5BUBVLbkTPwcwo8rnK1ng-p3UroEHBDV2A@mail.gmail.com>
  2017-05-08  3:34         ` revisiting uneven CRUSH distributions Spandan Kumar Sahu
@ 2017-05-08 11:36         ` Dan van der Ster
  2017-05-08 12:14           ` Loic Dachary
  2 siblings, 1 reply; 37+ messages in thread
From: Dan van der Ster @ 2017-05-08 11:36 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Sat, May 6, 2017 at 3:21 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi Dan,
>
> The optimization works for pool 5, which is using the "data" rule. It's an extreme case because there are very few PGs per OSD (about 6) and, as expected, the distribution is very uneven; some OSDs even have no PG at all (the list is abbreviated but you can find it in full at https://paste2.org/3j1hbd3d):
>
>           ~id~  ~weight~      ~PGs~  ~over/under used %~
> ~name~
> osd.104    104  5.459991         17           124.674479
> osd.704    704  5.459991         16           111.458333
> osd.75      75  5.459991         16           111.458333
> ...
> osd.25      25  5.459991          2           -73.567708
> osd.336    336  5.459991          1           -86.783854
> osd.673    673  5.459991          1           -86.783854
> osd.496    496  5.459991          0          -100.000000
> osd.646    646  5.459991          0          -100.000000
>
> The failure domain (rack) only has ~5% over / under full racks. But, this has an impact on the uneven distribution within each rack.
>
>         ~id~    ~weight~      ~PGs~  ~over/under used %~
> ~name~
> RA13      -9  786.238770       1150             5.545609
> RA01     -72  911.818573       1270             0.506019
> RA09      -6  917.278564       1274             0.222439
> RA17     -14  900.898590       1238            -0.838857
> RA05      -4  917.278564       1212            -4.654948
>
> After optimization the distribution of the OSDs is still uneven, even though it improved significantly:
>
>           ~id~  ~weight~      ~PGs~  ~over/under used %~
> ~name~
> osd.252    252  5.459991         10            32.161458
> osd.330    330  5.459991         10            32.161458
> osd.571    571  5.459991          9            18.945312
> ...
> osd.261    261  5.459991          5           -33.919271
> osd.210    210  5.459991          5           -33.919271
> osd.1269  1269  5.459991          4           -47.135417
>
> and the racks' uneven distribution dropped under 0.5%, which is better:
>
>         ~id~    ~weight~      ~PGs~  ~over/under used %~
> ~name~
> RA17     -14  900.898590       1252             0.282513
> RA01     -72  911.818573       1267             0.268603
> RA13      -9  786.238770       1089            -0.052897
> RA05      -4  917.278564       1269            -0.170898
> RA09      -6  917.278564       1267            -0.328234
>
> I'll keep working on optimizing the two other pools. Don't hesitate to tell me if I'm going in the wrong direction.

Thanks. This seems to be going in the right direction.

But maybe we need to think about how best to handle these pools with
small numbers of PGs. (There will always be several pools with
relatively few PGs, e.g. the .rgw config pools).

Perhaps an overall heuristic would be to optimise the pools in
descending order of pg_num. Once the under/over usage is balanced
below some threshold, we stop optimizing -- meaning that the small
pools might not be optimised at all.
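
For illustration, a minimal sketch of that heuristic in Python -- the
pool dicts and the imbalance()/optimize() helpers are assumptions of
mine, not existing python-crush code:

    def rebalance(pools, optimize, imbalance, threshold=0.01):
        # Walk the pools from the largest pg_num down and stop as soon
        # as the cluster-wide under/over usage is below the threshold,
        # so the smallest pools may never be touched at all.
        for pool in sorted(pools, key=lambda p: p['pg_num'], reverse=True):
            if imbalance() < threshold:
                break
            optimize(pool)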

Cheers, Dan




>
> Cheers
>
>
> On 05/02/2017 12:39 PM, Dan van der Ster wrote:
>> On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>>>> Hi Loic,
>>>>
>>>> I'm not managing to compile this on my CentOS 7 dev box.
>>>
>>> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>>>
>>
>> Double requirement given: appdirs==1.4.3 (from -r
>> /root/git/python-crush/requirements-dev.txt (line 8)) (already in
>> appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
>> 10)), name='appdirs')
>>
>> [root@dvanders-work python-crush]# grep appdirs *.txt
>> requirements-dev.txt:appdirs==1.4.3
>> requirements.txt:appdirs==1.4.3
>>
>>
>>
>>>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>>>
>>> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
>>
>> pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
>> nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
>> stripe_width 0
>> pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
>> hashpspool,nodelete,nopgchange,nosizechange
>> min_read_recency_for_promote 1 stripe_width 0
>> pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
>> hashpspool,nodelete,nopgchange,nosizechange
>> min_read_recency_for_promote 1 stripe_width 0
>>
>>
>>>
>>>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>>>
>>> Thanks, I will :-)
>>>
>>
>> Cool, thanks!
>>
>> -- Dan
>>
>>>> Cheers, Dan
>>>>
>>>>
>>>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>>
>>>>     Hi,
>>>>
>>>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>     the same proportion. If an OSD is 75% full, it is expected that all
>>>>     other OSDs are also 75% full.
>>>>
>>>>     In reality the distribution is even only when more than 100,000 PGs
>>>>     are distributed in a pool of size 1 (i.e. no replication).
>>>>
>>>>     In small clusters there are a few thousands PGs and it is not enough
>>>>     to get an even distribution. Running the following with
>>>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>
>>>>       for PGs in 1000 10000 100000 1000000 ; do
>>>>         crush analyze --replication-count 1 \
>>>>                       --type device \
>>>>                       --values-count $PGs \
>>>>                       --rule data \
>>>>                       --crushmap tests/sample-crushmap.json
>>>>       done
>>>>
>>>>     In larger clusters, even though a greater number of PGs are
>>>>     distributed, there are at most a few dozens devices per host and the
>>>>     problem remains. On a machine with 24 OSDs each expected to handle a
>>>>     few hundred PGs, a total of a few thousands PGs are distributed which
>>>>     is not enough to get an even distribution.
>>>>
>>>>     There is a secondary reason for the distribution to be uneven, when
>>>>     there is more than one replica. The second replica must be on a
>>>>     different device than the first replica. This conditional probability
>>>>     is not taken into account by CRUSH and would create an uneven
>>>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>     a given OSD can only handle a few hundred PGs and this conditional
>>>>     probability bias is dominated by the uneven distribution caused by the
>>>>     low number of PGs.
>>>>
>>>>     The uneven CRUSH distributions are always caused by a low number of
>>>>     samples, even in large clusters. Since this noise (i.e. the difference
>>>>     between the desired distribution and the actual distribution) is
>>>>     random, it cannot be fixed by optimizations methods.  The
>>>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>     fails to find a gradient that would allow it to converge faster. And
>>>>     even if it did, the local minimum found would be as often wrong as
>>>>     with Nedler-Mead, only it would go faster. A least mean squares
>>>>     filter[5] is equally unable to suppress the noise created by the
>>>>     uneven distribution because no coefficients can model a random noise.
>>>>
>>>>     With that in mind, I implemented a simple optimization algorithm[6]
>>>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>     like this:
>>>>
>>>>         - Distribute the desired number of PGs[7]
>>>>         - Subtract 1% of the weight of the OSD that is the most over used
>>>>         - Add the subtracted weight to the OSD that is the most under used
>>>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>
>>>>     Quoting Adam Kupczyk, this works because:
>>>>
>>>>       "...CRUSH is not random proces at all, it behaves in numerically
>>>>        stable way.  Specifically, if we increase weight on one node, we
>>>>        will get more PGs on this node and less on every other node:
>>>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>
>>>>     A nice side effect of this optimization algorithm is that it does not
>>>>     change the weight of the bucket containing the items being
>>>>     optimized. It is local to a bucket with no influence on the other
>>>>     parts of the crushmap (modulo the conditional probability bias).
>>>>
>>>>     In all tests the situation improves at least by an order of
>>>>     magnitude. For instance when there is a 30% difference between two
>>>>     OSDs, it is down to less than 3% after optimization.
>>>>
>>>>     The tests for the optimization method can be run with
>>>>
>>>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git>
>>>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>
>>>>     If anyone think of a reason why this algorithm won't work in some
>>>>     cases, please speak up :-)
>>>>
>>>>     Cheers
>>>>
>>>>     [1] python-crush http://crush.readthedocs.io/
>>>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2>
>>>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method>
>>>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb>
>>>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter>
>>>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39>
>>>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>
>>>>
>>>>     --
>>>>     Loïc Dachary, Artisan Logiciel Libre
>>>>     --
>>>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>>>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
>>>>
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-08 11:36         ` Dan van der Ster
@ 2017-05-08 12:14           ` Loic Dachary
  0 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-08 12:14 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Development



On 05/08/2017 01:36 PM, Dan van der Ster wrote:
> On Sat, May 6, 2017 at 3:21 PM, Loic Dachary <loic@dachary.org> wrote:
>> Hi Dan,
>>
>> The optimization works for pool 5, which is using the "data" rule. It's an extreme case because there are very few PGs per OSD (about 6) and, as expected, the distribution is very uneven; some OSDs even have no PG at all (the list is abbreviated but you can find it in full at https://paste2.org/3j1hbd3d):
>>
>>           ~id~  ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> osd.104    104  5.459991         17           124.674479
>> osd.704    704  5.459991         16           111.458333
>> osd.75      75  5.459991         16           111.458333
>> ...
>> osd.25      25  5.459991          2           -73.567708
>> osd.336    336  5.459991          1           -86.783854
>> osd.673    673  5.459991          1           -86.783854
>> osd.496    496  5.459991          0          -100.000000
>> osd.646    646  5.459991          0          -100.000000
>>
>> The failure domain (rack) only has ~5% over / under full racks. But, this has an impact on the uneven distribution within each rack.
>>
>>         ~id~    ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> RA13      -9  786.238770       1150             5.545609
>> RA01     -72  911.818573       1270             0.506019
>> RA09      -6  917.278564       1274             0.222439
>> RA17     -14  900.898590       1238            -0.838857
>> RA05      -4  917.278564       1212            -4.654948
>>
>> After optimization the distribution of the OSDs is still uneven, even though it improved significantly:
>>
>>           ~id~  ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> osd.252    252  5.459991         10            32.161458
>> osd.330    330  5.459991         10            32.161458
>> osd.571    571  5.459991          9            18.945312
>> ...
>> osd.261    261  5.459991          5           -33.919271
>> osd.210    210  5.459991          5           -33.919271
>> osd.1269  1269  5.459991          4           -47.135417
>>
>> and the racks' uneven distribution dropped under 0.5%, which is better:
>>
>>         ~id~    ~weight~      ~PGs~  ~over/under used %~
>> ~name~
>> RA17     -14  900.898590       1252             0.282513
>> RA01     -72  911.818573       1267             0.268603
>> RA13      -9  786.238770       1089            -0.052897
>> RA05      -4  917.278564       1269            -0.170898
>> RA09      -6  917.278564       1267            -0.328234
>>
>> I'll keep working on optimizing the two other pools. Don't hesitate to tell me if I'm going in the wrong direction.
> 
> Thanks. This seems to be going in the right direction.

Cool :-)

> 
> But maybe we need to think about how best to handle these pools with
> small numbers of PGs. (There will always be several pools with
> relatively few PGs, e.g. the .rgw config pools).
> 
> Perhaps an overall heuristic would be to optimise the pools in
> descending order of pg_num. Once the under/over usage is balanced
> below some threshold, we stop optimizing -- meaning that the small
> pools might not be optimised at all.

Oh definitely. Optimizing when OSDs host PGs from multiple pools is the next challenge. Hopefully we now have the right tools for the job. Xavier Villaneau started to work on that (see the draft at http://libcrush.org/xvillaneau/crush-docs/raw/master/converted/Ceph%20pool%20capacity%20analysis.pdf ).

Cheers

> 
> Cheers, Dan
> 
> 
> 
> 
>>
>> Cheers
>>
>>
>> On 05/02/2017 12:39 PM, Dan van der Ster wrote:
>>> On Tue, May 2, 2017 at 12:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>>> On 05/02/2017 11:35 AM, Dan van der Ster wrote:
>>>>> Hi Loic,
>>>>>
>>>>> I'm not managing to compile this on my CentOS 7 dev box.
>>>>
>>>> What error do you get ? With pip 8.1 + you should not need to compile, there are binary wheels available.
>>>>
>>>
>>> Double requirement given: appdirs==1.4.3 (from -r
>>> /root/git/python-crush/requirements-dev.txt (line 8)) (already in
>>> appdirs==1.4.3 (from -r /root/git/python-crush/requirements.txt (line
>>> 10)), name='appdirs')
>>>
>>> [root@dvanders-work python-crush]# grep appdirs *.txt
>>> requirements-dev.txt:appdirs==1.4.3
>>> requirements.txt:appdirs==1.4.3
>>>
>>>
>>>
>>>>> Do you want to try a "complicated" crush map? Here is ours: https://www.dropbox.com/s/ihg7cwz7wug50pb/cern.crush?dl=1
>>>>
>>>> Could you also tell me the pool numbers, pg_num and size and the rule they use ?
>>>
>>> pool 4 'volumes' replicated size 3 min_size 2 crush_ruleset 0
>>> object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 628176 flags
>>> nodelete,nopgchange,nosizechange min_read_recency_for_promote 1
>>> stripe_width 0
>>> pool 5 'images' replicated size 3 min_size 2 crush_ruleset 0
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 628178 flags
>>> hashpspool,nodelete,nopgchange,nosizechange
>>> min_read_recency_for_promote 1 stripe_width 0
>>> pool 75 'cinder-critical' replicated size 3 min_size 2 crush_ruleset 4
>>> object_hash rjenkins pg_num 8192 pgp_num 8192 last_change 587162 flags
>>> hashpspool,nodelete,nopgchange,nosizechange
>>> min_read_recency_for_promote 1 stripe_width 0
>>>
>>>
>>>>
>>>>> The important rules are "data" and "critical", and note that there are two rooms which are expected to fill at different rates. So we'd like to optimize separately for buckets 0513-R-0050 and 0513-R-0060.
>>>>
>>>> Thanks, I will :-)
>>>>
>>>
>>> Cool, thanks!
>>>
>>> -- Dan
>>>
>>>>> Cheers, Dan
>>>>>
>>>>>
>>>>> On Sun, Apr 30, 2017 at 4:15 PM, Loic Dachary <loic@dachary.org <mailto:loic@dachary.org>> wrote:
>>>>>
>>>>>     Hi,
>>>>>
>>>>>     Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>     the same proportion. If an OSD is 75% full, it is expected that all
>>>>>     other OSDs are also 75% full.
>>>>>
>>>>>     In reality the distribution is even only when more than 100,000 PGs
>>>>>     are distributed in a pool of size 1 (i.e. no replication).
>>>>>
>>>>>     In small clusters there are a few thousands PGs and it is not enough
>>>>>     to get an even distribution. Running the following with
>>>>>     python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>     6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>
>>>>>       for PGs in 1000 10000 100000 1000000 ; do
>>>>>         crush analyze --replication-count 1 \
>>>>>                       --type device \
>>>>>                       --values-count $PGs \
>>>>>                       --rule data \
>>>>>                       --crushmap tests/sample-crushmap.json
>>>>>       done
>>>>>
>>>>>     In larger clusters, even though a greater number of PGs are
>>>>>     distributed, there are at most a few dozens devices per host and the
>>>>>     problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>     few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>     is not enough to get an even distribution.
>>>>>
>>>>>     There is a secondary reason for the distribution to be uneven, when
>>>>>     there is more than one replica. The second replica must be on a
>>>>>     different device than the first replica. This conditional probability
>>>>>     is not taken into account by CRUSH and would create an uneven
>>>>>     distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>     a given OSD can only handle a few hundred PGs and this conditional
>>>>>     probability bias is dominated by the uneven distribution caused by the
>>>>>     low number of PGs.
>>>>>
>>>>>     The uneven CRUSH distributions are always caused by a low number of
>>>>>     samples, even in large clusters. Since this noise (i.e. the difference
>>>>>     between the desired distribution and the actual distribution) is
>>>>>     random, it cannot be fixed by optimizations methods.  The
>>>>>     Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>     the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>     fails to find a gradient that would allow it to converge faster. And
>>>>>     even if it did, the local minimum found would be as often wrong as
>>>>>     with Nedler-Mead, only it would go faster. A least mean squares
>>>>>     filter[5] is equally unable to suppress the noise created by the
>>>>>     uneven distribution because no coefficients can model a random noise.
>>>>>
>>>>>     With that in mind, I implemented a simple optimization algorithm[6]
>>>>>     which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>     like this:
>>>>>
>>>>>         - Distribute the desired number of PGs[7]
>>>>>         - Subtract 1% of the weight of the OSD that is the most over used
>>>>>         - Add the subtracted weight to the OSD that is the most under used
>>>>>         - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>
>>>>>     Quoting Adam Kupczyk, this works because:
>>>>>
>>>>>       "...CRUSH is not random proces at all, it behaves in numerically
>>>>>        stable way.  Specifically, if we increase weight on one node, we
>>>>>        will get more PGs on this node and less on every other node:
>>>>>        CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>
>>>>>     A nice side effect of this optimization algorithm is that it does not
>>>>>     change the weight of the bucket containing the items being
>>>>>     optimized. It is local to a bucket with no influence on the other
>>>>>     parts of the crushmap (modulo the conditional probability bias).
>>>>>
>>>>>     In all tests the situation improves at least by an order of
>>>>>     magnitude. For instance when there is a 30% difference between two
>>>>>     OSDs, it is down to less than 3% after optimization.
>>>>>
>>>>>     The tests for the optimization method can be run with
>>>>>
>>>>>        git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git <http://libcrush.org/dachary/python-crush.git>
>>>>>        tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>
>>>>>     If anyone think of a reason why this algorithm won't work in some
>>>>>     cases, please speak up :-)
>>>>>
>>>>>     Cheers
>>>>>
>>>>>     [1] python-crush http://crush.readthedocs.io/
>>>>>     [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2 <http://marc.info/?l=ceph-devel&m=148539995928656&w=2>
>>>>>     [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method <https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method>
>>>>>     [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb <https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb>
>>>>>     [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter <https://en.wikipedia.org/wiki/Least_mean_squares_filter>
>>>>>     [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39 <http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39>
>>>>>     [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>     [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>
>>>>>
>>>>>     --
>>>>>     Loïc Dachary, Artisan Logiciel Libre
>>>>>     --
>>>>>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>     the body of a message to majordomo@vger.kernel.org <mailto:majordomo@vger.kernel.org>
>>>>>     More majordomo info at  http://vger.kernel.org/majordomo-info.html <http://vger.kernel.org/majordomo-info.html>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-02  7:32           ` Loic Dachary
@ 2017-05-14 17:46             ` Loic Dachary
  2017-05-15 19:08               ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-14 17:46 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development

Hi Stefan,

A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
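
To make the idea concrete, here is a rough sketch of the loop the
subcommand is built around (not the actual python-crush code; the
simulate_pgs() and kl_divergence() helpers are stand-ins for the real
simulation and metric):

    def optimize_weights(weights, simulate_pgs, kl_divergence, epsilon=1e-4):
        # Move 1% of weight from the most over-used item to the most
        # under-used one until the distribution is close enough to even.
        while True:
            pgs = simulate_pgs(weights)        # PGs mapped to each OSD
            if kl_divergence(pgs, weights) < epsilon:
                return weights
            usage = [p / w for p, w in zip(pgs, weights)]
            over = usage.index(max(usage))     # most over-used OSD
            under = usage.index(min(usage))    # most under-used OSD
            delta = weights[over] * 0.01       # 1% of its weight
            weights[over] -= delta
            weights[under] += delta

Each intermediate weights list could be dumped as its own crushmap to
produce the series of small steps mentioned above.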

Would that be useful for the clusters you have ?

Cheers

[1] http://crush.readthedocs.io/

On 05/02/2017 09:32 AM, Loic Dachary wrote:
> 
> 
> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>> Hi Loic,
>>
>> yes, I didn't change them to straw2 as I didn't see any difference. I
>> switched to straw2 now but it didn't change anything at all.
> 
> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
> 
>> If I use those weights manually, do I have to adjust them on every crush
>> change on the cluster? That's something I don't really like to do.
> 
> This is not practical indeed :-) I'm hoping python-crush can automate that.
> 
> Cheers
> 
>> Greets,
>> Stefan
>>
>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>> It is working, with straw2 (your cluster still is using straw).
>>>
>>> For instance for one host it goes from:
>>>
>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>> ~name~
>>> osd.24         149        159                 6.65     10.0      6.71
>>> osd.29         149        159                 6.65     10.0      6.71
>>> osd.0           69         77                11.04      8.0     11.59
>>> osd.2           69         69                -0.50      0.0      0.00
>>> osd.42         149        148                -0.73     -1.0     -0.67
>>> osd.1           69         62               -10.59     -7.0    -10.14
>>> osd.23          69         62               -10.59     -7.0    -10.14
>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>
>>> to
>>>
>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>> ~name~
>>> osd.0           69         69                -0.50      0.0      0.00
>>> osd.23          69         69                -0.50      0.0      0.00
>>> osd.24         149        149                -0.06      0.0      0.00
>>> osd.29         149        149                -0.06      0.0      0.00
>>> osd.36         149        149                -0.06      0.0      0.00
>>> osd.1           69         68                -1.94     -1.0     -1.45
>>> osd.2           69         68                -1.94     -1.0     -1.45
>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>
>>> By changing the weights to
>>>
>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>
>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>
>>>
>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>> Hi Stefan,
>>>>>
>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>
>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>
>>>> I've lots of them ;-)
>>>>
>>>> Will sent you one via private e-mail in some minutes.
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>>> Cheers
>>>>>
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>> other OSDs are also 75% full.
>>>>>>>
>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>
>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>> to get an even distribution. Running the following with
>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>
>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>                   --type device \
>>>>>>>                   --values-count $PGs \
>>>>>>>                   --rule data \
>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>   done
>>>>>>>
>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>> is not enough to get an even distribution.
>>>>>>>
>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>> different device than the first replica. This conditional probability
>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>> low number of PGs.
>>>>>>>
>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>
>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>> like this:
>>>>>>>
>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>
>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>
>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>
>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>> change the weight of the bucket containing the items being
>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>
>>>>>>> In all tests the situation improves at least by an order of
>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>
>>>>>>> The tests for the optimization method can be run with
>>>>>>>
>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>
>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>> cases, please speak up :-)
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-14 17:46             ` Loic Dachary
@ 2017-05-15 19:08               ` Stefan Priebe - Profihost AG
  2017-05-15 20:35                 ` Loic Dachary
  2017-05-22 18:44                 ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-15 19:08 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

Hello Loic,

sounds good but my initial question was if this shouldn't be integrated
in ceph-deploy - so when you add OSDs it also does the correct reweight?

Greets,
Stefan

Am 14.05.2017 um 19:46 schrieb Loic Dachary:
> Hi Stefan,
> 
> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
> 
> Would that be useful for the clusters you have ?
> 
> Cheers
> 
> [1] http://crush.readthedocs.io/
> 
> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>
>>
>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>> Hi Loic,
>>>
>>> yes, I didn't change them to straw2 as I didn't see any difference. I
>>> switched to straw2 now but it didn't change anything at all.
>>
>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>
>>> If I use those weights manually, do I have to adjust them on every crush
>>> change on the cluster? That's something I don't really like to do.
>>
>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>
>> Cheers
>>
>>> Greets,
>>> Stefan
>>>
>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>> It is working, with straw2 (your cluster still is using straw).
>>>>
>>>> For instance for one host it goes from:
>>>>
>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>> ~name~
>>>> osd.24         149        159                 6.65     10.0      6.71
>>>> osd.29         149        159                 6.65     10.0      6.71
>>>> osd.0           69         77                11.04      8.0     11.59
>>>> osd.2           69         69                -0.50      0.0      0.00
>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>
>>>> to
>>>>
>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>> ~name~
>>>> osd.0           69         69                -0.50      0.0      0.00
>>>> osd.23          69         69                -0.50      0.0      0.00
>>>> osd.24         149        149                -0.06      0.0      0.00
>>>> osd.29         149        149                -0.06      0.0      0.00
>>>> osd.36         149        149                -0.06      0.0      0.00
>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>
>>>> By changing the weights to
>>>>
>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>
>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>
>>>>
>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>> Hi Stefan,
>>>>>>
>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>
>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>
>>>>> I've lots of them ;-)
>>>>>
>>>>> Will sent you one via private e-mail in some minutes.
>>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>> other OSDs are also 75% full.
>>>>>>>>
>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>
>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>> to get an even distribution. Running the following with
>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>
>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>                   --type device \
>>>>>>>>                   --values-count $PGs \
>>>>>>>>                   --rule data \
>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>   done
>>>>>>>>
>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>> is not enough to get an even distribution.
>>>>>>>>
>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>> low number of PGs.
>>>>>>>>
>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>
>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>> like this:
>>>>>>>>
>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>
>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>
>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>
>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>
>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>
>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>
>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>
>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>> cases, please speak up :-)
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-15 19:08               ` Stefan Priebe - Profihost AG
@ 2017-05-15 20:35                 ` Loic Dachary
  2017-05-16  6:15                   ` Stefan Priebe - Profihost AG
  2017-05-22 18:44                 ` Stefan Priebe - Profihost AG
  1 sibling, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-15 20:35 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development



On 05/15/2017 09:08 PM, Stefan Priebe - Profihost AG wrote:
> Hello Loic,
> 
> sounds good but my initial question was if this shouldn't be integrated
> in ceph-deploy - so when you add OSDs it also does the correct reweight?

Ideally it should be fully transparent and we can forget the problem ever existed. I think we'll get there, maybe with a ceph-mgr task running on a regular basis to gradually optimize when it can't be done in real time. It won't be ready for Luminous but it could be for M*.
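
Purely as an illustration of what such a ceph-mgr task could look like
(everything below is hypothetical, none of these hooks exist today):

    import time

    def balancer_task(get_crushmap, optimize_step, apply_crushmap,
                      max_steps=10, interval=3600):
        # Every hour, apply at most a handful of small optimization
        # steps so that only a limited number of PGs move at a time.
        while True:
            crushmap = get_crushmap()
            for _ in range(max_steps):
                crushmap, changed = optimize_step(crushmap)
                if not changed:
                    break              # nothing left to improve
            apply_crushmap(crushmap)
            time.sleep(interval)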

Cheers

> Greets,
> Stefan
> 
> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>> Hi Stefan,
>>
>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>
>> Would that be useful for the clusters you have ?
>>
>> Cheers
>>
>> [1] http://crush.readthedocs.io/
>>
>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>
>>>
>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Loic,
>>>>
>>>> yes, I didn't change them to straw2 as I didn't see any difference. I
>>>> switched to straw2 now but it didn't change anything at all.
>>>
>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>>
>>>> If I use those weights manually, do I have to adjust them on every crush
>>>> change on the cluster? That's something I don't really like to do.
>>>
>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>
>>> Cheers
>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>
>>>>> For instance for one host it goes from:
>>>>>
>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>> ~name~
>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>
>>>>> to
>>>>>
>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>> ~name~
>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>
>>>>> By changing the weights to
>>>>>
>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>
>>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>>
>>>>>
>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>> Hi Stefan,
>>>>>>>
>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>
>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>
>>>>>> I've lots of them ;-)
>>>>>>
>>>>>> Will sent you one via private e-mail in some minutes.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>
>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>
>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>
>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>                   --type device \
>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>                   --rule data \
>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>   done
>>>>>>>>>
>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>
>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>> low number of PGs.
>>>>>>>>>
>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>
>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>
>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>
>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>
>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>
>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>
>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>
>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>
>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>> cases, please speak up :-)
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-15 20:35                 ` Loic Dachary
@ 2017-05-16  6:15                   ` Stefan Priebe - Profihost AG
  2017-05-16  8:14                     ` Loic Dachary
  0 siblings, 1 reply; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-16  6:15 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

Hello Loic,

thanks for the clarification. Sounds good so far. Is it planned to provide
packages from the repo, so that we do not need to have pip and a
compiler installed on the systems?

Greets,
Stefan

Am 15.05.2017 um 22:35 schrieb Loic Dachary:
> 
> 
> On 05/15/2017 09:08 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Loic,
>>
>> sounds good but my initial question was if this shouldn't be integrated
>> in ceph-deploy - so when you add OSDs it also does the correct reweight?
> 
> Ideally it should be fully transparent and we can forget the problem ever existed. I think we'll get there, maybe with a ceph-mgr task running on a regular basis to gradually optimize when it can't be done in real time. It won't be ready for Luminous but it could be for M*.
> 
> Cheers
> 
>> Greets,
>> Stefan
>>
>> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>>> Hi Stefan,
>>>
>>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>>
>>> Would that be useful for the clusters you have ?
>>>
>>> Cheers
>>>
>>> [1] http://crush.readthedocs.io/
>>>
>>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Loic,
>>>>>
>>>>> yes i didn't changed them to straw2 as i didn't saw any difference. I
>>>>> switched to straw2 now but it didn't change anything at all.
>>>>
>>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>>>
>>>>> If i use those weights manuall i've to adjust them on every crush change
>>>>> on the cluster? That's something i don't really like to do.
>>>>
>>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>>
>>>> Cheers
>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>>
>>>>>> For instance for one host it goes from:
>>>>>>
>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>>
>>>>>> to
>>>>>>
>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>>
>>>>>> By changing the weights to
>>>>>>
>>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>>
>>>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>>>
>>>>>>
>>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>>> Hi Stefan,
>>>>>>>>
>>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>>
>>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>>
>>>>>>> I've lots of them ;-)
>>>>>>>
>>>>>>> Will sent you one via private e-mail in some minutes.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>>
>>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>>
>>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>>
>>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>>                   --type device \
>>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>>                   --rule data \
>>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>>   done
>>>>>>>>>>
>>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>>
>>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>>> low number of PGs.
>>>>>>>>>>
>>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>>
>>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>>> like this:
>>>>>>>>>>
>>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>>
>>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>>
>>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>>
>>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>>
>>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>>
>>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>>
>>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>>
>>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>>> cases, please speak up :-)
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-16  6:15                   ` Stefan Priebe - Profihost AG
@ 2017-05-16  8:14                     ` Loic Dachary
  0 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-16  8:14 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development

Hi Stefan,

On 05/16/2017 08:15 AM, Stefan Priebe - Profihost AG wrote:
> Hello Loic,
> 
> thanks for the clarification. Sounds good so far. Is it planned to provide
> packages from the repo, so that we do not need to have pip and a
> compiler installed on the systems?

With pip 8.1+ you can get binary wheels for python-crush and its dependencies, so there is no need for a compiler. I'm not sure exactly how it will be packaged in the end, though.
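
For instance (assuming the PyPI name stays "crush"), something like

   pip install --user crush

on a machine with pip 8.1+ should pull the pre-built wheels and their dependencies, with no gcc or python headers required.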

Cheers

> 
> Greets,
> Stefan
> 
> Am 15.05.2017 um 22:35 schrieb Loic Dachary:
>>
>>
>> On 05/15/2017 09:08 PM, Stefan Priebe - Profihost AG wrote:
>>> Hello Loic,
>>>
>>> sounds good but my initial question was if this shouldn't be integrated
>>> in ceph-deploy - so when you add OSDs it also does the correct reweight?
>>
>> Ideally it should be fully transparent and we can forget the problem ever existed. I think we'll get there, maybe with a ceph-mgr task running on a regular basis to gradually optimize when it can't be done in real time. It won't be ready for Luminous but it could be for M*.
>>
>> Cheers
>>
>>> Greets,
>>> Stefan
>>>
>>> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>>>> Hi Stefan,
>>>>
>>>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>>>
>>>> Would that be useful for the clusters you have ?
>>>>
>>>> Cheers
>>>>
>>>> [1] http://crush.readthedocs.io/
>>>>
>>>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi Loic,
>>>>>>
>>>>>> yes i didn't changed them to straw2 as i didn't saw any difference. I
>>>>>> switched to straw2 now but it didn't change anything at all.
>>>>>
>>>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>>>>
>>>>>> If i use those weights manuall i've to adjust them on every crush change
>>>>>> on the cluster? That's something i don't really like to do.
>>>>>
>>>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>>>
>>>>> Cheers
>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>>>
>>>>>>> For instance for one host it goes from:
>>>>>>>
>>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>>> ~name~
>>>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>>>
>>>>>>> to
>>>>>>>
>>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>>> ~name~
>>>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>>>
>>>>>>> By changing the weights to
>>>>>>>
>>>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>>>
>>>>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>>>>
>>>>>>>
>>>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>>>> Hi Stefan,
>>>>>>>>>
>>>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>>>
>>>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>>>
>>>>>>>> I've lots of them ;-)
>>>>>>>>
>>>>>>>> Will sent you one via private e-mail in some minutes.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>>>
>>>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>>>
>>>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>>>
>>>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>>>                   --type device \
>>>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>>>                   --rule data \
>>>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>>>   done
>>>>>>>>>>>
>>>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>>>
>>>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>>>> low number of PGs.
>>>>>>>>>>>
>>>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>>>
>>>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>>>> like this:
>>>>>>>>>>>
>>>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>>>
>>>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>>>
>>>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>>>
>>>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>>>
>>>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>>>
>>>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>>>
>>>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>>>
>>>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>>>> cases, please speak up :-)
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-15 19:08               ` Stefan Priebe - Profihost AG
  2017-05-15 20:35                 ` Loic Dachary
@ 2017-05-22 18:44                 ` Stefan Priebe - Profihost AG
  2017-05-22 19:00                   ` Loic Dachary
  1 sibling, 1 reply; 37+ messages in thread
From: Stefan Priebe - Profihost AG @ 2017-05-22 18:44 UTC (permalink / raw)
  To: Loic Dachary, Ceph Development

Hello Loic,

I want to optimize a crush map. What are the exact steps to achieve this?

http://crush.readthedocs.io/en/latest/
doesn't tell me about an optimization command.

Stefan

Am 15.05.2017 um 21:08 schrieb Stefan Priebe - Profihost AG:
> Hello Loic,
> 
> sounds good but my initial question was if this shouldn't be integrated
> in ceph-deploy - so when you add OSDs it also does the correct reweight?
> 
> Greets,
> Stefan
> 
> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>> Hi Stefan,
>>
>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>
>> Would that be useful for the clusters you have ?
>>
>> Cheers
>>
>> [1] http://crush.readthedocs.io/
>>
>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>
>>>
>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>> Hi Loic,
>>>>
>>>> yes i didn't changed them to straw2 as i didn't saw any difference. I
>>>> switched to straw2 now but it didn't change anything at all.
>>>
>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>>
>>>> If i use those weights manuall i've to adjust them on every crush change
>>>> on the cluster? That's something i don't really like to do.
>>>
>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>
>>> Cheers
>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>
>>>>> For instance for one host it goes from:
>>>>>
>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>> ~name~
>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>
>>>>> to
>>>>>
>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>> ~name~
>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>
>>>>> By changing the weights to
>>>>>
>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>
>>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>>
>>>>>
>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>> Hi Stefan,
>>>>>>>
>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>
>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>
>>>>>> I've lots of them ;-)
>>>>>>
>>>>>> Will sent you one via private e-mail in some minutes.
>>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>
>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>
>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>
>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>                   --type device \
>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>                   --rule data \
>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>   done
>>>>>>>>>
>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>
>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>> low number of PGs.
>>>>>>>>>
>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>
>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>> like this:
>>>>>>>>>
>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>
>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>
>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>
>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>
>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>
>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>
>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>
>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>> cases, please speak up :-)
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>
>>>>
>>>
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-22 18:44                 ` Stefan Priebe - Profihost AG
@ 2017-05-22 19:00                   ` Loic Dachary
  2017-05-27 10:03                     ` 攀刘
  0 siblings, 1 reply; 37+ messages in thread
From: Loic Dachary @ 2017-05-22 19:00 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Ceph Development

Hi Stefan,

On 05/22/2017 09:44 PM, Stefan Priebe - Profihost AG wrote:
> Hello Loic,
> 
> i want to optimize a crush map. What are the exact steps to archieve this?
> 
> http://crush.readthedocs.io/en/latest/
> doesn't tell me about an optimization command.

It's not published yet. I was hoping to finish it last week but ... I did something really stupid (early optimization :-). Fortunately I realized my mistake tonight while discussing the problem with a friend over a beer. Long story short: I'm optimistic about publishing something sensible in the next few days.

If you send me the ceph report of the cluster you'd like to optimize, I'll make sure it works as expected. I've been using the ceph report you sent me last week as well; it has been very helpful.
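
For reference, something like

   ceph report > report.json

run on one of the monitors should capture everything I need (the crushmap, the osdmap and the pool parameters).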

Cheers

> 
> Stefan
> 
> Am 15.05.2017 um 21:08 schrieb Stefan Priebe - Profihost AG:
>> Hello Loic,
>>
>> sounds good but my initial question was if this shouldn't be integrated
>> in ceph-deploy - so when you add OSDs it also does the correct reweight?
>>
>> Greets,
>> Stefan
>>
>> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>>> Hi Stefan,
>>>
>>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>>
>>> Would that be useful for the clusters you have ?
>>>
>>> Cheers
>>>
>>> [1] http://crush.readthedocs.io/
>>>
>>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>>
>>>>
>>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>>> Hi Loic,
>>>>>
>>>>> yes i didn't changed them to straw2 as i didn't saw any difference. I
>>>>> switched to straw2 now but it didn't change anything at all.
>>>>
>>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>>>
>>>>> If i use those weights manuall i've to adjust them on every crush change
>>>>> on the cluster? That's something i don't really like to do.
>>>>
>>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>>
>>>> Cheers
>>>>
>>>>> Greets,
>>>>> Stefan
>>>>>
>>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>>
>>>>>> For instance for one host it goes from:
>>>>>>
>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>>
>>>>>> to
>>>>>>
>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>> ~name~
>>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>>
>>>>>> By changing the weights to
>>>>>>
>>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>>
>>>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>>>
>>>>>>
>>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>>> Hi Stefan,
>>>>>>>>
>>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>>
>>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>>
>>>>>>> I've lots of them ;-)
>>>>>>>
>>>>>>> Will sent you one via private e-mail in some minutes.
>>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>>
>>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>>
>>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>>
>>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>>                   --type device \
>>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>>                   --rule data \
>>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>>   done
>>>>>>>>>>
>>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>>
>>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>>> low number of PGs.
>>>>>>>>>>
>>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>>
>>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>>> like this:
>>>>>>>>>>
>>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>>
>>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>>
>>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>>
>>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>>
>>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>>
>>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>>
>>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>>
>>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>>> cases, please speak up :-)
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-22 19:00                   ` Loic Dachary
@ 2017-05-27 10:03                     ` 攀刘
  2017-05-27 10:13                       ` Loic Dachary
  0 siblings, 1 reply; 37+ messages in thread
From: 攀刘 @ 2017-05-27 10:03 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Stefan Priebe - Profihost AG, Ceph Development

Hi Loic,

I am also looking forward to this tool! Please let me know as soon as
it is finished. At the moment, for our on-line production environment, I
have to reweight multiple times to work around the uneven PG
distribution. But I have to say, I still have some concerns about
primary affinity:

I used a small script to count the primary PGs of each OSD:

   # count the PGs for which each OSD is primary
   for a in `ceph osd ls`; do echo -n "$a:"; \
     ceph pg ls-by-primary $a | grep -v pg_stat | wc -l; done

I tested it in a cluster with 228 OSDs and 8192 PGs: the maximum number
of primary PGs on one OSD is 50, the minimum is 21.
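
Right now the only workaround I know of is to adjust the weights and the
primary affinity by hand, along the lines of (osd.12 and the values are
only examples):

   ceph osd reweight 12 0.95
   ceph osd primary-affinity osd.12 0.8

and to repeat that whenever the cluster changes.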

Will your tool also handle this primary PG imbalance?

Thanks
Pa


2017-05-23 3:00 GMT+08:00 Loic Dachary <loic@dachary.org>:
> Hi Stefan,
>
> On 05/22/2017 09:44 PM, Stefan Priebe - Profihost AG wrote:
>> Hello Loic,
>>
>> i want to optimize a crush map. What are the exact steps to archieve this?
>>
>> http://crush.readthedocs.io/en/latest/
>> doesn't tell me about an optimization command.
>
> It's not published yet. I was hoping to finish it last week but ... I did something really stupid (early optimization :-). Fortunately I realized my mistake tonight while discussing the problem with a friend over a beer. Long story short: I'm optimistic about publishing something sensible in the next few days.
>
> If you send me the ceph report of the cluster you'd like to optimize, I'll make sure it works as expected. I've been using the ceph report you sent me last week as well, it has been very helpful.
>
> Cheers
>
>>
>> Stefan
>>
>> Am 15.05.2017 um 21:08 schrieb Stefan Priebe - Profihost AG:
>>> Hello Loic,
>>>
>>> sounds good but my initial question was if this shouldn't be integrated
>>> in ceph-deploy - so when you add OSDs it also does the correct reweight?
>>>
>>> Greets,
>>> Stefan
>>>
>>> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>>>> Hi Stefan,
>>>>
>>>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>>>
>>>> Would that be useful for the clusters you have ?
>>>>
>>>> Cheers
>>>>
>>>> [1] http://crush.readthedocs.io/
>>>>
>>>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>>>
>>>>>
>>>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>>>> Hi Loic,
>>>>>>
>>>>>> yes i didn't changed them to straw2 as i didn't saw any difference. I
>>>>>> switched to straw2 now but it didn't change anything at all.
>>>>>
>>>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say the optimization only works on straw2 buckets, it is not implemented for straw buckets.
>>>>>
>>>>>> If i use those weights manuall i've to adjust them on every crush change
>>>>>> on the cluster? That's something i don't really like to do.
>>>>>
>>>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>>>
>>>>> Cheers
>>>>>
>>>>>> Greets,
>>>>>> Stefan
>>>>>>
>>>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>>>> It is working, with straw2 (your cluster still is using straw).
>>>>>>>
>>>>>>> For instance for one host it goes from:
>>>>>>>
>>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>>> ~name~
>>>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>>>
>>>>>>> to
>>>>>>>
>>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>>> ~name~
>>>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>>>
>>>>>>> By changing the weights to
>>>>>>>
>>>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>>>
>>>>>>> And you could set these weights on the crushmap, there would be no need for backporting.
>>>>>>>
>>>>>>>
>>>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>>>> Hi Stefan,
>>>>>>>>>
>>>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>>>
>>>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>>>
>>>>>>>> I've lots of them ;-)
>>>>>>>>
>>>>>>>> Will sent you one via private e-mail in some minutes.
>>>>>>>>
>>>>>>>> Greets,
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>>>
>>>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>>>
>>>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>>>
>>>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>>>                   --type device \
>>>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>>>                   --rule data \
>>>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>>>   done
>>>>>>>>>>>
>>>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>>>
>>>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>>>> low number of PGs.
>>>>>>>>>>>
>>>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>>>
>>>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>>>> like this:
>>>>>>>>>>>
>>>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>>>
>>>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>>>
>>>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>>>
>>>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>>>
>>>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>>>
>>>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>>>
>>>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>>>
>>>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>>>> cases, please speak up :-)
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: revisiting uneven CRUSH distributions
  2017-05-27 10:03                     ` 攀刘
@ 2017-05-27 10:13                       ` Loic Dachary
  0 siblings, 0 replies; 37+ messages in thread
From: Loic Dachary @ 2017-05-27 10:13 UTC (permalink / raw)
  To: 攀刘; +Cc: Ceph Development

Hi,

On 05/27/2017 01:03 PM, 攀刘 wrote:
> Hi Loic,
> 
> I am also looking forward to this tool! Please let me know as soon as
> it is finished. For now, in our online production environment, I have
> to reweight multiple times to work around this uneven PG distribution.
> But I have to say, I still have a concern about primary affinity:
> 
> I used a script to dump the primary PGs of each OSD:
> 
> for a in $(ceph osd ls); do echo -n "$a: "; ceph pg ls-by-primary $a | grep -v pg_stat | wc -l; done
> 
> I tested it in a cluster with 228 OSDs and 8192 PGs; the maximum number
> of primary PGs on one OSD is 50 and the minimum is 21.
> 
> Will your tool also handle it?

Could you send me (privately) the output of "ceph report" for your cluster? I'll let you know if it is a good match.

Cheers
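
For reference, a minimal sketch of the same primary-PG check, assuming only the jewel-era commands already used in the one-liner quoted above (everything else, including the variable names, is illustrative):

    #!/bin/bash
    # Count primary PGs per OSD (the "pg_stat" header line is filtered out,
    # as in the quoted one-liner) and report the min/max spread.
    min=999999; max=-1
    for osd in $(ceph osd ls); do
        n=$(ceph pg ls-by-primary "$osd" | grep -vc pg_stat)
        echo "osd.$osd: $n primary PGs"
        (( n < min )) && min=$n
        (( n > max )) && max=$n
    done
    echo "min=$min max=$max spread=$((max - min))"

On the 228-OSD cluster described above this would report a spread of 29 (50 - 21).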

> 
> Thanks
> Pa
> 
> 
> 2017-05-23 3:00 GMT+08:00 Loic Dachary <loic@dachary.org>:
>> Hi Stefan,
>>
>> On 05/22/2017 09:44 PM, Stefan Priebe - Profihost AG wrote:
>>> Hello Loic,
>>>
>>> I want to optimize a crush map. What are the exact steps to achieve this?
>>>
>>> http://crush.readthedocs.io/en/latest/
>>> doesn't tell me about an optimization command.
>>
>> It's not published yet. I was hoping to finish it last week but ... I did something really stupid (early optimization :-). Fortunately I realized my mistake tonight while discussing the problem with a friend over a beer. Long story short: I'm optimistic about publishing something sensible in the next few days.
>>
>> If you send me the ceph report of the cluster you'd like to optimize, I'll make sure it works as expected. I've been using the ceph report you sent me last week as well; it has been very helpful.
>>
>> Cheers
>>
>>>
>>> Stefan
>>>
>>> Am 15.05.2017 um 21:08 schrieb Stefan Priebe - Profihost AG:
>>>> Hello Loic,
>>>>
>>>> Sounds good, but my initial question was whether this shouldn't be
>>>> integrated into ceph-deploy - so that when you add OSDs it also applies
>>>> the correct reweight?
>>>>
>>>> Greets,
>>>> Stefan
>>>>
>>>> Am 14.05.2017 um 19:46 schrieb Loic Dachary:
>>>>> Hi Stefan,
>>>>>
>>>>> A new python-crush[1] subcommand will be available next week that you could use to rebalance your clusters. You give it a crushmap and it optimizes the weights to fix the uneven distribution. It can produce a series of crushmaps, each with a small modification so that you can gradually improve the situation and better control how many PGs are moving.
>>>>>
>>>>> Would that be useful for the clusters you have?
>>>>>
>>>>> Cheers
>>>>>
>>>>> [1] http://crush.readthedocs.io/
>>>>>
>>>>> On 05/02/2017 09:32 AM, Loic Dachary wrote:
>>>>>>
>>>>>>
>>>>>> On 05/02/2017 07:43 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>> Hi Loic,
>>>>>>>
>>>>>>> Yes, I didn't change them to straw2 as I didn't see any difference. I
>>>>>>> have switched to straw2 now but it didn't change anything at all.
>>>>>>
>>>>>> straw vs straw2 is not responsible for the uneven distribution you're seeing. I meant to say that the optimization only works on straw2 buckets; it is not implemented for straw buckets.
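
As a minimal sketch (assuming the standard ceph/crushtool CLI and placeholder file names; straw2 requires hammer or later clients and OSDs), existing straw buckets can be converted to straw2 with the usual decompile/edit/recompile round-trip:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # rewrite every "alg straw" bucket to "alg straw2"
    sed -i 's/alg straw$/alg straw2/' crushmap.txt
    crushtool -c crushmap.txt -o crushmap-straw2.bin
    ceph osd setcrushmap -i crushmap-straw2.bin

Switching the bucket algorithm can itself move some PGs, so it is worth doing before computing optimized weights.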
>>>>>>
>>>>>>> If I use those weights manually, I'll have to adjust them on every crush
>>>>>>> change on the cluster? That's something I don't really like to do.
>>>>>>
>>>>>> This is not practical indeed :-) I'm hoping python-crush can automate that.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>> Greets,
>>>>>>> Stefan
>>>>>>>
>>>>>>> Am 02.05.2017 um 01:12 schrieb Loic Dachary:
>>>>>>>> It is working, with straw2 (your cluster is still using straw).
>>>>>>>>
>>>>>>>> For instance for one host it goes from:
>>>>>>>>
>>>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>>>> ~name~
>>>>>>>> osd.24         149        159                 6.65     10.0      6.71
>>>>>>>> osd.29         149        159                 6.65     10.0      6.71
>>>>>>>> osd.0           69         77                11.04      8.0     11.59
>>>>>>>> osd.2           69         69                -0.50      0.0      0.00
>>>>>>>> osd.42         149        148                -0.73     -1.0     -0.67
>>>>>>>> osd.1           69         62               -10.59     -7.0    -10.14
>>>>>>>> osd.23          69         62               -10.59     -7.0    -10.14
>>>>>>>> osd.36         149        132               -11.46    -17.0    -11.41
>>>>>>>>
>>>>>>>> to
>>>>>>>>
>>>>>>>>         ~expected~  ~objects~  ~over/under used %~  ~delta~  ~delta%~
>>>>>>>> ~name~
>>>>>>>> osd.0           69         69                -0.50      0.0      0.00
>>>>>>>> osd.23          69         69                -0.50      0.0      0.00
>>>>>>>> osd.24         149        149                -0.06      0.0      0.00
>>>>>>>> osd.29         149        149                -0.06      0.0      0.00
>>>>>>>> osd.36         149        149                -0.06      0.0      0.00
>>>>>>>> osd.1           69         68                -1.94     -1.0     -1.45
>>>>>>>> osd.2           69         68                -1.94     -1.0     -1.45
>>>>>>>> osd.42         149        147                -1.40     -2.0     -1.34
>>>>>>>>
>>>>>>>> By changing the weights to
>>>>>>>>
>>>>>>>> [0.6609248140022604, 0.9148542821020436, 0.8174711575190294, 0.8870680217468655, 1.6031393139865695, 1.5871079208467038, 1.8784764188501162, 1.7308530904776616]
>>>>>>>>
>>>>>>>> And you could set these weights on the crushmap; there would be no need for backporting.
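
As a minimal sketch of that manual approach (assuming the standard "ceph osd crush reweight" command, that the optimized values above have been rescaled to the same scale as the current crush weights, and a hypothetical weights.txt holding one "osd.N weight" pair per line):

    # weights.txt (hypothetical): e.g. "osd.0 0.66", one pair per line
    while read -r osd weight; do
        echo "setting crush weight of $osd to $weight"
        ceph osd crush reweight "$osd" "$weight"
    done < weights.txt

Applying the weights one OSD at a time also makes it easier to watch how many PGs move after each change.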
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/01/2017 08:06 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>> Am 01.05.2017 um 19:47 schrieb Loic Dachary:
>>>>>>>>>> Hi Stefan,
>>>>>>>>>>
>>>>>>>>>> On 05/01/2017 07:15 PM, Stefan Priebe - Profihost AG wrote:
>>>>>>>>>>> That sounds amazing! Is there any chance this will be backported to jewel?
>>>>>>>>>>
>>>>>>>>>> There should be ways to make that work with kraken and jewel. It may not even require a backport. If you know of a cluster with an uneven distribution, it would be great if you could send the crushmap so that I can test the algorithm. I'm still not sure this is the right solution and it would help confirm that.
>>>>>>>>>
>>>>>>>>> I've lots of them ;-)
>>>>>>>>>
>>>>>>>>> Will send you one via private e-mail in a few minutes.
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Greets,
>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>>>> Am 30.04.2017 um 16:15 schrieb Loic Dachary:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Ideally CRUSH distributes PGs evenly on OSDs so that they all fill in
>>>>>>>>>>>> the same proportion. If an OSD is 75% full, it is expected that all
>>>>>>>>>>>> other OSDs are also 75% full.
>>>>>>>>>>>>
>>>>>>>>>>>> In reality the distribution is even only when more than 100,000 PGs
>>>>>>>>>>>> are distributed in a pool of size 1 (i.e. no replication).
>>>>>>>>>>>>
>>>>>>>>>>>> In small clusters there are a few thousands PGs and it is not enough
>>>>>>>>>>>> to get an even distribution. Running the following with
>>>>>>>>>>>> python-crush[1], shows a 15% difference when distributing 1,000 PGs on
>>>>>>>>>>>> 6 devices. Only with 1,000,000 PGs does the difference drop under 1%.
>>>>>>>>>>>>
>>>>>>>>>>>>   for PGs in 1000 10000 100000 1000000 ; do
>>>>>>>>>>>>     crush analyze --replication-count 1 \
>>>>>>>>>>>>                   --type device \
>>>>>>>>>>>>                   --values-count $PGs \
>>>>>>>>>>>>                   --rule data \
>>>>>>>>>>>>                   --crushmap tests/sample-crushmap.json
>>>>>>>>>>>>   done
>>>>>>>>>>>>
>>>>>>>>>>>> In larger clusters, even though a greater number of PGs are
>>>>>>>>>>>> distributed, there are at most a few dozens devices per host and the
>>>>>>>>>>>> problem remains. On a machine with 24 OSDs each expected to handle a
>>>>>>>>>>>> few hundred PGs, a total of a few thousands PGs are distributed which
>>>>>>>>>>>> is not enough to get an even distribution.
>>>>>>>>>>>>
>>>>>>>>>>>> There is a secondary reason for the distribution to be uneven, when
>>>>>>>>>>>> there is more than one replica. The second replica must be on a
>>>>>>>>>>>> different device than the first replica. This conditional probability
>>>>>>>>>>>> is not taken into account by CRUSH and would create an uneven
>>>>>>>>>>>> distribution if more than 10,000 PGs were distributed per OSD[2]. But
>>>>>>>>>>>> a given OSD can only handle a few hundred PGs and this conditional
>>>>>>>>>>>> probability bias is dominated by the uneven distribution caused by the
>>>>>>>>>>>> low number of PGs.
>>>>>>>>>>>>
>>>>>>>>>>>> The uneven CRUSH distributions are always caused by a low number of
>>>>>>>>>>>> samples, even in large clusters. Since this noise (i.e. the difference
>>>>>>>>>>>> between the desired distribution and the actual distribution) is
>>>>>>>>>>>> random, it cannot be fixed by optimizations methods.  The
>>>>>>>>>>>> Nedler-Mead[3] simplex converges to a local minimum that is far from
>>>>>>>>>>>> the optimal minimum in many cases. Broyden–Fletcher–Goldfarb–Shanno[4]
>>>>>>>>>>>> fails to find a gradient that would allow it to converge faster. And
>>>>>>>>>>>> even if it did, the local minimum found would be as often wrong as
>>>>>>>>>>>> with Nedler-Mead, only it would go faster. A least mean squares
>>>>>>>>>>>> filter[5] is equally unable to suppress the noise created by the
>>>>>>>>>>>> uneven distribution because no coefficients can model a random noise.
>>>>>>>>>>>>
>>>>>>>>>>>> With that in mind, I implemented a simple optimization algorithm[6]
>>>>>>>>>>>> which was first suggested by Thierry Delamare a few weeks ago. It goes
>>>>>>>>>>>> like this:
>>>>>>>>>>>>
>>>>>>>>>>>>     - Distribute the desired number of PGs[7]
>>>>>>>>>>>>     - Subtract 1% of the weight of the OSD that is the most over used
>>>>>>>>>>>>     - Add the subtracted weight to the OSD that is the most under used
>>>>>>>>>>>>     - Repeat until the Kullback–Leibler divergence[8] is small enough
>>>>>>>>>>>>
>>>>>>>>>>>> Quoting Adam Kupczyk, this works because:
>>>>>>>>>>>>
>>>>>>>>>>>>   "...CRUSH is not random proces at all, it behaves in numerically
>>>>>>>>>>>>    stable way.  Specifically, if we increase weight on one node, we
>>>>>>>>>>>>    will get more PGs on this node and less on every other node:
>>>>>>>>>>>>    CRUSH([10.1, 10, 10, 5, 5]) -> [146(+3), 152, 156(-2), 70(-1), 76]"
>>>>>>>>>>>>
>>>>>>>>>>>> A nice side effect of this optimization algorithm is that it does not
>>>>>>>>>>>> change the weight of the bucket containing the items being
>>>>>>>>>>>> optimized. It is local to a bucket with no influence on the other
>>>>>>>>>>>> parts of the crushmap (modulo the conditional probability bias).
>>>>>>>>>>>>
>>>>>>>>>>>> In all tests the situation improves at least by an order of
>>>>>>>>>>>> magnitude. For instance when there is a 30% difference between two
>>>>>>>>>>>> OSDs, it is down to less than 3% after optimization.
>>>>>>>>>>>>
>>>>>>>>>>>> The tests for the optimization method can be run with
>>>>>>>>>>>>
>>>>>>>>>>>>    git clone -b wip-fix-2 http://libcrush.org/dachary/python-crush.git
>>>>>>>>>>>>    tox -e py27 -- -s -vv -k test_fix tests/test_analyze.py
>>>>>>>>>>>>
>>>>>>>>>>>> If anyone think of a reason why this algorithm won't work in some
>>>>>>>>>>>> cases, please speak up :-)
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> [1] python-crush http://crush.readthedocs.io/
>>>>>>>>>>>> [2] crush multipick anomaly http://marc.info/?l=ceph-devel&m=148539995928656&w=2
>>>>>>>>>>>> [3] Nedler-Mead https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
>>>>>>>>>>>> [4] L-BFGS-B https://docs.scipy.org/doc/scipy-0.18.1/reference/optimize.minimize-lbfgsb.html#optimize-minimize-lbfgsb
>>>>>>>>>>>> [5] Least mean squares filter https://en.wikipedia.org/wiki/Least_mean_squares_filter
>>>>>>>>>>>> [6] http://libcrush.org/dachary/python-crush/blob/c6af9bbcbef7123af84ee4d75d63dd1b967213a2/tests/test_analyze.py#L39
>>>>>>>>>>>> [7] Predicting Ceph PG placement http://dachary.org/?p=4020
>>>>>>>>>>>> [8] Kullback–Leibler divergence https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-05-27 10:14 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-30 14:15 revisiting uneven CRUSH distributions Loic Dachary
2017-05-01 17:15 ` Stefan Priebe - Profihost AG
2017-05-01 17:47   ` Loic Dachary
2017-05-01 18:06     ` Stefan Priebe - Profihost AG
2017-05-01 23:12       ` Loic Dachary
2017-05-02  5:43         ` Stefan Priebe - Profihost AG
2017-05-02  5:48           ` Stefan Priebe - Profihost AG
2017-05-02  6:29             ` Alexandre DERUMIER
2017-05-02  6:31               ` Stefan Priebe - Profihost AG
2017-05-02  6:43               ` Stefan Priebe - Profihost AG
2017-05-02  7:52                 ` Alexandre DERUMIER
2017-05-02  7:32           ` Loic Dachary
2017-05-14 17:46             ` Loic Dachary
2017-05-15 19:08               ` Stefan Priebe - Profihost AG
2017-05-15 20:35                 ` Loic Dachary
2017-05-16  6:15                   ` Stefan Priebe - Profihost AG
2017-05-16  8:14                     ` Loic Dachary
2017-05-22 18:44                 ` Stefan Priebe - Profihost AG
2017-05-22 19:00                   ` Loic Dachary
2017-05-27 10:03                     ` 攀刘
2017-05-27 10:13                       ` Loic Dachary
     [not found] ` <CABZ+qqnqiUFbz=6CegW_o=2goOThpmoskDQ0oOUfE27jW0D17A@mail.gmail.com>
2017-05-02 10:21   ` Loic Dachary
2017-05-02 10:39     ` Dan van der Ster
2017-05-06 13:21       ` Loic Dachary
     [not found]         ` <CAAXqJ+oTkwT4AP6U5BUBVLbkTPwcwo8rnK1ng-p3UroEHBDV2A@mail.gmail.com>
2017-05-07 19:31           ` SMARTER REWEIGHT-BY-UTILIZATION Loic Dachary
2017-05-08  3:34         ` revisiting uneven CRUSH distributions Spandan Kumar Sahu
2017-05-08  9:59           ` Spandan Kumar Sahu
2017-05-08 10:27             ` Loic Dachary
2017-05-08 11:36         ` Dan van der Ster
2017-05-08 12:14           ` Loic Dachary
2017-05-02 16:16 ` Loic Dachary
2017-05-03  9:35   ` Dan van der Ster
2017-05-03 16:50     ` Loic Dachary
2017-05-03 17:59       ` Dan van der Ster
2017-05-03 18:41         ` Loic Dachary
2017-05-04  1:14     ` Gregory Farnum
2017-05-05 14:49 ` Loic Dachary
