* FW: CRUSH optimization for unbalanced pg distribution
       [not found]   ` <alpine.DEB.2.00.1403190814520.316@cobra.newdream.net>
@ 2014-03-20  3:54     ` Zhang, Jian
  2014-09-09 13:36       ` Loic Dachary
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Jian @ 2014-03-20  3:54 UTC (permalink / raw)
  To: ceph-devel
  Cc: Sage Weil (sage@inktank.com),
	Mark Nelson (mark.nelson@inktank.com),
	Duan, Jiangang, Zhang, Jian

Forwarding per Sage's suggestion. 


-----Original Message-----
From: Sage Weil [mailto:sage@inktank.com] 
Sent: Wednesday, March 19, 2014 11:29 PM
To: Mark Nelson
Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
Subject: Re: CRUSH optimization for unbalanced pg distribution

On Wed, 19 Mar 2014, Mark Nelson wrote:
> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
> > For more detail data, please refer to the *Testing results* part.
> > 
> > *Optimization proposals: *
> > 
> > After we dived into the source code of CRUSH and related papers, we 
> > proposed two possible optimizations:
> > 
> > 1. Add different hash algorithms as an alternative to the Jenkins 
> > hash, e.g. an algorithm that produces even values when the range of 
> > the input value (pg#) is relatively small, or add a new bucket type 
> > at the same time if necessary.

This *might* work, but I don't have a strong intuition about it.  The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent imbalance for low values of n (num pgs in our case).  I have generally assumed we can't do better than "random" and still have the other properties we want (independent, deterministic placement), but it may be possible.

> > 
> > 2. Find a better replica placement strategy to replace the current retry
> > logic of crush_choose_firstn, which may cause CRUSH to behave badly.
> > 
> > We found there are several retry-count thresholds in the code:
> > choose_total_tries, choose_local_tries and choose_local_fallback_tries.
> > They are used to decide whether to do a retry_bucket, a retry_descent, or
> > use a permutation to do an exhaustive bucket search. We are wondering if
> > there is another retry strategy:
> > 
> > a) Backtracking retry. Currently the logic of crush_choose_firstn can only
> > issue a retry either from the initial bucket (retry_descent) or from the
> > current bucket (retry_bucket); how about retrying the intervening buckets?
> > 
> > b) Adjust the retry thresholds to other values. We did notice that
> > the 'optimal' crush tunables could be used for this, but we still
> > encountered unbalanced pg distribution when using the optimal strategy.
> > Please refer to 4 of the Testing results part.
> > 
> > c) Add a mechanism that can adjust the above-mentioned thresholds
> > adaptively. Maybe we can record the retry counts of previous calls to
> > CRUSH, and adjust the retry thresholds automatically according to that record.

I suggest ignoring all of this retry logic.  The original version of CRUSH 
has the local retries to try to make data move "less far", but when we 
went back a year ago and did a statistical analysis of the distribution we 
found that *all* of these hacks degraded the quality of the placement, and 
by turning them all off (setting the 'optimal' values, which zeroes them 
all out except for total_tries) we got something that was 
indistinguishable from a uniform distribution.

> > 3. Add soft links for pg directories. During pg creation, we can create
> > soft links for the pgs if the pg# on the selected osd is more than some
> > threshold, say 10% more than the desired average number, to move objects
> > that will be stored in these pgs to another osd. Balanced disk utilization
> > may be gained in this way.

I think you need to be careful, but yes, this is an option.  There is a 
similar exception mechanism in place that is used for other purposes and 
something similar could be done here.  The main challenge will be in 
ensuring that the soft links/exceptions follow the same overall policy 
that CRUSH does after the raw mapping is performed.  This is an option, 
but I would put it toward the bottom of the list...

> > 4. Change the placement strategy only for the step of selecting devices from
> > hosts. We found in our testing results that pg distribution was balanced
> > among hosts, which is reasonable since the pg# of each host is above 1K
> > (according to the current BKM that the pg# per osd should be about 100). So
> > how about we apply CRUSH only to the internal buckets and find another
> > simple but more balanced method to choose an osd from each host?

This idea has a lot of potential.  For example:

If you know the chassis can hold 12 disks, you can force the bucket size 
to twelve and somehow prevent users from adjusting the structure of the 
tree.  Then you can use a simple mapping that is truly flat (like a linear 
mapping, disk = x % num_disks) for that bucket/subtree.  The downside of 
course is that if you remove a disk *everything* reshuffles, hence some 
sort of guardrails to prevent a user from inadvertently doing that.  If a 
disk *does* fail, you just need to make sure the disk is marked "out" but 
not removed from the CRUSH hierarchy and the normal retry will kick in.

Note that all this is really doing is increasing the size of the "buckets" 
that we are (pseudo)randomly distributing over.  It is still a 
random/uniform distribution, but the N value is 12 times bigger (for a 12 
disk chassis) and as a result the variance is substantially lower.
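
To put rough numbers on the variance claim (a back-of-the-envelope model, 
assuming PG replicas land independently and uniformly; not a measurement 
from this thread): with m PG replicas spread over N targets, each target 
receives X ~ Binomial(m, 1/N) placements, so

  E[X] = m/N,   sd(X) \approx \sqrt{m/N},   sd(X)/E[X] \approx \sqrt{N/m}.

Grouping 12 disks into one chassis-sized bucket divides N by 12 while m 
stays fixed, so the relative spread per bucket drops by roughly 
\sqrt{12} \approx 3.5x.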

I would suggest making a new bucket type that is called 'linear' and does 
a simple modulo and trying this out.  We will need a bunch of additional 
safety checks to help users avoid doing silly things (like adjusting the 
number of items in the linear buckets, which reshuffles everything) but 
that wouldn't be needed for an initial analysis of the performance impact.

Do you mind if we shift this thread over to ceph-devel?  I think there are 
lots of people who would be interested in this discussion.  We can of 
course leave off your attachment if you prefer.

Thanks!
sage


* Re: FW: CRUSH optimization for unbalanced pg distribution
  2014-03-20  3:54     ` FW: CRUSH optimization for unbalanced pg distribution Zhang, Jian
@ 2014-09-09 13:36       ` Loic Dachary
  2014-09-10  0:56         ` FW: " Zhang, Jian
       [not found]         ` <51FC7A40FB29414D88A121A7FFEF9A4710F570D6@SHSMSX104.ccr.corp.intel.com>
  0 siblings, 2 replies; 9+ messages in thread
From: Loic Dachary @ 2014-09-09 13:36 UTC (permalink / raw)
  To: Zhang, Jian, ceph-devel

[-- Attachment #1: Type: text/plain, Size: 6270 bytes --]



On 20/03/2014 04:54, Zhang, Jian wrote:
> Forwarding per Sage's suggestion. 

Very interesting discussion :-) For the record the corresponding pull request is https://github.com/ceph/ceph/pull/2402

> [Sage's message quoted in full; trimmed here, see the original above]

-- 
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]


* FW: FW: CRUSH optimization for unbalanced pg distribution
  2014-09-09 13:36       ` Loic Dachary
@ 2014-09-10  0:56         ` Zhang, Jian
       [not found]         ` <51FC7A40FB29414D88A121A7FFEF9A4710F570D6@SHSMSX104.ccr.corp.intel.com>
  1 sibling, 0 replies; 9+ messages in thread
From: Zhang, Jian @ 2014-09-10  0:56 UTC (permalink / raw)
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw, ceph-devel-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 13789 bytes --]

Resending. 

Thanks
Jian


-----Original Message-----
From: Zhang, Jian 
Sent: Wednesday, September 10, 2014 7:16 AM
To: 'Loic Dachary'; ceph-devel@vger.kernel.org; Sage Weil <sweil@redhat.com> (sweil@redhat.com)
Cc: He, Yujie (yujie.he@intel.com)
Subject: RE: FW: CRUSH optimization for unbalanced pg distribution

Yujie sent out the following email yesterday, but it seems it was missed. Resending it. 

=============
Hi all,
   Several months ago we hit a read performance issue (17% degradation) while evaluating ceph object storage performance with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among the osd disks, which leads to unbalanced data distribution. Further investigation identified that CRUSH fails to map pgs evenly onto the osds. Please refer to the attached pdf (crush_proposals) for details.

Key Message:
   As mentioned in the attached pdf, we described possible optimization proposals for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage suggested we take the idea of "Change placement strategy only for step of selecting devices from hosts" by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve a balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module; please refer to the Design and Implementation part for details. We also wrote a POC for it (see the attached patch), and as a result we got more than 10% read performance improvement with the optimized CRUSH algorithm.

Design and Implementation:
1.   Problem Identification
1.1  Input key (pps) space of CRUSH is not uniform
    Since the PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes a raw pg to a pps.
1.2  Algorithm of selecting items from buckets is not uniform
    After obtaining a uniform input key space, we must also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.
2.   Design
2.1  New pps hash algorithm
    We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a bijection between the original sequence {0, ..., 2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers within the same range {0, ..., 2^N-1}. 
    Assume there are np PGs in a pool; we can regard pgid (0 ≤ pgid < 2^n, np ≤ 2^n < 2*np) as the key, which is then hashed into a pps value between 0 and 2^n-1. Since the PG# in a pool is usually a power of two, the generator simply shuffles the original pgid sequence in this case, making the key space a permutation of {0, ..., 2^n-1}, which achieves the best uniformity. Moreover, poolid can be used as a seed in the generator, producing different pps values even for the same pgid in different pools. Therefore, the pgid sequences of different pools are mapped to distinct pps sequences, getting rid of PG overlap.
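
    A minimal sketch of such a generator (it follows the shape of raw_pg_to_congruential_pps in the attached patch, which uses 1033 as the multiplier and folds poolid into the offset; not the exact patch code):

#include <stdint.h>

/*
 * For an odd multiplier A, pps = (A*pgid + seed) mod 2^n is a
 * bijection on {0, ..., 2^n - 1}: a power-of-two pg range is permuted
 * rather than folded, and poolid seeds the offset so different pools
 * get distinct pps sequences.
 */
static uint32_t pps_hash(uint32_t pgid, uint32_t poolid, unsigned n)
{
	const uint32_t A = 1033;	/* odd, hence invertible mod 2^n */
	uint32_t mask = (1u << n) - 1;	/* reduces mod 2^n */

	return (A * pgid + 2 * poolid + 1) & mask;
}

    With n = 11 (2048 pgs), pps_hash visits every value in {0, ..., 2047} exactly once as pgid runs over that range.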
2.2  New bucket type, Linear
    We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. The pps values assigned to each host are a pseudo-random subset of the original permutation and may themselves be non-uniform, in which case applying the modulo operation directly to the integers in the subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we scale the key by 1/balance_param before applying the modulo.
    For osd failure and recovery, we assume that items nested in this kind of bucket will not be removed and that no new items are added, the same as for the UNIFORM bucket. A linear host therefore does not introduce more data movement than a uniform bucket.
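
    The device-selection step can then be sketched as follows (mirroring bucket_linear_choose in the attached patch, which receives x/balance_param as its key; size is assumed nonzero):

#include <stdint.h>

/*
 * Scale the key by 1/balance_param, then take a flat modulo over the
 * bucket's items; the scaling spreads the residues of the pseudo-random
 * pps subset more evenly across the slots.
 */
static int linear_choose(const int *items, unsigned size,
			 uint32_t x, float balance_param)
{
	uint32_t key = (uint32_t)(x / balance_param);	/* scaled key */

	return items[key % size];	/* flat modulo mapping */
}

    Note the scaling only changes which residues appear, not their range; the adaptive step below picks the balance_param whose residues spread most evenly.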
2.3  Adaptive Strategy
    Since no single constant balance parameter yields the best PG distribution in all cases, we make the procedure adaptive: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, in order to obtain the most uniform distribution.
  1) Try different balance_param when preparing for a new pool
    - Iteratively call CRUSH(map, ruleno, x, balance_param) to get corresponding PG distribution with different balance_params
    - Calculate stdev of PG# among all osds
    - Choose the balance_param with the minimal stdev 
  2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
  The adaptive procedure can be described as follows:
Input: cluster map, total PG number m, adaptive retry times n
Output: local optimal balance parameter balance_param

min_pg_stdev = MAX;
balance_param = a;  // initial value
for trial from 0 to n {
    for pgid from 0 to m {
        calculate pps using the new generator in 2.1;
        for bucket b in cluster map // apply CRUSH algorithm
            apply the corresponding bucket hashing algorithm and get an osd list for pgid
    }
    calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
    if pg_stdev_a < min_pg_stdev {
        min_pg_stdev = pg_stdev_a;
        balance_param = a; 
    }
    adjust a to a new value;
}
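
For reference, the same search as self-contained C (the callback is a hypothetical stand-in for running CRUSH over every pgid with the trial parameter and counting PGs per osd; the attached patch's OSDMonitor::prepare_adaptive_balance_param tries the integer parameters 1..4):

#include <math.h>

typedef void (*pg_counts_fn)(int balance_param, int num_pgs, int *counts);

/*
 * Try each candidate balance_param, compute the stdev of the per-osd
 * PG counts it produces, and keep the flattest distribution.
 */
static float pick_balance_param(pg_counts_fn count_pgs,
				int num_osds, int num_pgs)
{
	float best = 1.0f, min_stdev = INFINITY;
	int k, i;

	for (k = 1; k < 5; k++) {
		int counts[num_osds];	/* PGs per osd for this trial */
		float mean = 0.0f, var = 0.0f, stdev;

		count_pgs(k, num_pgs, counts);
		for (i = 0; i < num_osds; i++)
			mean += counts[i];
		mean /= num_osds;
		for (i = 0; i < num_osds; i++)
			var += (counts[i] - mean) * (counts[i] - mean);
		stdev = sqrtf(var / num_osds);	/* population stdev */

		if (stdev < min_stdev) {
			min_stdev = stdev;
			best = (float)k;
		}
	}
	return best;
}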


Evaluation:
    We evaluated the performance of the optimized and the current CRUSH in a cluster consisting of 4 hosts, each with 10x 1TB disks. We designed two test cases: the first created a pool with 2048 pgs and 2 replicas, prepared 100 million 128KB objects, and then measured read performance on those objects; the second created a pool with 2048 pgs and 3 replicas, prepared 1 million 10MB objects, and then measured read performance on those objects.
    We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
1.   PG and data distribution are more balanced using the optimized CRUSH algorithm
  a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
  b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
2.   Large-scale performance is improved since data distribution is more balanced
  a) More than 10% performance improvement for 128K and 10M reads
  b) Write performance is not impacted
Detailed performance data can be found in the attached pdf (crush_optimization).

We also created a pull request: https://github.com/ceph/ceph/pull/2402

Thanks
Jian


-----Original Message-----
From: Loic Dachary [mailto:loic@dachary.org] 
Sent: Tuesday, September 09, 2014 9:36 PM
To: Zhang, Jian; ceph-devel@vger.kernel.org
Subject: Re: FW: CRUSH optimization for unbalanced pg distribution



On 20/03/2014 04:54, Zhang, Jian wrote:
> Forwarding per Sage's suggestion. 

Very interesting discussion :-) For the record the corresponding pull request is https://github.com/ceph/ceph/pull/2402

> [earlier thread quoted in full; trimmed here, see above]

--
Loïc Dachary, Artisan Logiciel Libre


[-- Attachment #2: adaptive-crush-modify.patch --]
[-- Type: application/octet-stream, Size: 34460 bytes --]

diff --git a/src/crush/CrushCompiler.cc b/src/crush/CrushCompiler.cc
index b52a55a..5ddb5e9 100644
--- a/src/crush/CrushCompiler.cc
+++ b/src/crush/CrushCompiler.cc
@@ -78,6 +78,7 @@ int CrushCompiler::decompile_bucket_impl(int i, ostream &out)
   bool dopos = false;
   switch (alg) {
   case CRUSH_BUCKET_UNIFORM:
+  case CRUSH_BUCKET_LINEAR:
     out << "\t# do not change bucket size (" << n << ") unnecessarily";
     dopos = true;
     break;
@@ -435,6 +436,8 @@ int CrushCompiler::parse_bucket(iter_t const& i)
 	alg = CRUSH_BUCKET_TREE;
       else if (a == "straw")
 	alg = CRUSH_BUCKET_STRAW;
+      else if (a == "linear")
+	alg = CRUSH_BUCKET_LINEAR;
       else {
 	err << "unknown bucket alg '" << a << "'" << std::endl << std::endl;
 	return -EINVAL;
@@ -512,7 +515,7 @@ int CrushCompiler::parse_bucket(iter_t const& i)
 	  assert(0);
 
       }
-      if (alg == CRUSH_BUCKET_UNIFORM) {
+      if (alg == CRUSH_BUCKET_UNIFORM || alg == CRUSH_BUCKET_LINEAR) {
 	if (!have_uniform_weight) {
 	  have_uniform_weight = true;
 	  uniform_weight = weight;
diff --git a/src/crush/CrushWrapper.cc b/src/crush/CrushWrapper.cc
index 9618d98..1366193 100644
--- a/src/crush/CrushWrapper.cc
+++ b/src/crush/CrushWrapper.cc
@@ -964,6 +964,10 @@ void CrushWrapper::encode(bufferlist& bl, bool lean) const
       }
       break;
 
+    case CRUSH_BUCKET_LINEAR:
+      ::encode(((crush_bucket_linear*)crush->buckets[i])->item_weight, bl);
+      break;
+
     default:
       assert(0);
       break;
@@ -1108,6 +1112,9 @@ void CrushWrapper::decode_crush_bucket(crush_bucket** bptr, bufferlist::iterator
   case CRUSH_BUCKET_STRAW:
     size = sizeof(crush_bucket_straw);
     break;
+  case CRUSH_BUCKET_LINEAR:
+    size = sizeof(crush_bucket_linear);
+    break;
   default:
     {
       char str[128];
@@ -1171,6 +1178,10 @@ void CrushWrapper::decode_crush_bucket(crush_bucket** bptr, bufferlist::iterator
     break;
   }
 
+  case CRUSH_BUCKET_LINEAR:
+    ::decode(((crush_bucket_linear*)bucket)->item_weight, blp);
+    break;
+
   default:
     // We should have handled this case in the first switch statement
     assert(0);
diff --git a/src/crush/CrushWrapper.h b/src/crush/CrushWrapper.h
index 56e3d16..f20fa1a 100644
--- a/src/crush/CrushWrapper.h
+++ b/src/crush/CrushWrapper.h
@@ -925,6 +925,20 @@ public:
       out[i] = rawout[i];
   }
 
+  void do_rule(int rule, int x, vector<int>& out, int maxout,
+	       const vector<__u32>& weight, float balance_param) const {
+    Mutex::Locker l(mapper_lock);
+    int rawout[maxout];
+    int scratch[maxout * 3];
+    int numrep = crush_do_rule_wrapper(crush, rule, x, rawout, maxout,
+					&weight[0], weight.size(), scratch, balance_param);
+    if (numrep < 0)
+      numrep = 0;
+    out.resize(numrep);
+    for (int i=0; i<numrep; i++)
+      out[i] = rawout[i];
+  }
+
   int read_from_file(const char *fn) {
     bufferlist bl;
     std::string error;
diff --git a/src/crush/builder.c b/src/crush/builder.c
index 41b90aa..8090473 100644
--- a/src/crush/builder.c
+++ b/src/crush/builder.c
@@ -517,6 +517,53 @@ err:
         return NULL;
 }
 
+/* linear bucket */
+
+struct crush_bucket_linear *
+crush_make_linear_bucket(int hash, int type, int size,
+			  int *items,
+			  int item_weight)
+{
+	int i;
+	struct crush_bucket_linear *bucket;
+
+	if (crush_multiplication_is_unsafe(size, item_weight)) {
+		//need more info
+		//printf("");
+                return NULL;
+	}
+
+	bucket = malloc(sizeof(*bucket));
+        if (!bucket)
+                return NULL;
+	memset(bucket, 0, sizeof(*bucket));
+	bucket->h.alg = CRUSH_BUCKET_LINEAR;
+	bucket->h.hash = hash;
+	bucket->h.type = type;
+	bucket->h.size = size;
+
+	bucket->h.weight = size * item_weight;
+	bucket->item_weight = item_weight;
+	bucket->h.items = malloc(sizeof(__s32) * size);
+
+        if (!bucket->h.items)
+                goto err;
+
+        bucket->h.perm = malloc(sizeof(__u32) * size);
+
+        if (!bucket->h.perm)
+                goto err;
+	for (i=0; i<size; i++)
+		bucket->h.items[i] = items[i];
+
+	return bucket;
+err:
+        free(bucket->h.perm);
+        free(bucket->h.items);
+        free(bucket);
+        return NULL;
+}
+
 
 
 struct crush_bucket*
@@ -542,6 +589,13 @@ crush_make_bucket(int alg, int hash, int type, int size,
 
 	case CRUSH_BUCKET_STRAW:
 		return (struct crush_bucket *)crush_make_straw_bucket(hash, type, size, items, weights);
+
+	case CRUSH_BUCKET_LINEAR:
+		if (size && weights)
+			item_weight = weights[0];
+		else
+			item_weight = 0;
+		return (struct crush_bucket *)crush_make_linear_bucket(hash, type, size, items, item_weight);
 	}
 	return 0;
 }
@@ -709,6 +763,33 @@ int crush_add_straw_bucket_item(struct crush_bucket_straw *bucket, int item, int
 	return crush_calc_straw(bucket);
 }
 
+int crush_add_linear_bucket_item(struct crush_bucket_linear *bucket, int item, int weight)
+{
+	int newsize = bucket->h.size + 1;
+	void *_realloc = NULL;
+
+	if ((_realloc = realloc(bucket->h.items, sizeof(__s32)*newsize)) == NULL) {
+		return -ENOMEM;
+	} else {
+		bucket->h.items = _realloc;
+	}
+	if ((_realloc = realloc(bucket->h.perm, sizeof(__u32)*newsize)) == NULL) {
+		return -ENOMEM;
+	} else {
+		bucket->h.perm = _realloc;
+	}
+
+	bucket->h.items[newsize-1] = item;
+
+        if (crush_addition_is_unsafe(bucket->h.weight, weight))
+                return -ERANGE;
+
+        bucket->h.weight += weight;
+        bucket->h.size++;
+
+        return 0;
+}
+
 int crush_bucket_add_item(struct crush_bucket *b, int item, int weight)
 {
 	/* invalidate perm cache */
@@ -723,6 +804,8 @@ int crush_bucket_add_item(struct crush_bucket *b, int item, int weight)
 		return crush_add_tree_bucket_item((struct crush_bucket_tree *)b, item, weight);
 	case CRUSH_BUCKET_STRAW:
 		return crush_add_straw_bucket_item((struct crush_bucket_straw *)b, item, weight);
+	case CRUSH_BUCKET_LINEAR:
+		return crush_add_linear_bucket_item((struct crush_bucket_linear *)b, item, weight);
 	default:
 		return -1;
 	}
@@ -921,6 +1004,36 @@ int crush_remove_straw_bucket_item(struct crush_bucket_straw *bucket, int item)
 	return crush_calc_straw(bucket);
 }
 
+int crush_remove_linear_bucket_item(struct crush_bucket_linear *bucket, int item)
+{
+	unsigned i, j;
+	int newsize;
+	void *_realloc = NULL;
+
+	for (i = 0; i < bucket->h.size; i++)
+		if (bucket->h.items[i] == item)
+			break;
+	if (i == bucket->h.size)
+		return -ENOENT;
+
+	for (j = i; j < bucket->h.size - 1; j++) /* shift items left */
+		bucket->h.items[j] = bucket->h.items[j+1];
+	newsize = --bucket->h.size;
+	bucket->h.weight -= bucket->item_weight;
+
+	if ((_realloc = realloc(bucket->h.items, sizeof(__s32)*newsize)) == NULL) {
+		return -ENOMEM;
+	} else {
+		bucket->h.items = _realloc;
+	}
+	if ((_realloc = realloc(bucket->h.perm, sizeof(__u32)*newsize)) == NULL) {
+		return -ENOMEM;
+	} else {
+		bucket->h.perm = _realloc;
+	}
+	return 0;
+}
+
 int crush_bucket_remove_item(struct crush_bucket *b, int item)
 {
 	/* invalidate perm cache */
@@ -935,6 +1048,8 @@ int crush_bucket_remove_item(struct crush_bucket *b, int item)
 		return crush_remove_tree_bucket_item((struct crush_bucket_tree *)b, item);
 	case CRUSH_BUCKET_STRAW:
 		return crush_remove_straw_bucket_item((struct crush_bucket_straw *)b, item);
+	case CRUSH_BUCKET_LINEAR:
+		return crush_remove_linear_bucket_item((struct crush_bucket_linear *)b, item);
 	default:
 		return -1;
 	}
@@ -1025,6 +1140,16 @@ int crush_adjust_straw_bucket_item_weight(struct crush_bucket_straw *bucket, int
 	return diff;
 }
 
+int crush_adjust_linear_bucket_item_weight(struct crush_bucket_linear *bucket, int item, int weight)
+{
+	int diff = (weight - bucket->item_weight) * bucket->h.size;
+
+	bucket->item_weight = weight;
+	bucket->h.weight = bucket->item_weight * bucket->h.size;
+
+	return diff;
+}
+
 int crush_bucket_adjust_item_weight(struct crush_bucket *b, int item, int weight)
 {
 	switch (b->alg) {
@@ -1040,6 +1165,9 @@ int crush_bucket_adjust_item_weight(struct crush_bucket *b, int item, int weight
 	case CRUSH_BUCKET_STRAW:
 		return crush_adjust_straw_bucket_item_weight((struct crush_bucket_straw *)b,
 							     item, weight);
+	case CRUSH_BUCKET_LINEAR:
+		return crush_adjust_linear_bucket_item_weight((struct crush_bucket_linear *)b,
+							     item, weight);
 	default:
 		return -1;
 	}
@@ -1144,6 +1272,34 @@ static int crush_reweight_straw_bucket(struct crush_map *crush, struct crush_buc
 	return 0;
 }
 
+static int crush_reweight_linear_bucket(struct crush_map *crush, struct crush_bucket_linear *bucket)
+{
+	unsigned i;
+	unsigned sum = 0, n = 0, leaves = 0;
+
+	for (i = 0; i < bucket->h.size; i++) {
+		int id = bucket->h.items[i];
+		if (id < 0) {
+			struct crush_bucket *c = crush->buckets[-1-id];
+			crush_reweight_bucket(crush, c);
+
+			if (crush_addition_is_unsafe(sum, c->weight))
+                                return -ERANGE;
+
+			sum += c->weight;
+			n++;
+		} else {
+			leaves++;
+		}
+	}
+
+	if (n > leaves)
+		bucket->item_weight = sum / n;  // more bucket children than leaves, average!
+	bucket->h.weight = bucket->item_weight * bucket->h.size;
+
+	return 0;
+}
+
 int crush_reweight_bucket(struct crush_map *crush, struct crush_bucket *b)
 {
 	switch (b->alg) {
@@ -1155,6 +1311,8 @@ int crush_reweight_bucket(struct crush_map *crush, struct crush_bucket *b)
 		return crush_reweight_tree_bucket(crush, (struct crush_bucket_tree *)b);
 	case CRUSH_BUCKET_STRAW:
 		return crush_reweight_straw_bucket(crush, (struct crush_bucket_straw *)b);
+	case CRUSH_BUCKET_LINEAR:
+		return crush_reweight_linear_bucket(crush, (struct crush_bucket_linear *)b);
 	default:
 		return -1;
 	}
diff --git a/src/crush/builder.h b/src/crush/builder.h
index 1003c35..13c7660 100644
--- a/src/crush/builder.h
+++ b/src/crush/builder.h
@@ -39,5 +39,9 @@ struct crush_bucket_straw *
 crush_make_straw_bucket(int hash, int type, int size,
 			int *items,
 			int *weights);
+struct crush_bucket_linear *
+crush_make_linear_bucket(int hash, int type, int size,
+			  int *items,
+			  int item_weight);
 
 #endif
diff --git a/src/crush/crush.c b/src/crush/crush.c
index 519793a..8e7841a 100644
--- a/src/crush/crush.c
+++ b/src/crush/crush.c
@@ -18,6 +18,7 @@ const char *crush_bucket_alg_name(int alg)
 	case CRUSH_BUCKET_LIST: return "list";
 	case CRUSH_BUCKET_TREE: return "tree";
 	case CRUSH_BUCKET_STRAW: return "straw";
+	case CRUSH_BUCKET_LINEAR: return "linear";
 	default: return "unknown";
 	}
 }
@@ -41,6 +42,8 @@ int crush_get_bucket_item_weight(const struct crush_bucket *b, int p)
 		return ((struct crush_bucket_tree *)b)->node_weights[crush_calc_tree_node(p)];
 	case CRUSH_BUCKET_STRAW:
 		return ((struct crush_bucket_straw *)b)->item_weights[p];
+	case CRUSH_BUCKET_LINEAR:
+		return ((struct crush_bucket_linear *)b)->item_weight;
 	}
 	return 0;
 }
@@ -78,6 +81,13 @@ void crush_destroy_bucket_straw(struct crush_bucket_straw *b)
 	kfree(b);
 }
 
+void crush_destroy_bucket_linear(struct crush_bucket_linear *b)
+{
+	kfree(b->h.perm);
+	kfree(b->h.items);
+	kfree(b);
+}
+
 void crush_destroy_bucket(struct crush_bucket *b)
 {
 	switch (b->alg) {
@@ -93,6 +103,9 @@ void crush_destroy_bucket(struct crush_bucket *b)
 	case CRUSH_BUCKET_STRAW:
 		crush_destroy_bucket_straw((struct crush_bucket_straw *)b);
 		break;
+	case CRUSH_BUCKET_LINEAR:
+		crush_destroy_bucket_linear((struct crush_bucket_linear *)b);
+		break;
 	}
 }
 
diff --git a/src/crush/crush.h b/src/crush/crush.h
index 322d16c..e881fb2 100644
--- a/src/crush/crush.h
+++ b/src/crush/crush.h
@@ -112,7 +112,8 @@ enum {
 	CRUSH_BUCKET_UNIFORM = 1,
 	CRUSH_BUCKET_LIST = 2,
 	CRUSH_BUCKET_TREE = 3,
-	CRUSH_BUCKET_STRAW = 4
+	CRUSH_BUCKET_STRAW = 4,
+	CRUSH_BUCKET_LINEAR = 5
 };
 extern const char *crush_bucket_alg_name(int alg);
 
@@ -159,7 +160,10 @@ struct crush_bucket_straw {
 	__u32 *straws;         /* 16-bit fixed point */
 };
 
-
+struct crush_bucket_linear {
+	struct crush_bucket h;
+	__u32 item_weight;  /* 16-bit fixed point; all items equally weighted */
+};
 
 /*
  * CRUSH map includes all buckets, rules, etc.
@@ -203,6 +207,7 @@ extern void crush_destroy_bucket_uniform(struct crush_bucket_uniform *b);
 extern void crush_destroy_bucket_list(struct crush_bucket_list *b);
 extern void crush_destroy_bucket_tree(struct crush_bucket_tree *b);
 extern void crush_destroy_bucket_straw(struct crush_bucket_straw *b);
+extern void crush_destroy_bucket_linear(struct crush_bucket_linear *b);
 extern void crush_destroy_bucket(struct crush_bucket *b);
 extern void crush_destroy_rule(struct crush_rule *r);
 extern void crush_destroy(struct crush_map *map);
diff --git a/src/crush/grammar.h b/src/crush/grammar.h
index 42b0b8e..3dd75b0 100644
--- a/src/crush/grammar.h
+++ b/src/crush/grammar.h
@@ -116,7 +116,8 @@ struct crush_grammar : public grammar<crush_grammar>
       bucket_alg = str_p("alg") >> ( str_p("uniform") |
 				     str_p("list") |
 				     str_p("tree") |
-				     str_p("straw") );
+				     str_p("straw")|
+				     str_p("linear"));
       bucket_hash = str_p("hash") >> ( integer |
 				       str_p("rjenkins1") );
       bucket_item = str_p("item") >> name
diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index e610f31..6ce477e 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -239,6 +239,13 @@ static int bucket_straw_choose(struct crush_bucket_straw *bucket,
 	return bucket->h.items[high];
 }
 
+static int bucket_linear_choose(struct crush_bucket_linear *bucket,
+			       int x, int r)
+{
+	unsigned int item = x%bucket->h.size;
+	return bucket->h.items[item];
+}
+
 static int crush_bucket_choose(struct crush_bucket *in, int x, int r)
 {
 	dprintk(" crush_bucket_choose %d x=%d r=%d\n", in->id, x, r);
@@ -262,6 +269,32 @@ static int crush_bucket_choose(struct crush_bucket *in, int x, int r)
 	}
 }
 
+static int crush_bucket_choose_wrapper(struct crush_bucket *in, int x, int r, float balance_param)
+{
+	dprintk(" crush_bucket_choose %d x=%d r=%d\n", in->id, x, r);
+	BUG_ON(in->size == 0);
+	switch (in->alg) {
+	case CRUSH_BUCKET_UNIFORM:
+		return bucket_uniform_choose((struct crush_bucket_uniform *)in,
+					  x, r);
+	case CRUSH_BUCKET_LIST:
+		return bucket_list_choose((struct crush_bucket_list *)in,
+					  x, r);
+	case CRUSH_BUCKET_TREE:
+		return bucket_tree_choose((struct crush_bucket_tree *)in,
+					  x, r);
+	case CRUSH_BUCKET_STRAW:
+		return bucket_straw_choose((struct crush_bucket_straw *)in,
+					   x, r);
+	case CRUSH_BUCKET_LINEAR:
+		return bucket_linear_choose((struct crush_bucket_linear *)in,
+					  x/balance_param, r);
+	default:
+		dprintk("unknown bucket %d alg %d\n", in->id, in->alg);
+		return in->items[0];
+	}
+}
+
 /*
  * true if device is marked "out" (failed, fully offloaded)
  * of the cluster
@@ -473,6 +506,166 @@ reject:
 	return outpos;
 }
 
+static int crush_choose_firstn_wrapper(const struct crush_map *map,
+			       struct crush_bucket *bucket,
+			       const __u32 *weight, int weight_max,
+			       int x, int numrep, int type,
+			       int *out, int outpos,
+			       unsigned int tries,
+			       unsigned int recurse_tries,
+			       unsigned int local_retries,
+			       unsigned int local_fallback_retries,
+			       int recurse_to_leaf,
+			       int *out2, float balance_param)
+{
+	int rep;
+	unsigned int ftotal, flocal;
+	int retry_descent, retry_bucket, skip_rep;
+	struct crush_bucket *in = bucket;
+	int r;
+	int i;
+	int item = 0;
+	int itemtype;
+	int collide, reject;
+
+	dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d\n", recurse_to_leaf ? "_LEAF" : "",
+		bucket->id, x, outpos, numrep);
+
+	for (rep = outpos; rep < numrep; rep++) {
+		/* keep trying until we get a non-out, non-colliding item */
+		ftotal = 0;
+		skip_rep = 0;
+		do {
+			retry_descent = 0;
+			in = bucket;               /* initial bucket */
+
+			/* choose through intervening buckets */
+			flocal = 0;
+			do {
+				collide = 0;
+				retry_bucket = 0;
+				r = rep;
+				/* r' = r + f_total */
+				r += ftotal;
+
+				/* bucket choose */
+				if (in->size == 0) {
+					reject = 1;
+					goto reject;
+				}
+				if (local_fallback_retries > 0 &&
+				    flocal >= (in->size>>1) &&
+				    flocal > local_fallback_retries)
+					item = bucket_perm_choose(in, x, r);
+				else
+					item = crush_bucket_choose_wrapper(in, x, r, balance_param);
+				if (item >= map->max_devices) {
+					dprintk("   bad item %d\n", item);
+					skip_rep = 1;
+					break;
+				}
+
+				/* desired type? */
+				if (item < 0)
+					itemtype = map->buckets[-1-item]->type;
+				else
+					itemtype = 0;
+				dprintk("  item %d type %d\n", item, itemtype);
+
+				/* keep going? */
+				if (itemtype != type) {
+					if (item >= 0 ||
+					    (-1-item) >= map->max_buckets) {
+						dprintk("   bad item type %d\n", type);
+						skip_rep = 1;
+						break;
+					}
+					in = map->buckets[-1-item];
+					retry_bucket = 1;
+					continue;
+				}
+
+				/* collision? */
+				for (i = 0; i < outpos; i++) {
+					if (out[i] == item) {
+						collide = 1;
+						break;
+					}
+				}
+
+				reject = 0;
+				if (!collide && recurse_to_leaf) {
+					if (item < 0) {
+						if (crush_choose_firstn_wrapper(map,
+							 map->buckets[-1-item],
+							 weight, weight_max,
+							 x, outpos+1, 0,
+							 out2, outpos,
+							 recurse_tries, 0,
+							 local_retries,
+							 local_fallback_retries,
+							 0,
+							 NULL, balance_param) <= outpos)
+							/* didn't get leaf */
+							reject = 1;
+					} else {
+						/* we already have a leaf! */
+						out2[outpos] = item;
+					}
+				}
+
+				if (!reject) {
+					/* out? */
+					if (itemtype == 0)
+						reject = is_out(map, weight,
+								weight_max,
+								item, x);
+					else
+						reject = 0;
+				}
+
+reject:
+				if (reject || collide) {
+					ftotal++;
+					flocal++;
+
+					if (collide && flocal <= local_retries)
+						/* retry locally a few times */
+						retry_bucket = 1;
+					else if (local_fallback_retries > 0 &&
+						 flocal <= in->size + local_fallback_retries)
+						/* exhaustive bucket search */
+						retry_bucket = 1;
+					else if (ftotal < tries)
+						/* then retry descent */
+						retry_descent = 1;
+					else
+						/* else give up */
+						skip_rep = 1;
+					dprintk("  reject %d  collide %d  "
+						"ftotal %u  flocal %u\n",
+						reject, collide, ftotal,
+						flocal);
+				}
+			} while (retry_bucket);
+		} while (retry_descent);
+
+		if (skip_rep) {
+			dprintk("skip rep\n");
+			continue;
+		}
+
+		dprintk("CHOOSE got %d\n", item);
+		out[outpos] = item;
+		outpos++;
+
+		if (map->choose_tries && ftotal <= map->choose_total_tries)
+			map->choose_tries[ftotal]++;
+	}
+
+	dprintk("CHOOSE returns %d\n", outpos);
+	return outpos;
+}
 
 /**
  * crush_choose_indep: alternative breadth-first positionally stable mapping
@@ -848,4 +1041,177 @@ int crush_do_rule(const struct crush_map *map,
 	return result_len;
 }
 
+int crush_do_rule_wrapper(const struct crush_map *map,
+		  int ruleno, int x, int *result, int result_max,
+		  const __u32 *weight, int weight_max,
+		  int *scratch,
+		  float balance_param)
+{
+	int result_len;
+	int *a = scratch;
+	int *b = scratch + result_max;
+	int *c = scratch + result_max*2;
+	int recurse_to_leaf;
+	int *w;
+	int wsize = 0;
+	int *o;
+	int osize;
+	int *tmp;
+	struct crush_rule *rule;
+	__u32 step;
+	int i, j;
+	int numrep;
+	/*
+	 * the original choose_total_tries value was off by one (it
+	 * counted "retries" and not "tries").  add one.
+	 */
+	int choose_tries = map->choose_total_tries + 1;
+	int choose_leaf_tries = 0;
+	/*
+	 * the local tries values were counted as "retries", though,
+	 * and need no adjustment
+	 */
+	int choose_local_retries = map->choose_local_tries;
+	int choose_local_fallback_retries = map->choose_local_fallback_tries;
+
+	if ((__u32)ruleno >= map->max_rules) {
+		dprintk(" bad ruleno %d\n", ruleno);
+		return 0;
+	}
+
+	rule = map->rules[ruleno];
+	result_len = 0;
+	w = a;
+	o = b;
+
+	for (step = 0; step < rule->len; step++) {
+		int firstn = 0;
+		struct crush_rule_step *curstep = &rule->steps[step];
+
+		switch (curstep->op) {
+		case CRUSH_RULE_TAKE:
+			w[0] = curstep->arg1;
+			wsize = 1;
+			break;
+
+		case CRUSH_RULE_SET_CHOOSE_TRIES:
+			if (curstep->arg1 > 0)
+				choose_tries = curstep->arg1;
+			break;
+
+		case CRUSH_RULE_SET_CHOOSELEAF_TRIES:
+			if (curstep->arg1 > 0)
+				choose_leaf_tries = curstep->arg1;
+			break;
+
+		case CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES:
+			if (curstep->arg1 >= 0)
+				choose_local_retries = curstep->arg1;
+			break;
+
+		case CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES:
+			if (curstep->arg1 >= 0)
+				choose_local_fallback_retries = curstep->arg1;
+			break;
+
+		case CRUSH_RULE_CHOOSELEAF_FIRSTN:
+		case CRUSH_RULE_CHOOSE_FIRSTN:
+			firstn = 1;
+			/* fall through */
+		case CRUSH_RULE_CHOOSELEAF_INDEP:
+		case CRUSH_RULE_CHOOSE_INDEP:
+			if (wsize == 0)
+				break;
+
+			recurse_to_leaf =
+				curstep->op ==
+				 CRUSH_RULE_CHOOSELEAF_FIRSTN ||
+				curstep->op ==
+				CRUSH_RULE_CHOOSELEAF_INDEP;
+
+			/* reset output */
+			osize = 0;
+
+			for (i = 0; i < wsize; i++) {
+				/*
+				 * see CRUSH_N, CRUSH_N_MINUS macros.
+				 * basically, numrep <= 0 means relative to
+				 * the provided result_max
+				 */
+				numrep = curstep->arg1;
+				if (numrep <= 0) {
+					numrep += result_max;
+					if (numrep <= 0)
+						continue;
+				}
+				j = 0;
+				if (firstn) {
+					int recurse_tries;
+					if (choose_leaf_tries)
+						recurse_tries =
+							choose_leaf_tries;
+					else if (map->chooseleaf_descend_once)
+						recurse_tries = 1;
+					else
+						recurse_tries = choose_tries;
+					osize += crush_choose_firstn_wrapper(
+						map,
+						map->buckets[-1-w[i]],
+						weight, weight_max,
+						x, numrep,
+						curstep->arg2,
+						o+osize, j,
+						choose_tries,
+						recurse_tries,
+						choose_local_retries,
+						choose_local_fallback_retries,
+						recurse_to_leaf,
+						c+osize,
+						balance_param);
+				} else {
+					crush_choose_indep(
+						map,
+						map->buckets[-1-w[i]],
+						weight, weight_max,
+						x, numrep, numrep,
+						curstep->arg2,
+						o+osize, j,
+						choose_tries,
+						choose_leaf_tries ?
+						   choose_leaf_tries : 1,
+						recurse_to_leaf,
+						c+osize,
+						0);
+					osize += numrep;
+				}
+			}
+
+			if (recurse_to_leaf)
+				/* copy final _leaf_ values to output set */
+				memcpy(o, c, osize*sizeof(*o));
+
+			/* swap o and w arrays */
+			tmp = o;
+			o = w;
+			w = tmp;
+			wsize = osize;
+			break;
+
+
+		case CRUSH_RULE_EMIT:
+			for (i = 0; i < wsize && result_len < result_max; i++) {
+				result[result_len] = w[i];
+				result_len++;
+			}
+			wsize = 0;
+			break;
+
+		default:
+			dprintk(" unknown op %d at step %d\n",
+				curstep->op, step);
+			break;
+		}
+	}
+	return result_len;
+}
 
diff --git a/src/crush/mapper.h b/src/crush/mapper.h
index 5dfd5b1..0fa1b88 100644
--- a/src/crush/mapper.h
+++ b/src/crush/mapper.h
@@ -16,5 +16,10 @@ extern int crush_do_rule(const struct crush_map *map,
 			 int x, int *result, int result_max,
 			 const __u32 *weights, int weight_max,
 			 int *scratch);
-
+extern int crush_do_rule_wrapper(const struct crush_map *map,
+			 int ruleno,
+			 int x, int *result, int result_max,
+			 const __u32 *weights, int weight_max,
+			 int *scratch,
+			 float balance_param);
 #endif
diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index dd729f0..eb56bab 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -66,6 +66,27 @@ static ostream& _prefix(std::ostream *_dout, Monitor *mon, OSDMap& osdmap) {
 		<< ").osd e" << osdmap.get_epoch() << " ";
 }
 
+float cal_average(vector<int> dataset, int length) {
+  float avg = 0;
+  float sum = 0;
+  for(int i = 0; i < length; i++){
+    sum += dataset[i];
+  }
+  avg = sum/length;
+  return avg;
+}
+
+float cal_stdev(vector<int> dataset, int length) {
+  float avg = cal_average(dataset, length);
+  float sum = 0;
+  float stdev;
+  for(int i = 0; i < length; i++){
+    sum += pow(dataset[i] - avg, 2);
+  }
+  stdev = pow(sum/length, 0.5);
+  return stdev;
+}
+
 bool OSDMonitor::_have_pending_crush()
 {
   return pending_inc.crush.length();
@@ -3150,6 +3171,43 @@ int OSDMonitor::prepare_new_pool(MPoolOp *m)
 			    pg_pool_t::TYPE_REPLICATED, 0, ss);
 }
 
+void OSDMonitor::prepare_adaptive_balance_param(pg_pool_t *pi, int64_t pool)
+{
+  float min_stdev = 999999;
+  float balance_param = 1;
+  
+  for (int k = 1; k < 5; k++) {
+    map<int, int> osd_total_pgs;
+    pi->set_balance_param((float)k);
+    for (int i = 0; i < (int)pi->get_pg_num(); i++) {
+      pg_t pg(i, pool);
+      pg.set_pool(pool);
+      vector<int> acting;
+      int nrep = osdmap.pg_to_up_osds_adaptive(*pi, pg, acting);
+      if (nrep) {
+        for (int j = 0; j < nrep; j++) {
+          osd_total_pgs[acting[j]]++;
+        }
+      }
+    }
+	  
+    vector<int> pg_num;
+    for (map<int, int>::iterator ptr = osd_total_pgs.begin();
+	   ptr != osd_total_pgs.end();
+	   ++ptr) {
+      pg_num.push_back(ptr->second);
+    }
+	  
+    float stdev = cal_stdev(pg_num, pg_num.size());
+    if (stdev < min_stdev) {
+      min_stdev = stdev;
+      balance_param = k;
+    } 
+  }
+  
+  pi->set_balance_param(balance_param);
+}
+
 int OSDMonitor::crush_ruleset_create_erasure(const string &name,
 					     const string &profile,
 					     int *ruleset,
@@ -3527,6 +3585,7 @@ int OSDMonitor::prepare_new_pool(string& name, uint64_t auid,
     g_conf->osd_pool_default_cache_target_full_ratio * 1000000;
   pi->cache_min_flush_age = g_conf->osd_pool_default_cache_min_flush_age;
   pi->cache_min_evict_age = g_conf->osd_pool_default_cache_min_evict_age;
+  prepare_adaptive_balance_param(pi, pool);
   pending_inc.new_pool_names[pool] = name;
   return 0;
 }
diff --git a/src/mon/OSDMonitor.h b/src/mon/OSDMonitor.h
index 650c55e..eabe06d 100644
--- a/src/mon/OSDMonitor.h
+++ b/src/mon/OSDMonitor.h
@@ -253,6 +253,7 @@ private:
   bool prepare_pool_op (MPoolOp *m);
   bool prepare_pool_op_create (MPoolOp *m);
   bool prepare_pool_op_delete(MPoolOp *m);
+  void prepare_adaptive_balance_param(pg_pool_t *pi, int64_t pool);
   int crush_ruleset_create_erasure(const string &name,
 				   const string &profile,
 				   int *ruleset,
diff --git a/src/osd/OSDMap.cc b/src/osd/OSDMap.cc
index ada0909..118f1ce 100644
--- a/src/osd/OSDMap.cc
+++ b/src/osd/OSDMap.cc
@@ -1419,13 +1419,13 @@ int OSDMap::_pg_to_osds(const pg_pool_t& pool, pg_t pg,
 			ps_t *ppps) const
 {
   // map to osds[]
-  ps_t pps = pool.raw_pg_to_pps(pg);  // placement ps
+  ps_t pps = pool.raw_pg_to_congruential_pps(pg);  // placement ps
   unsigned size = pool.get_size();
 
   // what crush rule?
   int ruleno = crush->find_rule(pool.get_crush_ruleset(), pool.get_type(), size);
   if (ruleno >= 0)
-    crush->do_rule(ruleno, pps, *osds, size, osd_weight);
+    crush->do_rule(ruleno, pps, *osds, size, osd_weight, pool.get_balance_param());
 
   _remove_nonexistent_osds(pool, *osds);
 
@@ -1583,6 +1583,15 @@ void OSDMap::pg_to_raw_up(pg_t pg, vector<int> *up, int *primary) const
   _raw_to_up_osds(*pool, raw, up, primary);
   _apply_primary_affinity(pps, *pool, up, primary);
 }
+
+int OSDMap::pg_to_up_osds_adaptive(const pg_pool_t& pool, pg_t pg, vector<int>& up) const
+{
+  int primary;
+  vector<int> raw;
+  _pg_to_osds(pool, pg, &raw, &primary, NULL);
+  _raw_to_up_osds(pool, raw, &up, &primary);
+  return up.size();
+}
   
 void OSDMap::_pg_to_up_acting_osds(const pg_t& pg, vector<int> *up, int *up_primary,
                                    vector<int> *acting, int *acting_primary) const
diff --git a/src/osd/OSDMap.h b/src/osd/OSDMap.h
index 983a44d..5f1addd 100644
--- a/src/osd/OSDMap.h
+++ b/src/osd/OSDMap.h
@@ -642,6 +642,7 @@ public:
     int r = pg_to_acting_osds(pg, &acting, &primary);
     return r;
   }
+  int pg_to_up_osds_adaptive(const pg_pool_t& pool, pg_t pg, vector<int>& up) const;
   /**
    * This does not apply temp overrides and should not be used
    * by anybody for data mapping purposes. Specify both pointers.
diff --git a/src/osd/osd_types.cc b/src/osd/osd_types.cc
index 1dc2af1..5017005 100644
--- a/src/osd/osd_types.cc
+++ b/src/osd/osd_types.cc
@@ -770,6 +770,7 @@ void pg_pool_t::dump(Formatter *f) const
   f->dump_int("min_size", get_min_size());
   f->dump_int("crush_ruleset", get_crush_ruleset());
   f->dump_int("object_hash", get_object_hash());
+  f->dump_float("balance_param", get_balance_param());
   f->dump_int("pg_num", get_pg_num());
   f->dump_int("pg_placement_num", get_pgp_num());
   f->dump_unsigned("crash_replay_interval", get_crash_replay_interval());
@@ -993,6 +994,23 @@ ps_t pg_pool_t::raw_pg_to_pps(pg_t pg) const
   }
 }
 
+ps_t pg_pool_t::raw_pg_to_congruential_pps(pg_t pg) const
+{
+  if (flags & FLAG_HASHPSPOOL) {
+    // Shuffles the original pgid sequence, with poolid being the seed
+    ps_t stable = ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask);
+    ps_t pps = ((1033*stable + 2*pg.pool() + 1) & pgp_num_mask) + pg.pool();
+    return pps;
+  } else {
+    // Legacy behavior; add ps and pool together.  This is not a great
+    // idea because the PGs from each pool will essentially overlap on
+    // top of each other: 0.5 == 1.4 == 2.3 == ...
+    return
+      ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask) +
+      pg.pool();
+  }
+}
+
 uint32_t pg_pool_t::get_random_pg_position(pg_t pg, uint32_t seed) const
 {
   uint32_t r = crush_hash32_2(CRUSH_HASH_RJENKINS1, seed, 123);
@@ -1020,6 +1038,7 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
     ::encode(size, bl);
     ::encode(crush_ruleset, bl);
     ::encode(object_hash, bl);
+    ::encode(balance_param, bl);
     ::encode(pg_num, bl);
     ::encode(pgp_num, bl);
     __u32 lpg_num = 0, lpgp_num = 0;  // tell old code that there are no localized pgs.
@@ -1048,6 +1067,7 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
     ::encode(size, bl);
     ::encode(crush_ruleset, bl);
     ::encode(object_hash, bl);
+    ::encode(balance_param, bl);
     ::encode(pg_num, bl);
     ::encode(pgp_num, bl);
     __u32 lpg_num = 0, lpgp_num = 0;  // tell old code that there are no localized pgs.
@@ -1075,6 +1095,7 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
     ::encode(size, bl);
     ::encode(crush_ruleset, bl);
     ::encode(object_hash, bl);
+    ::encode(balance_param, bl);
     ::encode(pg_num, bl);
     ::encode(pgp_num, bl);
     __u32 lpg_num = 0, lpgp_num = 0;  // tell old code that there are no localized pgs.
@@ -1118,6 +1139,7 @@ void pg_pool_t::encode(bufferlist& bl, uint64_t features) const
   ::encode(size, bl);
   ::encode(crush_ruleset, bl);
   ::encode(object_hash, bl);
+  ::encode(balance_param, bl);
   ::encode(pg_num, bl);
   ::encode(pgp_num, bl);
   __u32 lpg_num = 0, lpgp_num = 0;  // tell old code that there are no localized pgs.
@@ -1165,6 +1187,7 @@ void pg_pool_t::decode(bufferlist::iterator& bl)
   ::decode(size, bl);
   ::decode(crush_ruleset, bl);
   ::decode(object_hash, bl);
+  ::decode(balance_param, bl);
   ::decode(pg_num, bl);
   ::decode(pgp_num, bl);
   {
@@ -1286,6 +1309,7 @@ void pg_pool_t::generate_test_instances(list<pg_pool_t*>& o)
   a.size = 2;
   a.crush_ruleset = 3;
   a.object_hash = 4;
+  a.balance_param = 1.0;
   a.pg_num = 6;
   a.pgp_num = 5;
   a.last_change = 9;
@@ -1338,6 +1362,7 @@ ostream& operator<<(ostream& out, const pg_pool_t& p)
       << " min_size " << p.get_min_size()
       << " crush_ruleset " << p.get_crush_ruleset()
       << " object_hash " << p.get_object_hash_name()
+      << " balance_param " << p.get_balance_param()
       << " pg_num " << p.get_pg_num()
       << " pgp_num " << p.get_pgp_num()
       << " last_change " << p.get_last_change();
diff --git a/src/osd/osd_types.h b/src/osd/osd_types.h
index 4dab643..b9c255c 100644
--- a/src/osd/osd_types.h
+++ b/src/osd/osd_types.h
@@ -897,6 +897,7 @@ struct pg_pool_t {
   __u8 size, min_size;      ///< number of osds in each pg
   __u8 crush_ruleset;       ///< crush placement rule set
   __u8 object_hash;         ///< hash mapping object name to ps
+  float balance_param;
 private:
   __u32 pg_num, pgp_num;    ///< number of pgs
 
@@ -984,7 +985,7 @@ public:
 
   pg_pool_t()
     : flags(0), type(0), size(0), min_size(0),
-      crush_ruleset(0), object_hash(0),
+      crush_ruleset(0), object_hash(0), balance_param(1),
       pg_num(0), pgp_num(0),
       last_change(0),
       last_force_op_resend(0),
@@ -1031,6 +1032,7 @@ public:
   unsigned get_min_size() const { return min_size; }
   int get_crush_ruleset() const { return crush_ruleset; }
   int get_object_hash() const { return object_hash; }
+  float get_balance_param() const { return balance_param; }
   const char *get_object_hash_name() const {
     return ceph_str_hash_name(get_object_hash());
   }
@@ -1070,6 +1072,8 @@ public:
   unsigned get_pg_num_mask() const { return pg_num_mask; }
   unsigned get_pgp_num_mask() const { return pgp_num_mask; }
 
+  void set_balance_param(float b) { balance_param = b; }
+
   // if pg_num is not a multiple of two, pgs are not equally sized.
   // return, for a given pg, the fraction (denominator) of the total
   // pool size that it represents.
@@ -1145,6 +1149,7 @@ public:
    * seeds.
    */
   ps_t raw_pg_to_pps(pg_t pg) const;
+  ps_t raw_pg_to_congruential_pps(pg_t pg) const;
 
   /// choose a random hash position within a pg
   uint32_t get_random_pg_position(pg_t pgid, uint32_t seed) const;
diff --git a/src/tools/crushtool.cc b/src/tools/crushtool.cc
index 08818c4..cd93151 100644
--- a/src/tools/crushtool.cc
+++ b/src/tools/crushtool.cc
@@ -91,7 +91,7 @@ void usage()
   cout << "                         specify output for for (de)compilation\n";
   cout << "   --build --num_osds N layer1 ...\n";
   cout << "                         build a new map, where each 'layer' is\n";
-  cout << "                           'name (uniform|straw|list|tree) size'\n";
+  cout << "                           'name (uniform|straw|list|tree|linear) size'\n";
   cout << "   -i mapfn --test       test a range of inputs on the map\n";
   cout << "      [--min-x x] [--max-x x] [--x x]\n";
   cout << "      [--min-rule r] [--max-rule r] [--rule r]\n";
@@ -148,6 +148,7 @@ struct bucket_types_t {
   { "list", CRUSH_BUCKET_LIST },
   { "straw", CRUSH_BUCKET_STRAW },
   { "tree", CRUSH_BUCKET_TREE },
+  { "linear", CRUSH_BUCKET_LINEAR },
   { 0, 0 },
 };
 

[-- Attachment #3: crush_proposals.pdf --]
[-- Type: application/pdf, Size: 948539 bytes --]

[-- Attachment #4: crush_optimization.pdf --]
[-- Type: application/pdf, Size: 607981 bytes --]


* RE: FW: CURSH optimization for unbalanced pg distribution
       [not found]         ` <51FC7A40FB29414D88A121A7FFEF9A4710F570D6@SHSMSX104.ccr.corp.intel.com>
@ 2014-09-10  1:06           ` Sage Weil
  2014-09-10  1:32             ` Zhang, Jian
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2014-09-10  1:06 UTC (permalink / raw)
  To: Zhang, Jian; +Cc: Loic Dachary, ceph-devel, He, Yujie

The lists are rejecting the email because of the big attachments.  Send 
with links instead?

On Tue, 9 Sep 2014, Zhang, Jian wrote:

> Yujie sent out the following email yesterday, but it seems it was missed. Resending it. 
> 
> [...]


* RE: FW: CURSH optimization for unbalanced pg distribution
  2014-09-10  1:06           ` Sage Weil
@ 2014-09-10  1:32             ` Zhang, Jian
  2014-09-10  2:16               ` Mark Nelson
  2014-09-12  4:41               ` Sage Weil
  0 siblings, 2 replies; 9+ messages in thread
From: Zhang, Jian @ 2014-09-10  1:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, ceph-devel, He, Yujie

Thanks. 

Created a feature here to include all the attachments: http://tracker.ceph.com/issues/9410
http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modify.patch
http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf


Hi all,
    Several months ago we hit a read performance issue (17% degradation) while evaluating ceph object storage performance with 10M objects (scaling from 10K objects to 1 million objects), and found that the root cause was unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We investigated further and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.

Key Message:
    As mentioned in the attached pdf, we described possible optimization proposals (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage suggested we take the idea of "Change placement strategy only for step of selecting devices from hosts" by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve a balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module; please refer to the Design and Implementation part for details. We also wrote a POC for it (see the attached patch), and as a result we got more than 10% read performance improvement using the optimized CRUSH algorithm.

Design and Implementation:
1. Problem Identification
1.1 Input key (pps) space of CRUSH is not uniform
    Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes a raw pg to a pps.
1.2 Algorithm of selecting items from buckets is not uniform
    Once we have a uniform input key space, we must also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach that goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.
2. Design
2.1 New pps hash algorithm
    We designed the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a bijection between the original sequence {0,...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}.
    Assuming there are np PGs in a pool, we can regard pgid (0≤pgid<2^n, np≤2^n<2*np) as the key, which is then hashed into a pps value between 0 and 2^n-1. Since the PG# of a pool is usually an exact power of two, in that case the generator simply shuffles the original pgid sequence, making the key space a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed of the generator, producing different pps values even for the same pgid under different poolids. Therefore, the pgid sequences of different pools are mapped onto distinct pps sequences, getting rid of PG overlap.
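    To make this concrete, here is a minimal standalone sketch of the generator, mirroring the arithmetic of raw_pg_to_congruential_pps() in the attached patch; the free-function form and the assumption that pgp_num is a power of two (so pgp_num_mask == pgp_num - 1) are ours:

        #include <cstdint>

        // Shuffle the pgid sequence, seeded by poolid. The underlying
        // congruential map x -> a*x + c (mod 2^n) with odd a (here 1033)
        // is a bijection on {0,...,2^n-1}; the expression below mirrors
        // the POC patch verbatim.
        static uint32_t congruential_pps(uint32_t pgid, uint32_t poolid,
                                         uint32_t pgp_num_mask)
        {
          uint32_t stable = pgid & pgp_num_mask;  // ceph_stable_mod for 2^n pgs
          return (1033u * stable + 2u * poolid + 1u) % pgp_num_mask + poolid;
        }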
2.2 New bucket type, Linear
    We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. The pps values assigned to each host are a pseudo-random subset of the original permutation and may themselves be non-uniform, in which case applying the modulo operation directly to the integers in the subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we scale the key by 1/balance_param before applying the modulo method.
    For osd failure and recovery, we assume that items nested in this kind of bucket will not be removed and that no new items will be added, the same assumption the UNIFORM bucket makes. A linear host will therefore not introduce more data movement than a uniform bucket.
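    As a sketch of the selection step (this is only our reading of the description above; the real bucket code is in the attached patch, and these names are illustrative):

        #include <cstdint>

        // Choose a disk from a 'linear' host: scale the key by
        // 1/balance_param to smooth the pseudo-random subset, then take a
        // plain modulo over the number of nested items.
        static int linear_choose(uint32_t pps, int num_items, float balance_param)
        {
          uint32_t key = (uint32_t)((float)pps / balance_param);
          return (int)(key % (uint32_t)num_items);  // disk = key % num_disks
        }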
2.3 Adaptive Strategy
    Since no single constant balance parameter produces the best PG distribution in all cases, we make the choice adaptive: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, in order to obtain the most uniform distribution.
    1) Try different balance_params when preparing for a new pool:
       a. Iteratively call CRUSH(map, ruleno, x, balance_param) to get the corresponding PG distribution for each balance_param
       b. Calculate the stdev of PG# among all osds
       c. Choose the balance_param with the minimal stdev
    2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
    The adaptive procedure can be described as follows; a compilable sketch is given after the pseudocode:
    Input: cluster map, total PG number m, adaptive retry times n
    Output: local optimal balance parameter balance_param

    min_pg_stdev = MAX;
    balance_param = a;  // initial value
    for trial from 0 to n {
        for pgid from 0 to m {
            calculate pps using the new generator in 2.1;
            for bucket b in cluster map  // apply CRUSH algorithm
                apply the corresponding bucket hashing algorithm and get an osd list for pgid
        }
        calculate pg_stdev_a by gathering all osd lists;  // stdev of PG distribution among all osds
        if pg_stdev_a < min_pg_stdev {
            min_pg_stdev = pg_stdev_a;
            balance_param = a;
        }
        adjust a to a new value;
    }
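    A compilable rendering of the same search (map_all_pgs() stands in for the full CRUSH pass over all pgids; the initial value and step size for a are illustrative, not from the patch):

        #include <cmath>
        #include <vector>

        // stdev of per-osd PG counts for one candidate balance_param.
        static double pg_stdev(const std::vector<int>& pgs_per_osd)
        {
          double mean = 0.0, var = 0.0;
          for (int c : pgs_per_osd) mean += c;
          mean /= pgs_per_osd.size();
          for (int c : pgs_per_osd) var += (c - mean) * (c - mean);
          return std::sqrt(var / pgs_per_osd.size());
        }

        // map_all_pgs(a) maps every pgid through CRUSH with balance_param a
        // and returns the per-osd PG counts (the inner loops above).
        static float adapt_balance_param(int n, std::vector<int> (*map_all_pgs)(float))
        {
          double min_pg_stdev = 1e300;
          float balance_param = 1.0f, a = 1.0f;  // initial value assumed
          for (int trial = 0; trial <= n; ++trial) {
            double s = pg_stdev(map_all_pgs(a));
            if (s < min_pg_stdev) {
              min_pg_stdev = s;
              balance_param = a;
            }
            a += 0.1f;  // "adjust a to a new value" (step size assumed)
          }
          return balance_param;
        }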


Evaluation:
    We evaluated the performance of the optimized and the current CRUSH in a cluster of 4 hosts, each with 10x1TB disks attached. We designed two test cases: the first created a pool with 2048 pgs and 2 replicas, prepared 100 million 128KB objects, and then measured read performance on those objects; the second created a pool with 2048 pgs and 3 replicas, prepared 1 million 10MB objects, and then measured read performance on those objects.
    We compared the PG & data distribution and the read performance of the two CRUSH algorithms, with the following results:
1. PG and data distribution is more balanced with the optimized CRUSH algorithm
    a) For 2048 PGs with 3 replicas, the stdev of PG# across 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, it decreases from 10.09 to 6.50 (see the scale note below)
    b) For 1 million 10MB objects with 3 replicas, the stdev of disk use% across 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, it decreases from 0.067 to 0.042
2. Performance at large scale is improved since data distribution is more balanced
    a) More than 10% read performance improvement for both 128KB and 10MB objects
    b) Write performance is not impacted
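    (Scale note for 1a: 2048 PGs x 3 replicas over 40 osds gives 2048*3/40 = 153.6 PGs per osd on average, so the stdev drop from 12.13 to 5.17 is roughly 7.9% -> 3.4% of the mean; with 2 replicas the mean is 2048*2/40 = 102.4, so 10.09 -> 6.50 is roughly 9.9% -> 6.3%.)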
Detailed performance data can be found in http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf.

We also created a pull request: https://github.com/ceph/ceph/pull/2402 


Thanks
Jian


-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Wednesday, September 10, 2014 9:06 AM
To: Zhang, Jian
Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
Subject: RE: FW: CURSH optimization for unbalanced pg distribution

The lists are rejecting the email because of the big attachments.  Send with links instead?

On Tue, 9 Sep 2014, Zhang, Jian wrote:

> Yujie sent out the following email yesterday, but it seems it was missed. Resending it. 
> 
> [...]


* Re: FW: CURSH optimization for unbalanced pg distribution
  2014-09-10  1:32             ` Zhang, Jian
@ 2014-09-10  2:16               ` Mark Nelson
  2014-09-10  2:56                 ` He, Yujie
  2014-09-12  4:41               ` Sage Weil
  1 sibling, 1 reply; 9+ messages in thread
From: Mark Nelson @ 2014-09-10  2:16 UTC (permalink / raw)
  To: Zhang, Jian, Sage Weil; +Cc: Loic Dachary, ceph-devel, He, Yujie


Very interesting!  I will need to sit down and read through the
attachments tomorrow.  Did you find that this method was superior to
simply reweighting the OSDs based on the quality of the distribution
using the Jenkins hash?  Clearly this isn't easy with lots of pools,
though with some fancy crush rule management you might be able to get
around it to a limited extent.
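
By reweighting I mean something like the following sketch (purely
illustrative, not the proposal under discussion): scale each OSD's
reweight by its expected/actual PG count so over-full OSDs shed PGs on
the next rebalance.

    #include <vector>

    // Illustrative: per-osd reweight factors from observed PG counts.
    static std::vector<double> reweight_by_pg(const std::vector<int>& pgs_per_osd,
                                              int total_placements)
    {
      double expected = (double)total_placements / pgs_per_osd.size();
      std::vector<double> w(pgs_per_osd.size(), 1.0);
      for (size_t i = 0; i < w.size(); ++i)
        if (pgs_per_osd[i] > 0)
          w[i] = expected / pgs_per_osd[i];  // >1 under-full, <1 over-full
      return w;
    }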

I haven't done any extensive analysis, but I've also wondered if simply
replacing Jenkins with something like City, Murmur3, or Spooky might
improve distribution quality (and especially whether it would improve
distribution quality given small changes in the input).

Really glad you guys are looking into this!

Mark

On 09/09/2014 08:32 PM, Zhang, Jian wrote:
> Thanks.
> 
> Created a feature here: http://tracker.ceph.com/issues/9410, to include all the attachments. .
> http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modify.patch
> http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
> http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf
> 
> 
> Hi all,
>      Several months ago we met an issue of read performance issues (17% degradation) when working on ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1Million objects) , and found the root cause is unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We did some further investigation then and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
> 
> Key Message:
>      As mentioned in the attached pdf, we described possible optimization proposals (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for CRUSH and got some feedback from community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979 ) . Sage suggested us take the idea of "Change placement strategy only for step of selecting devices from hosts", by adding a new bucket type called "linear", and applying a modulo-like hash function to this kind of buckets to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote some POC for it, see the attached patch. And as a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.
> 
> Design and Implementation:
> 1.Problem Identification
> 1.1 Input key (pps) space of CRUSH is not uniform
>      Since PG# on the nested devices of a host is not uniform even if we select the device using simple modulo operation, we decide to change the algorithm of hashing raw pg to pps.
> 1.2 Algorithm of selecting items from buckets is not uniform
>      After we get uniform input key space, we should make the procedure of selecting devices from host be uniform. Since current CRUSH algorithm uses Jenkins hash based strategies and failed to reach the goal, we decide to add a new bucket type and apply new (modulo based) hash algorithm to make it.
> 2.Design
> 2.1New pps hash algorithm
>      We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf) . It defines a bijection between the original sequence {0, ...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}.
>      Assume there are np PGs in a pool, we can regard pgid (0¡Üpgid<2^n, np¡Ü2^n<2*np) as the key, and then it will be hashed into a pps value between 0 and 2^n-1. Since PG# in a pool is usually 2^N, the generator just shuffles the original pgid sequence as output in this case, making the key space consisting of a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed in the generator, producing different pps value even with the same pgid but different poolid. Therefore, pgid sequences of various pools are mapped into distinct pps sequences, getting rid of PG overlapping.
> 2.2 New bucket type, Linear
>      We introduce a new bucket type called "linear", and apply a new modulo based hash algorithm to it. As the pps values assigned to each host are a pseudo-random subset of the original permutation and is possibly out of uniformity, in which situation applying modulo operation directly on integers in the subset cannot produce balanced distribution among disks in the host. To decrease deviation of the subset, we apply a balance parameter 1/balance_param to the key before conducting the modulo method.
>      For osd failure and recovery, it assumes that items nested in this kind of bucket will not be removed, nor new items are added, same as the UNIFORM bucket. Linear host will not introduce more data movement than the uniform bucket.
> 2.3 Adaptive Strategy
>      Since there is no one constant balance parameter applying for all cases that will result in the best PG distribution. We make it an adaptive procedure by adjusting the balance parameter automatically during the preparation for creating a new pool, according to different cluster topology, PG# and replica#, in order to gain a most uniform distribution.
>      1) Try different balance_param when preparing for a new pool
>      a. Iteratively call CRUSH(map, ruleno, x, balance_param) to get corresponding PG distribution with different balance_params
> ¡¡¡¡b. Calculate stdev of PG# among all osds
> ¡¡  c. Choose the balance_param with the minimal stdev
>   	2) Add a member variable to pool struct pg_pool_t to save the best balance_param value
>      The adaptive procedure can be described as following:
>      Input: cluster map, total PG number m, adaptive retry times n
>      Output: local optimal balance parameter balance_param min_pg_stdev = MAX; balance_param = a; // initial value for trial from 0 to n {
>      for pgid from 0 to m {
>          calculate pps using the new generator in 2.1;
>          for bucket b in cluster map // apply CRUSH algorithm
>              apply corresponding bucket hashing algorithm and get a osd list for pgid
>      }
>      calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
>      if pg_stdev_a < min_pg_stdev {
>          min_pg_stdev = pg_stdev_a;
>          balance_param = a;
>      }
>      adjust a to a new value;
> }
> 
> 
> Evaluation:
>      We evaluated the performance of optimized and current CRUSH in a cluster consisting of 4 hosts, and each attaching with 10x1T disks. We designed two test cases, for the first one, by creating a pool with 2048 pgs, 2 replicas, preparing 100 million 128KB objects, and then evaluating read performance of these objects; for the second one, by creating a pool with 2048 pgs, 3 replicas, preparing 1 million 10MB objects, and then evaluating read performance of these objects.
>      We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
> 1.PG and data distribution is more balanced using optimized CRUSH algorithm
>      a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
>      b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
> 2.Large scaled performance is improved since data distribution is more balanced
>      a) More than 10% performance improvement for 128K and 10M read
>      b) Write performance not impacted
> Detailed performance data can be found in the http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf .
> 
> We also created a pull request: https://github.com/ceph/ceph/pull/2402
> 
> 
> Thanks
> Jian
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, September 10, 2014 9:06 AM
> To: Zhang, Jian
> Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
> Subject: RE: FW: CURSH optimization for unbalanced pg distribution
> 
> The lists are rejecting the email because of the big attachments.  Send with links instead?
> 
> On Tue, 9 Sep 2014, Zhang, Jian wrote:
> 
>> Yujie sent out the following email yesterday, but it seems it was missed. Resending it.
>>
>> =============
>> Hi all,
>> ?  Several months ago we met an issue of read performance issues (17% degradation) when working on ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1Million objects) , and found the root cause is unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We did some further investigation then and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
>>
>> Key Message:
>> ?  As mentioned in the attached pdf, we described possible optimization proposals for CRUSH and got some feedback from community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979) . Sage suggested us take the idea of "Change placement strategy only for step of selecting devices from hosts", by adding a new bucket type called ?linear?, and applying a modulo-like hash function to this kind of buckets to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote some POC for it, see the attached patch. And as a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.
>>
>> Design and Implementation:
>> 1.   Problem Identification
>> 1.1  Input key (pps) space of CRUSH is not uniform
>>      Since PG# on the nested devices of a host is not uniform even if we select the device using simple modulo operation, we decide to change the algorithm of hashing raw pg to pps.
>> 1.2  Algorithm of selecting items from buckets is not uniform
>>      After we get uniform input key space, we should make the procedure of selecting devices from host be uniform. Since current CRUSH algorithm uses Jenkins hash based strategies and failed to reach the goal, we decide to add a new bucket type and apply new (modulo based) hash algorithm to make it.
>> 2.   Design
>> 2.1  New pps hash algorithm
>>      We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf) . It defines a bijection between the original sequence {0, ...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}.
>>      Assume there are np PGs in a pool, we can regard pgid (0?pgid<2^n, np?2^n<2*np) as the key, and then it will be hashed into a pps value between 0 and 2^n-1. Since PG# in a pool is usually 2^N, the generator just shuffles the original pgid sequence as output in this case, making the key space consisting of a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed in the generator, producing different pps value even with the same pgid but different poolid. Therefore, pgid sequences of various pools are mapped into distinct pps sequences, getting rid of PG overlapping.
>> 2.2  New bucket type, Linear
>>      We introduce a new bucket type called "linear", and apply a new modulo based hash algorithm to it. As the pps values assigned to each host are a pseudo-random subset of the original permutation and is possibly out of uniformity, in which situation applying modulo operation directly on integers in the subset cannot produce balanced distribution among disks in the host. To decrease deviation of the subset, we apply a balance parameter 1/balance_param to the key before conducting the modulo method.
>>      For osd failure and recovery, it assumes that items nested in this kind of bucket will not be removed, nor new items are added, same as the UNIFORM bucket. Linear host will not introduce more data movement than the uniform bucket.
>> 2.3  Adaptive Strategy
>>      Since there is no one constant balance parameter applying for all cases that will result in the best PG distribution. We make it an adaptive procedure by adjusting the balance parameter automatically during the preparation for creating a new pool, according to different cluster topology, PG# and replica#, in order to gain a most uniform distribution.
>> ??1) Try different balance_param when preparing for a new pool
>> ????- Iteratively call CRUSH(map, ruleno, x, balance_param) to get
>> corresponding PG distribution with different balance_params
>> ????- Calculate stdev of PG# among all osds
>> ????- Choose the balance_param with the minimal stdev
>>   	2) Add a member variable to pool struct pg_pool_t to save the best
>> balance_param value ??The adaptive procedure can be described as following:
>> Input: cluster map, total PG number m, adaptive retry times n
>> Output: local optimal balance parameter balance_param min_pg_stdev = MAX; balance_param = a; // initial value for trial from 0 to n {
>>      for pgid from 0 to m {
>>          calculate pps using the new generator in 2.1;
>>          for bucket b in cluster map // apply CRUSH algorithm
>>              apply corresponding bucket hashing algorithm and get a osd list for pgid
>>      }
>>      calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
>>      if pg_stdev_a < min_pg_stdev {
>>          min_pg_stdev = pg_stdev_a;
>>          balance_param = a;
>>      }
>>      adjust a to a new value;
>> }
>>
>>
>> Evaluation:
>>      We evaluated the performance of optimized and current CRUSH in a cluster consisting of 4 hosts, and each attaching with 10x1T disks. We designed two test cases, for the first one, by creating a pool with 2048 pgs, 2 replicas, preparing 100 million 128KB objects, and then evaluating read performance of these objects; for the second one, by creating a pool with 2048 pgs, 3 replicas, preparing 1 million 10MB objects, and then evaluating read performance of these objects.
>>      We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
>> 1.   PG and data distribution is more balanced using optimized CRUSH algorithm
>> ??a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases
>> from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40
>> osds decreases from 10.09 to 6.50
>> ??b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
>> 2.   Large scaled performance is improved since data distribution is more balanced
>> ??a) More than 10% performance improvement for 128K and 10M read
>> ??b) Write performance not impacted
>> Detailed performance data can be found in the attached pdf (crush_optimization).
>>
>> We also created a pull request: https://github.com/ceph/ceph/pull/2402
>>
>> Thanks
>> Jian
>>
>>
>> -----Original Message-----
>> From: Loic Dachary [mailto:loic@dachary.org]
>> Sent: Tuesday, September 09, 2014 9:36 PM
>> To: Zhang, Jian; ceph-devel@vger.kernel.org
>> Subject: Re: FW: CURSH optimization for unbalanced pg distribution
>>
>>
>>
>> On 20/03/2014 04:54, Zhang, Jian wrote:
>>> Forwarding per Sage's suggestion.
>>
>> Very interesting discussion :-) For the record the corresponding pull
>> request is https://github.com/ceph/ceph/pull/2402
>>
>>>
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@inktank.com]
>>> Sent: Wednesday, March 19, 2014 11:29 PM
>>> To: Mark Nelson
>>> Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
>>> Subject: Re: CURSH optimization for unbalanced pg distribution
>>>
>>> On Wed, 19 Mar 2014, Mark Nelson wrote:
>>>> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
>>>>> For more detail data, please refer to the *Testing results* part.
>>>>>
>>>>> *Optimization proposals: *
>>>>>
>>>>> After we dived into the source code of CRUSH and related papers,
>>>>> we proposed two possible optimizations:
>>>>>
>>>>> 1.Add different hash algorithms, as an alternative to the
>>>>> Jenkins hash, e.g. an algorithm that will produce even values when
>>>>> the range of the input value (pg#) is relatively small. Or add a new bucket
>>>>> type at the same time if necessary.
>>>
>>> This *might* work, but I don't have a strong intuition about it.  The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent imbalance for low values of n (num pgs in our case).  I have generally assumed we can't do better than "random", and still have the other properties we want (independent, deterministic placement), but it may be possible.
>>>
>>>>>
>>>>> 2.Find a better replica placement strategy instead of current
>>>>> retry logic of crush_choose_firstn, which may cause CRUSH to behave badly.
>>>>>
>>>>> We find there are several thresholds of retry times by referring to the
>>>>> code: choose_total_tries, choose_local_tries and choose_local_fallback_tries.
>>>>> They are used to decide whether to do a retry_bucket,
>>>>> retry_descent or use permutation to do an exhaustive bucket
>>>>> search. We are wondering if there is another retry strategy:
>>>>>
>>>>> a)Backtracking retry. Now the logic of crush_choose_firstn can
>>>>> only issue a retry either from the initial bucket (retry_descent)
>>>>> or from the current bucket (retry_bucket); how about retrying the intervening buckets?
>>>>>
>>>>> b)Adjust thresholds of retry times by other values. We did notice
>>>>> that the 'optimal' crush tunables could be used for this, but we
>>>>> still encounter unbalanced pg distribution using the optimal strategy.
>>>>> Please refer to 4 of the Testing results part.
>>>>>
>>>>> c)Add a mechanism that can adjust the above-mentioned thresholds
>>>>> adaptively. Maybe we can record the retry times of the previous
>>>>> call for CRUSH, and adjust retry thresholds automatically according to the record.
>>>
>>> I suggest ignoring all of this retry logic.  The original version of
>>> CRUSH has the local retries to try to make data move "less far", but
>>> when we went back a year ago and did a statistical analysis of the
>>> distribution we found that *all* of these hacks degraded the quality
>>> of the placement, and by turning them all off (setting the 'optimal'
>>> values which zeroes them all out except for total_retries) we got
>>> something that was indistinguishable from a uniform distribution.
>>>
>>>>> 3.Add soft link for pg directories. During pg creation, we can
>>>>> create soft links for the pgs if pg# on the selected osd is more
>>>>> than some threshold, say 10% more than desired average number, to
>>>>> move objects that will be stored in this pg to another osd.
>>>>> Balanced disk utilization may be gained in this way.
>>>
>>> I think you need to be careful, but yes, this is an option.  There
>>> is a similar exception mechanism in place that is used for other
>>> purposes and something similar could be done here.  The main
>>> challenge will be in ensuring that the soft links/exceptions follow
>>> the same overall policy that CRUSH does after the raw mapping is
>>> performed.  This is an option, but I would put it toward the bottom of the list...
>>>
>>>>> 4.Change placement strategy only for step of selecting devices
>>>>> from hosts. We found in our testing results that pg distribution
>>>>> was balanced among hosts, which is reasonable since pg# of each
>>>>> host is above 1K (according to the current BKM that pg# per osd
>>>>> should be about 100). So how about we apply CRUSH only on the
>>>>> internal buckets and find another simple but more balanced method to choose an osd from the host?
>>>
>>> This idea has a lot of potential.  For example:
>>>
>>> If you know the chassis can hold 12 disks, you can force the bucket
>>> size to twelve and somehow prevent users from adjusting the
>>> structure of the tree.  Then you can use a simple mapping that is
>>> truly flat (like a linear mapping, disk = x % num_disks) for that bucket/subtree.
>>> The downside of course is that if you remove a disk *everything*
>>> reshuffles, hence the need for some sort of guardrails to prevent a user from
>>> inadvertently doing that.  If a disk *does* fail, you just need to
>>> make sure the disk is marked "out" but not removed from the CRUSH hierarchy and the normal retry will kick in.
>>>
>>> Note that all this is really doing is increasing the size of the "buckets"
>>> that we are (pseudo)randomly distributing over.  It is still a
>>> random/uniform distribution, but the N value is 12 times bigger (for
>>> a
>>> 12 disk chassis) and as a result the variance is substantially lower.
>>>
>>> I would suggest making a new bucket type that is called 'linear' and
>>> does a simple modulo and trying this out.  We will need a bunch of
>>> additional safety checks to help users avoid doing silly things
>>> (like adjusting the number of items in the linear buckets, which
>>> reshuffles
>>> everything) but that wouldn't be needed for an initial analysis of the performance impact.
>>>
>>> Do you mind if we shift this thread over to ceph-devel?  I think
>>> there are lots of people who would be interested in this discussion.
>>> We can of course leave off your attachment if you prefer.
>>>
>>> Thanks!
>>> sage
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: FW: CURSH optimization for unbalanced pg distribution
  2014-09-10  2:16               ` Mark Nelson
@ 2014-09-10  2:56                 ` He, Yujie
  0 siblings, 0 replies; 9+ messages in thread
From: He, Yujie @ 2014-09-10  2:56 UTC (permalink / raw)
  To: Mark Nelson, Zhang, Jian, Sage Weil; +Cc: Loic Dachary, ceph-devel

Hi Mark,
In my opinion, the method of reweighting osds based on the current pg distribution may not work well if we need to create more pools later, since reweighting all osds again after creating new pools would reshuffle data across all osds. In our proposal, we just give each pool its own parameters, trying to achieve a balanced pg distribution for each pool without influencing pools created in the future.

For other hash algorithms, we did try the City algorithm, but got no better results than with Jenkins. Maybe we need to take a look at others.

Thanks,
Yujie

-----Original Message-----
From: Mark Nelson [mailto:mark.nelson@inktank.com] 
Sent: Wednesday, September 10, 2014 10:17 AM
To: Zhang, Jian; Sage Weil
Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
Subject: Re: FW: CURSH optimization for unbalanced pg distribution

Very interesting!  I will need to sit down and read through the attachments tomorrow.  Did you find that this method was superior to simply reweighting the OSDs based on the quality of the distribution obtained with the Jenkins hash?  Clearly this isn't easy with lots of pools, though with some fancy crush rule management you might be able to get around it to a limited extent.

I haven't done any extensive analysis, but I've also wondered if simply replacing Jenkins with something like City, Murmur3, or Spooky might improve distribution quality (and especially whether it would improve distribution quality given small changes in the input).

Really glad you guys are looking into this!

Mark

On 09/09/2014 08:32 PM, Zhang, Jian wrote:
> Thanks.
> 
> Created a feature here: http://tracker.ceph.com/issues/9410, to include all the attachments:
> http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modif
> y.patch 
> http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
> http://tracker.ceph.com/attachments/download/1385/crush_optimization.p
> df
> 
> 
> Hi all,
>      Several months ago we hit a read performance issue (17% degradation) when working on ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
> 
> Key Message:
>      As mentioned in the attached pdf, we described possible optimization proposals (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979) . Sage suggested we take the idea of "Change placement strategy only for step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote a POC for it; see the attached patch. As a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.
> 
> Design and Implementation:
> 1.Problem Identification
> 1.1 Input key (pps) space of CRUSH is not uniform
>      Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes a raw pg to a pps.
> 1.2 Algorithm of selecting items from buckets is not uniform
>      Once we have a uniform input key space, we should also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.
> 2.Design
> 2.1 New pps hash algorithm
>      We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf) . It defines a bijection between the original sequence {0, ...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}.
>      Assume there are np PGs in a pool; we can regard pgid (0≤pgid<2^n, np≤2^n<2*np) as the key, which is then hashed into a pps value between 0 and 2^n-1. Since the PG# in a pool is usually a power of two, in this case the generator just shuffles the original pgid sequence, making the key space a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed of the generator, producing different pps values even for the same pgid under different poolids. Therefore, the pgid sequences of different pools are mapped into distinct pps sequences, getting rid of PG overlapping.
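
A minimal sketch of such a generator in C, using the concrete formula quoted later in this thread (the multiplier 1033 and the trailing pool offset come from that discussion; treat this as illustrative rather than as the exact patch):

/* Congruential permutation of {0, ..., pg_num_mask}: an odd multiplier
 * makes a*x + c a bijection mod 2^n, so every pgid in the pool gets a
 * distinct pps in the same range. */
static unsigned pg_to_pps(unsigned pgid, unsigned poolid,
                          unsigned pg_num_mask)
{
        unsigned stable = pgid & pg_num_mask;  /* key in {0, ..., 2^n-1} */
        unsigned pps = (1033 * stable + 2 * poolid + 1) & pg_num_mask;
                                               /* == % (pg_num_mask + 1) */
        return pps + poolid;  /* shift each pool's output range */
}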
> 2.2 New bucket type, Linear
>      We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. Since the pps values assigned to each host are a pseudo-random subset of the original permutation and may not be uniform, applying the modulo operation directly to the integers in that subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before the modulo step.
>      For osd failure and recovery, it is assumed that items nested in this kind of bucket will not be removed and no new items added, the same as for the UNIFORM bucket. A linear host will therefore not introduce more data movement than a uniform bucket.
> 2.3 Adaptive Strategy
>      Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make this an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, so as to obtain the most uniform distribution.
>      1) Try different balance_param values when preparing for a new pool
>         a. Iteratively call CRUSH(map, ruleno, x, balance_param) to get the corresponding PG distribution for each balance_param
>         b. Calculate the stdev of PG# among all osds
>         c. Choose the balance_param with the minimal stdev
>      2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
>      The adaptive procedure can be described as follows:
>      Input: cluster map, total PG number m, adaptive retry times n
>      Output: local optimal balance parameter balance_param
>
> min_pg_stdev = MAX; balance_param = a; // initial value
> for trial from 0 to n {
>      for pgid from 0 to m {
>          calculate pps using the new generator in 2.1;
>          for bucket b in cluster map // apply CRUSH algorithm
>              apply corresponding bucket hashing algorithm and get an osd list for pgid
>      }
>      calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
>      if pg_stdev_a < min_pg_stdev {
>          min_pg_stdev = pg_stdev_a;
>          balance_param = a;
>      }
>      adjust a to a new value;
> }
> 
> 
> Evaluation:
>     We evaluated the performance of the optimized and the current CRUSH in a cluster of 4 hosts, each with 10x1TB disks attached. We designed two test cases: the first created a pool with 2048 pgs and 2 replicas, prepared 100 million 128KB objects, and then evaluated read performance on these objects; the second created a pool with 2048 pgs and 3 replicas, prepared 1 million 10MB objects, and then evaluated read performance on these objects.
>      We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
> 1.PG and data distribution is more balanced using the optimized CRUSH algorithm
>     a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
>     b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
> 2.Large-scale performance is improved since data distribution is more balanced
>     a) More than 10% performance improvement for 128KB and 10MB reads
>     b) Write performance is not impacted
> Detailed performance data can be found in http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf .
> 
> We also created a pull request: https://github.com/ceph/ceph/pull/2402
> 
> 
> Thanks
> Jian
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, September 10, 2014 9:06 AM
> To: Zhang, Jian
> Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
> Subject: RE: FW: CURSH optimization for unbalanced pg distribution
> 
> The lists are rejecting the email because of the big attachments.  Send with links instead?
> 
> On Tue, 9 Sep 2014, Zhang, Jian wrote:
> 
>> Yujie sent out the following email yesterday, but it seems it was missed. Resending it.
>>
>> =============
>> Hi all,
>>      Several months ago we hit a read performance issue (17% degradation) when working on ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
>>
>> Key Message:
>>      As mentioned in the attached pdf, we described possible optimization proposals for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979) . Sage suggested we take the idea of "Change placement strategy only for step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote a POC for it; see the attached patch. As a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.
>>
>> Design and Implementation:
>> 1.   Problem Identification
>> 1.1  Input key (pps) space of CRUSH is not uniform
>>      Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes a raw pg to a pps.
>> 1.2  Algorithm of selecting items from buckets is not uniform
>>      Once we have a uniform input key space, we should also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.
>> 2.   Design
>> 2.1  New pps hash algorithm
>>      We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf) . It defines a bijection between the original sequence {0, ...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}.
>>      Assume there are np PGs in a pool; we can regard pgid (0≤pgid<2^n, np≤2^n<2*np) as the key, which is then hashed into a pps value between 0 and 2^n-1. Since the PG# in a pool is usually a power of two, in this case the generator just shuffles the original pgid sequence, making the key space a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed of the generator, producing different pps values even for the same pgid under different poolids. Therefore, the pgid sequences of different pools are mapped into distinct pps sequences, getting rid of PG overlapping.
>> 2.2  New bucket type, Linear
>>      We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. Since the pps values assigned to each host are a pseudo-random subset of the original permutation and may not be uniform, applying the modulo operation directly to the integers in that subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before the modulo step.
>>      For osd failure and recovery, it is assumed that items nested in this kind of bucket will not be removed and no new items added, the same as for the UNIFORM bucket. A linear host will therefore not introduce more data movement than a uniform bucket.
>> 2.3  Adaptive Strategy
>>      Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make this an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, so as to obtain the most uniform distribution.
>> 1) Try different balance_param values when preparing for a new pool
>>    - Iteratively call CRUSH(map, ruleno, x, balance_param) to get the
>> corresponding PG distribution for each balance_param
>>    - Calculate the stdev of PG# among all osds
>>    - Choose the balance_param with the minimal stdev
>> 2) Add a member variable to the pool struct pg_pool_t to save the best
>> balance_param value
>> The adaptive procedure can be described as follows:
>> Input: cluster map, total PG number m, adaptive retry times n
>> Output: local optimal balance parameter balance_param
>>
>> min_pg_stdev = MAX; balance_param = a; // initial value
>> for trial from 0 to n {
>>      for pgid from 0 to m {
>>          calculate pps using the new generator in 2.1;
>>          for bucket b in cluster map // apply CRUSH algorithm
>>              apply corresponding bucket hashing algorithm and get an osd list for pgid
>>      }
>>      calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
>>      if pg_stdev_a < min_pg_stdev {
>>          min_pg_stdev = pg_stdev_a;
>>          balance_param = a;
>>      }
>>      adjust a to a new value;
>> }
>>
>>
>> Evaluation:
>>      We evaluated the performance of the optimized and the current CRUSH in a cluster of 4 hosts, each with 10x1TB disks attached. We designed two test cases: the first created a pool with 2048 pgs and 2 replicas, prepared 100 million 128KB objects, and then evaluated read performance on these objects; the second created a pool with 2048 pgs and 3 replicas, prepared 1 million 10MB objects, and then evaluated read performance on these objects.
>>      We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
>> 1.   PG and data distribution is more balanced using the optimized CRUSH algorithm
>>      a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases
>> from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40
>> osds decreases from 10.09 to 6.50
>>      b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
>> 2.   Large-scale performance is improved since data distribution is more balanced
>>      a) More than 10% performance improvement for 128KB and 10MB reads
>>      b) Write performance is not impacted
>> Detailed performance data can be found in the attached pdf (crush_optimization).
>>
>> We also created a pull request: 
>> https://github.com/ceph/ceph/pull/2402
>>
>> Thanks
>> Jian
>>
>>
>> -----Original Message-----
>> From: Loic Dachary [mailto:loic@dachary.org]
>> Sent: Tuesday, September 09, 2014 9:36 PM
>> To: Zhang, Jian; ceph-devel@vger.kernel.org
>> Subject: Re: FW: CURSH optimization for unbalanced pg distribution
>>
>>
>>
>> On 20/03/2014 04:54, Zhang, Jian wrote:
>>> Forwarding per Sage's suggestion.
>>
>> Very interesting discussion :-) For the record the corresponding pull 
>> request is https://github.com/ceph/ceph/pull/2402
>>
>>>
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@inktank.com]
>>> Sent: Wednesday, March 19, 2014 11:29 PM
>>> To: Mark Nelson
>>> Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
>>> Subject: Re: CURSH optimization for unbalanced pg distribution
>>>
>>> On Wed, 19 Mar 2014, Mark Nelson wrote:
>>>> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
>>>>> For more detail data, please refer to the *Testing results* part.
>>>>>
>>>>> *Optimization proposals: *
>>>>>
>>>>> After we dived into the source code of CRUSH and related papers, 
>>>>> we proposed two possible optimizations:
>>>>>
>>>>> 1.Add different hash algorithms, as an alternative to the
>>>>> Jenkins hash, e.g. an algorithm that will produce even values when
>>>>> the range of the input value (pg#) is relatively small. Or add a new bucket
>>>>> type at the same time if necessary.
>>>
>>> This *might* work, but I don't have a strong intuition about it.  The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent imbalance for low values of n (num pgs in our case).  I have generally assumed we can't do better than "random", and still have the other properties we want (independent, deterministic placement), but it may be possible.
>>>
>>>>>
>>>>> 2.Find a better replica placement strategy instead of current 
>>>>> retry logic of crush_choose_firstn, which may cause CRUSH to behave badly.
>>>>>
>>>>> We find there are several thresholds of retry times by referring to the
>>>>> code: choose_total_tries, choose_local_tries and choose_local_fallback_tries.
>>>>> They are used to decide whether to do a retry_bucket, 
>>>>> retry_descent or use permutation to do an exhaustive bucket 
>>>>> search. We are wondering if there is another retry strategy:
>>>>>
>>>>> a)Backtracking retry. Now the logic of crush_choose_firstn can
>>>>> only issue a retry either from the initial bucket (retry_descent)
>>>>> or from the current bucket (retry_bucket); how about retrying the intervening buckets?
>>>>>
>>>>> b)Adjust thresholds of retry times by other values. We did notice
>>>>> that the 'optimal' crush tunables could be used for this, but we
>>>>> still encounter unbalanced pg distribution using the optimal strategy.
>>>>> Please refer to 4 of the Testing results part.
>>>>>
>>>>> c)Add a mechanism that can adjust the above-mentioned thresholds
>>>>> adaptively. Maybe we can record the retry times of the previous 
>>>>> call for CRUSH, and adjust retry thresholds automatically according to the record.
>>>
>>> I suggest ignoring all of this retry logic.  The original version of 
>>> CRUSH has the local retries to try to make data move "less far", but 
>>> when we went back a year ago and did a statistical analysis of the 
>>> distribution we found that *all* of these hacks degraded the quality 
>>> of the placement, and by turning them all off (setting the 'optimal'
>>> values which zeroes them all out except for total_retries) we got
>>> something that was indistinguishable from a uniform distribution.
>>>
>>>>> 3.Add soft link for pg directories. During pg creation, we can 
>>>>> create soft links for the pgs if pg# on the selected osd is more 
>>>>> than some threshold, say 10% more than desired average number, to 
>>>>> move objects that will be stored in this pg to another osd.
>>>>> Balanced disk utilization may be gained in this way.
>>>
>>> I think you need to be careful, but yes, this is an option.  There 
>>> is a similar exception mechanism in place that is used for other 
>>> purposes and something similar could be done here.  The main 
>>> challenge will be in ensuring that the soft links/exceptions follow 
>>> the same overall policy that CRUSH does after the raw mapping is 
>>> performed.  This is an option, but I would put it toward the bottom of the list...
>>>
>>>>> 4.Change placement strategy only for step of selecting devices 
>>>>> from hosts. We found in our testing results that pg distribution 
>>>>> was balanced among hosts, which is reasonable since pg# of each 
>>>>> host is above 1K (according to the current BKM that pg# per osd 
>>>>> should be about 100). So how about we apply CRUSH only on the 
>>>>> internal buckets and find another simple but more balanced method to choose an osd from the host?
>>>
>>> This idea has a lot of potential.  For example:
>>>
>>> If you know the chassis can hold 12 disks, you can force the bucket 
>>> size to twelve and somehow prevent users from adjusting the 
>>> structure of the tree.  Then you can use a simple mapping that is 
>>> truly flat (like a linear mapping, disk = x % num_disks) for that bucket/subtree.
>>> The downside of course is that if you remove a disk *everything* 
>>> reshuffles, hence the need for some sort of guardrails to prevent a user from
>>> inadvertently doing that.  If a disk *does* fail, you just need to
>>> make sure the disk is marked "out" but not removed from the CRUSH hierarchy and the normal retry will kick in.
>>>
>>> Note that all this is really doing is increasing the size of the "buckets"
>>> that we are (pseudo)randomly distributing over.  It is still a
>>> random/uniform distribution, but the N value is 12 times bigger (for 
>>> a
>>> 12 disk chassis) and as a result the variance is substantially lower.
>>>
>>> I would suggest making a new bucket type that is called 'linear' and 
>>> does a simple modulo and trying this out.  We will need a bunch of 
>>> additional safety checks to help users avoid doing silly things 
>>> (like adjusting the number of items in the linear buckets, which 
>>> reshuffles
>>> everything) but that wouldn't be needed for an initial analysis of the performance impact.
>>>
>>> Do you mind if we shift this thread over to ceph-devel?  I think 
>>> there are lots of people who would be interested in this discussion.
>>> We can of course leave off your attachment if you prefer.
>>>
>>> Thanks!
>>> sage
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
>>
>>
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: FW: CURSH optimization for unbalanced pg distribution
  2014-09-10  1:32             ` Zhang, Jian
  2014-09-10  2:16               ` Mark Nelson
@ 2014-09-12  4:41               ` Sage Weil
  2014-09-12  8:20                 ` He, Yujie
  1 sibling, 1 reply; 9+ messages in thread
From: Sage Weil @ 2014-09-12  4:41 UTC (permalink / raw)
  To: Zhang, Jian; +Cc: Loic Dachary, ceph-devel, He, Yujie

Hi,

This is pretty exciting.  I haven't read through all of it, but have 
some initial comments on the pps mapping portion.

On Wed, 10 Sep 2014, Zhang, Jian wrote:
> Thanks. 
> 
> Created a feature here: http://tracker.ceph.com/issues/9410, to include all the attachments:
> http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modify.patch
> http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
> http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf
> 
> 
> Hi all,
>     Several months ago we hit a read performance issue (17% degradation) when working on ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
> 
> Key Message:
>     As mentioned in the attached pdf, we described possible optimization proposals (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979) . Sage suggested we take the idea of "Change placement strategy only for step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote a POC for it; see the attached patch. As a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.
> 
> Design and Implementation:
> 1.Problem Identification
> 1.1 Input key (pps) space of CRUSH is not uniform
>     Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes a raw pg to a pps.
> 1.2 Algorithm of selecting items from buckets is not uniform
>     Once we have a uniform input key space, we should also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.
> 2.Design
> 2.1 New pps hash algorithm
>     We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf) . It defines a bijection between the original sequence {0, ...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}. 
>     Assume there are np PGs in a pool; we can regard pgid (0≤pgid<2^n, np≤2^n<2*np) as the key, which is then hashed into a pps value between 0 and 2^n-1. Since the PG# in a pool is usually a power of two, in this case the generator just shuffles the original pgid sequence, making the key space a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed of the generator, producing different pps values even for the same pgid under different poolids. Therefore, the pgid sequences of different pools are mapped into distinct pps sequences, getting rid of PG overlapping.

I made a few comments on github at

	https://github.com/ceph/ceph/pull/2402/files#r17462015

I have some questions about the underlying math.  If this is similar to 
the approach used by the uniform buckets, I think 1033 needs to be > the 
denominator?  Also, I looked a bit at the referenced paper and I think the 
denominator should be prime, not 2^n-1 (pg_num_mask).

My other concern is with raw_pg_to_congruential_pps.  Adding poolid into 
the numerator before you do the modulo means that each pool has a 
different permutation.  But, if you have two pools both with (say) 1024 
PGs, they will map to the same 1024 outputs (0..1023).  The pool is added 
in to the final pps, but this doesn't really help as it only means a 
handful of PGs get unique mappings... and there'll be overlap with the next 
pool.  This is exactly the problem we were solving with the HASHPSPOOL 
flag.  Perhaps adding a pseudorandom value between 0 and 2^32 based on 
the poolid will (usually) give the pools distinct output ranges and the 
linear mapping will still be happy with that (since the inputs for each 
pool live in a contiguous range).

In any case, though, yes: this general approach will mean that the pps 
values live in a packed range instead of being spread uniformly across the 
0..2^32 range.

The other concern I have is whether the pgid -> pps mapping is stable when 
pg_num is adjusted up.  Specifically, what we want is that when moving 
from pg_num to pg_num * 2, pg_num of the original inputs will keep the 
same output pps value, while the other half will get a new value.  It 
doesn't seem like this is true for this strategy.  That may be a tradeoff 
the user is willing to make, but we'll need to be very careful about 
making that apparent to the user... it means that bumping pg_num will 
reshuffle all (not just half) of their data for each power of 2.
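
For contrast, the property described here is what upstream Ceph preserves with ceph_stable_mod (sketched below from memory; check the Ceph headers for the authoritative version): when pg_num doubles, only the newly created half of the pgids change their result.

/* Stable modulo: a value x keeps its result as b grows toward
 * bmask+1, so doubling pg_num remaps only the new half of the
 * pgids rather than reshuffling everything. */
static inline int ceph_stable_mod(int x, int b, int bmask)
{
        if ((x & bmask) < b)
                return x & bmask;
        else
                return x & (bmask >> 1);
}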

> 2.2 New bucket type, Linear
>     We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. Since the pps values assigned to each host are a pseudo-random subset of the original permutation and may not be uniform, applying the modulo operation directly to the integers in that subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before the modulo step.
>     For osd failure and recovery, it is assumed that items nested in this kind of bucket will not be removed and no new items added, the same as for the UNIFORM bucket. A linear host will therefore not introduce more data movement than a uniform bucket.
> 2.3 Adaptive Strategy
>     Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make this an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, so as to obtain the most uniform distribution.
>     1) Try different balance_param values when preparing for a new pool
>        a. Iteratively call CRUSH(map, ruleno, x, balance_param) to get the corresponding PG distribution for each balance_param
>        b. Calculate the stdev of PG# among all osds
>        c. Choose the balance_param with the minimal stdev
>     2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
>     The adaptive procedure can be described as follows:
>     Input: cluster map, total PG number m, adaptive retry times n
>     Output: local optimal balance parameter balance_param
>
> min_pg_stdev = MAX; balance_param = a; // initial value
> for trial from 0 to n {
>     for pgid from 0 to m {
>         calculate pps using the new generator in 2.1;
>         for bucket b in cluster map // apply CRUSH algorithm
>             apply corresponding bucket hashing algorithm and get an osd list for pgid
>     }
>     calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
>     if pg_stdev_a < min_pg_stdev {
>         min_pg_stdev = pg_stdev_a;
>         balance_param = a;
>     }
>     adjust a to a new value;
> }

I see the core placement is basically just x % n.  But there is the 
balance_param value (which is an integer value in the range 1..5?).  I 
don't really understand intuitively what this is accomplishing.  Is the 
goal just to have a different permutation and pick the best of 5?  Or is 
it specifically dividing the raw x so that it is squished into a narrower 
range that is accomplishing a more balanced distribution?  I'm hoping the 
goal is just another permutation, because then we can modify x in some 
other way *prior* to feeding it into CRUSH and we can avoid duplicating 
half of the code in mapper.c just to pass down the extra argument.

Thanks!
sage


> Evaluation:
>     We evaluated the performance of the optimized and the current CRUSH in a cluster of 4 hosts, each with 10x1TB disks attached. We designed two test cases: the first created a pool with 2048 pgs and 2 replicas, prepared 100 million 128KB objects, and then evaluated read performance on these objects; the second created a pool with 2048 pgs and 3 replicas, prepared 1 million 10MB objects, and then evaluated read performance on these objects.
>     We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
> 1.PG and data distribution is more balanced using the optimized CRUSH algorithm
>     a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
>     b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
> 2.Large-scale performance is improved since data distribution is more balanced
>     a) More than 10% performance improvement for 128KB and 10MB reads
>     b) Write performance is not impacted
> Detailed performance data can be found in the http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf .
> 
> We also created a pull request: https://github.com/ceph/ceph/pull/2402 
> 
> 
> Thanks
> Jian
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Wednesday, September 10, 2014 9:06 AM
> To: Zhang, Jian
> Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
> Subject: RE: FW: CURSH optimization for unbalanced pg distribution
> 
> The lists are rejecting the email because of the big attachments.  Send with links instead?
> 
> On Tue, 9 Sep 2014, Zhang, Jian wrote:
> 
> > Yujie sent out the following email yesterday, but it seems it was missed. Resending it. 
> > 
> > =============
> > Hi all,
> >     Several months ago we hit a read performance issue (17% degradation) when working on ceph object storage performance evaluation with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among all osd disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
> > 
> > Key Message:
> >     As mentioned in the attached pdf, we described possible optimization proposals for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979) . Sage suggested we take the idea of "Change placement strategy only for step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module. Please refer to the Design and Implementation part for details. We also wrote a POC for it; see the attached patch. As a result, we got more than 10% read performance improvement using the optimized CRUSH algorithm.
> > 
> > Design and Implementation:
> > 1.   Problem Identification
> > 1.1  Input key (pps) space of CRUSH is not uniform
> >     Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes a raw pg to a pps.
> > 1.2  Algorithm of selecting items from buckets is not uniform
> >     Once we have a uniform input key space, we should also make the procedure of selecting devices from a host uniform. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach this goal, we decided to add a new bucket type and apply a new (modulo-based) hash algorithm to it.
> > 2.   Design
> > 2.1  New pps hash algorithm
> >     We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf) . It defines a bijection between the original sequence {0, ...,2^N-1} and some permutation of it. In other words, given different keys between 0 and 2^N-1, the generator will produce different integers, but within the same range {0,...,2^N-1}. 
> >     Assume there are np PGs in a pool; we can regard pgid (0≤pgid<2^n, np≤2^n<2*np) as the key, which is then hashed into a pps value between 0 and 2^n-1. Since the PG# in a pool is usually a power of two, in this case the generator just shuffles the original pgid sequence, making the key space a permutation of {0,...,2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed of the generator, producing different pps values even for the same pgid under different poolids. Therefore, the pgid sequences of different pools are mapped into distinct pps sequences, getting rid of PG overlapping.
> > 2.2  New bucket type, Linear
> >     We introduce a new bucket type called "linear" and apply a new modulo-based hash algorithm to it. Since the pps values assigned to each host are a pseudo-random subset of the original permutation and may not be uniform, applying the modulo operation directly to the integers in that subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before the modulo step.
> >     For osd failure and recovery, it is assumed that items nested in this kind of bucket will not be removed and no new items added, the same as for the UNIFORM bucket. A linear host will therefore not introduce more data movement than a uniform bucket.
> > 2.3  Adaptive Strategy
> >     Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make this an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, so as to obtain the most uniform distribution.
> > 1) Try different balance_param values when preparing for a new pool
> >    - Iteratively call CRUSH(map, ruleno, x, balance_param) to get the
> > corresponding PG distribution for each balance_param
> >    - Calculate the stdev of PG# among all osds
> >    - Choose the balance_param with the minimal stdev
> > 2) Add a member variable to the pool struct pg_pool_t to save the best
> > balance_param value
> > The adaptive procedure can be described as follows:
> > Input: cluster map, total PG number m, adaptive retry times n
> > Output: local optimal balance parameter balance_param
> >
> > min_pg_stdev = MAX; balance_param = a; // initial value
> > for trial from 0 to n {
> >     for pgid from 0 to m {
> >         calculate pps using the new generator in 2.1;
> >         for bucket b in cluster map // apply CRUSH algorithm
> >             apply corresponding bucket hashing algorithm and get an osd list for pgid
> >     }
> >     calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
> >     if pg_stdev_a < min_pg_stdev {
> >         min_pg_stdev = pg_stdev_a;
> >         balance_param = a;
> >     }
> >     adjust a to a new value;
> > }
> > 
> > 
> > Evaluation:
> >     We evaluated the performance of the optimized and the current CRUSH in a cluster of 4 hosts, each with 10x1TB disks attached. We designed two test cases: the first created a pool with 2048 pgs and 2 replicas, prepared 100 million 128KB objects, and then evaluated read performance on these objects; the second created a pool with 2048 pgs and 3 replicas, prepared 1 million 10MB objects, and then evaluated read performance on these objects.
> >     We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
> > 1.   PG and data distribution is more balanced using the optimized CRUSH algorithm
> >      a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases
> > from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40
> > osds decreases from 10.09 to 6.50
> >      b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
> > 2.   Large-scale performance is improved since data distribution is more balanced
> >      a) More than 10% performance improvement for 128KB and 10MB reads
> >      b) Write performance is not impacted
> > Detailed performance data can be found in the attached pdf (crush_optimization).
> > 
> > We also created a pull request: https://github.com/ceph/ceph/pull/2402
> > 
> > Thanks
> > Jian
> > 
> > 
> > -----Original Message-----
> > From: Loic Dachary [mailto:loic@dachary.org]
> > Sent: Tuesday, September 09, 2014 9:36 PM
> > To: Zhang, Jian; ceph-devel@vger.kernel.org
> > Subject: Re: FW: CURSH optimization for unbalanced pg distribution
> > 
> > 
> > 
> > On 20/03/2014 04:54, Zhang, Jian wrote:
> > > Forwarding per Sage's suggestion. 
> > 
> > Very interesting discussion :-) For the record the corresponding pull 
> > request is https://github.com/ceph/ceph/pull/2402
> > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@inktank.com]
> > > Sent: Wednesday, March 19, 2014 11:29 PM
> > > To: Mark Nelson
> > > Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
> > > Subject: Re: CURSH optimization for unbalanced pg distribution
> > > 
> > > On Wed, 19 Mar 2014, Mark Nelson wrote:
> > >> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
> > >>> For more detail data, please refer to the *Testing results* part.
> > >>>
> > >>> *Optimization proposals: *
> > >>>
> > >>> After we dived into the source code of CRUSH and related papers, 
> > >>> we proposed two possible optimizations:
> > >>>
> > >>> 1.Add different hash algorithms, as an alternative to the
> > >>> Jenkins hash, e.g. an algorithm that will produce even values when
> > >>> the range of the input value (pg#) is relatively small. Or add a new bucket
> > >>> type at the same time if necessary.
> > > 
> > > This *might* work, but I don't have a strong intuition about it.  The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent imbalance for low values of n (num pgs in our case).  I have generally assumed we can't do better than "random", and still have the other properties we want (independent, deterministic placement), but it may be possible.
> > > 
> > >>>
> > >>> 2.Find a better replica placement strategy instead of current 
> > >>> retry logic of crush_choose_firstn, which may cause CRUSH to behave badly.
> > >>>
> > >>> We find there are several thresholds of retry times by referring to the
> > >>> code: choose_total_tries, choose_local_tries and choose_local_fallback_tries.
> > >>> They are used to decide whether to do a retry_bucket, 
> > >>> retry_descent or use permutation to do an exhaustive bucket 
> > >>> search. We are wondering if there is another retry strategy:
> > >>>
> > >>> a)Backtracking retry. Now the logic of crush_choose_firstn can
> > >>> only issue a retry either from the initial bucket (retry_descent)
> > >>> or from the current bucket (retry_bucket); how about retrying the intervening buckets?
> > >>>
> > >>> b)Adjust thresholds of retry times by other values. We did notice
> > >>> that the 'optimal' crush tunables could be used for this, but we
> > >>> still encounter unbalanced pg distribution using the optimal strategy.
> > >>> Please refer to 4 of the Testing results part.
> > >>>
> > >>> c)Add a mechanism that can adjust the above-mentioned thresholds
> > >>> adaptively. Maybe we can record the retry times of the previous 
> > >>> call for CRUSH, and adjust retry thresholds automatically according to the record.
> > > 
> > > I suggest ignoring all of this retry logic.  The original version of 
> > > CRUSH has the local retries to try to make data move "less far", but 
> > > when we went back a year ago and did a statistical analysis of the 
> > > distribution we found that *all* of these hacks degraded the quality 
> > > of the placement, and by turning them all off (setting the 'optimal'
> > > values which zeroes them all out except for total_retries) we got
> > > something that was indistinguishable from a uniform distribution.
> > > 
> > >>> 3.Add soft link for pg directories. During pg creation, we can 
> > >>> create soft links for the pgs if pg# on the selected osd is more 
> > >>> than some threshold, say 10% more than desired average number, to 
> > >>> move objects that will be stored in this pg to another osd. 
> > >>> Balanced disk utilization may be gained in this way.
> > > 
> > > I think you need to be careful, but yes, this is an option.  There 
> > > is a similar exception mechanism in place that is used for other 
> > > purposes and something similar could be done here.  The main 
> > > challenge will be in ensuring that the soft links/exceptions follow 
> > > the same overall policy that CRUSH does after the raw mapping is 
> > > performed.  This is an option, but I would put it toward the bottom of the list...
> > > 
> > >>> 4.Change placement strategy only for step of selecting devices 
> > >>> from hosts. We found in our testing results that pg distribution 
> > >>> was balanced among hosts, which is reasonable since pg# of each 
> > >>> host is above 1K (according to the current BKM that pg# per osd 
> > >>> should be about 100). So how about we apply CRUSH only on the 
> > >>> internal buckets and find another simple but more balanced method to choose an osd from the host?
> > > 
> > > This idea has a lot of potential.  For example:
> > > 
> > > If you know the chassis can hold 12 disks, you can force the bucket 
> > > size to twelve and somehow prevent users from adjusting the 
> > > structure of the tree.  Then you can use a simple mapping that is 
> > > truly flat (like a linear mapping, disk = x % num_disks) for that bucket/subtree.
> > > The downside of course is that if you remove a disk *everything* 
> > > reshuffles, hence the need for some sort of guardrails to prevent a user from
> > > inadvertently doing that.  If a disk *does* fail, you just need to
> > > make sure the disk is marked "out" but not removed from the CRUSH hierarchy and the normal retry will kick in.
> > > 
> > > Note that all this is really doing is increasing the size of the "buckets"
> > > that we are (pseudo)randomly distributing over.  It is still a
> > > random/uniform distribution, but the N value is 12 times bigger (for 
> > > a
> > > 12 disk chassis) and as a result the variance is substantially lower.
> > > 
> > > I would suggest making a new bucket type that is called 'linear' and 
> > > does a simple modulo and trying this out.  We will need a bunch of 
> > > additional safety checks to help users avoid doing silly things 
> > > (like adjusting the number of items in the linear buckets, which 
> > > reshuffles
> > > everything) but that wouldn't be needed for an initial analysis of the performance impact.
> > > 
> > > Do you mind if we shift this thread over to ceph-devel?  I think 
> > > there are lots of people who would be interested in this discussion.  
> > > We can of course leave off your attachment if you prefer.
> > > 
> > > Thanks!
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > 
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: FW: CURSH optimization for unbalanced pg distribution
  2014-09-12  4:41               ` Sage Weil
@ 2014-09-12  8:20                 ` He, Yujie
  0 siblings, 0 replies; 9+ messages in thread
From: He, Yujie @ 2014-09-12  8:20 UTC (permalink / raw)
  To: Sage Weil, Zhang, Jian; +Cc: Loic Dachary, ceph-devel

Hi sage,
Thanks a lot for the comments!

About the pps part: the multiplier 1033 is arbitrary as long as it is congruent to 1 mod 4, and the modulus m does not need to be prime, though pg_num_mask+1 (2^n) may make it faster.
The formula just reshuffles the original pgid sequence into a permutation of itself, and the poolid that really matters is not the one added at the end, but the one used before the modulo operation. I did some tests just now and found that it does introduce pg overlaps, since the step between the pps values of two consecutive pgs is determined by the same constant (e.g. 1033) in all pools. But I wonder if we can replace the constant 1033 with e.g. 4*poolid+1, making pps = ((4*poolid+1)*stable + 2*pg.pool() + 1) % (pg_num_mask+1) + pg.pool(), in which case the step of the pps sequence is determined by a pool's own poolid and varies between pools.
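
A sketch of that pool-dependent variant (illustrative only; since 4*poolid+1 is always odd, the map is still a bijection mod 2^n):

/* Variant of the pps generator where the multiplier depends on the
 * pool: 4*poolid + 1 is odd (and == 1 mod 4), so this stays a
 * bijection on {0, ..., pg_num_mask} while the stride between
 * consecutive pgids now differs from pool to pool. */
static unsigned pg_to_pps_poolmul(unsigned pgid, unsigned poolid,
                                  unsigned pg_num_mask)
{
        unsigned stable = pgid & pg_num_mask;
        unsigned mul = 4 * poolid + 1;
        unsigned pps = (mul * stable + 2 * poolid + 1) & pg_num_mask;
        return pps + poolid;
}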

And yes, the pgid-to-pps mapping is not stable when increasing the pg# of an existing pool.

For the balance_param, we actually use it for both goals. Since we cannot decide in advance which value suits a certain pool, given the cluster topology, pool size and pg number, we need to pick the best one after trying several.

Thanks,
Yujie

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Friday, September 12, 2014 12:42 PM
To: Zhang, Jian
Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
Subject: RE: FW: CURSH optimization for unbalanced pg distribution

Hi,

This is pretty exciting.  I haven't read through all of it, but have some initial comments on the pps mapping portion.

On Wed, 10 Sep 2014, Zhang, Jian wrote:
> Thanks. 
> 
> Created a feature here: http://tracker.ceph.com/issues/9410, to include all the attachments:
> http://tracker.ceph.com/attachments/download/1383/adaptive-crush-modify.patch
> http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf
> http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf
> 
> 
> Hi all,
>     Several months ago we hit a read performance issue (17% degradation) while evaluating ceph object storage performance with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among the osd disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
> 
> Key Message:
>     As mentioned in the attached pdf, we described possible optimization proposals (http://tracker.ceph.com/attachments/download/1384/crush_proposals.pdf) for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage suggested we take the idea of "Change placement strategy only for the step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module; please refer to the Design and Implementation part for details. We also wrote a POC for it (see the attached patch), and as a result we got more than 10% read performance improvement using the optimized CRUSH algorithm.
> 
> Design and Implementation:
> 1.Problem Identification
> 1.1 Input key (pps) space of CRUSH is not uniform
>     Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes raw pg to pps.
> 1.2 Algorithm of selecting items from buckets is not uniform
>     After we get a uniform input key space, we should make the procedure of selecting devices from a host uniform as well. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach that goal, we decided to add a new bucket type and apply a new (modulo based) hash algorithm to it.
> 2.Design
> 2.1New pps hash algorithm
>     We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a bijection between the original sequence {0, ..., 2^n-1} and some permutation of it. In other words, given different keys between 0 and 2^n-1, the generator will produce different integers, but within the same range {0, ..., 2^n-1}.
>     Assume there are np PGs in a pool; we can regard pgid (0 ≤ pgid < 2^n, np ≤ 2^n < 2*np) as the key, and it will be hashed into a pps value between 0 and 2^n-1. Since PG# in a pool is usually a power of two, the generator just shuffles the original pgid sequence in this case, making the key space a permutation of {0, ..., 2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed in the generator, producing different pps values even for the same pgid in different pools. Therefore, the pgid sequences of various pools are mapped into distinct pps sequences, getting rid of PG overlapping.

I made a few comments on github at

	https://github.com/ceph/ceph/pull/2402/files#r17462015

I have some questions about the underlying math.  If this is similar to the approach used by the uniform buckets, I think 1033 needs to be > the denominator?  Also, I looked a bit at the referenced paper and I think the denominator should be prime, not 2^n-1 (pg_num_mask).

My other concern is with raw_pg_to_congruential_pps.  Adding poolid into the numerator before you do the modulo means that each pool has a different permutation.  But, if you have two pools both with (say) 1024 PGs, they will map to the same 1024 outputs (0..1023).  The pool is added in to the final pps, but this doesn't really help as it only means a handful of PGs get unique mappings... and they'll overlap with the next pool.  This is exactly the problem we were solving with the HASHPSPOOL flag.  Perhaps adding a pseudorandom value between 0 and 2^32 based on the poolid will (usually) give the pools distinct output ranges, and the linear mapping will still be happy with that (since the inputs for each pool live in a contiguous range).
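A rough sketch of that idea, with an illustrative integer mixer (any well-distributed 32-bit hash of the poolid would do):

#include <stdint.h>

/* Illustrative only: shift each pool's packed pps range by a
 * pseudorandom per-pool offset so pools (usually) occupy distinct,
 * still-contiguous output ranges. */
static uint32_t mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

static uint32_t pps_with_pool_offset(uint32_t packed_pps, uint32_t poolid)
{
    return packed_pps + mix32(poolid);
}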

In any case, though, yes: this general approach will mean that the pps values live in a packed range instead of being spread uniformly across the
0..2^32 range.

The other concern I have is whether the pgid -> pps mapping is stable when pg_num is adjusted up.  Specifically, what we want is that when moving from pg_num to pg_num * 2, pg_num of the original inputs will keep the same output pps value, while the other half will get a new value.  It doesn't seem like this is true for this strategy.  That may be a tradeoff the user is willing to make, but we'll need to be very careful about making that apparent to the user.. it means that bumping pg_num will reshuffle all (not just half) of their data for each power of 2.
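A toy example makes the concern concrete (arbitrary constants, plain congruential map assumed): when the modulus doubles, most of the original pgids land on different pps values.

#include <stdio.h>

int main(void)
{
    unsigned a = 1033, c = 3;            /* arbitrary example constants */
    unsigned x;

    for (x = 0; x < 4; x++)              /* original pg_num = 4 */
        printf("pg %u: pps %u -> %u\n", x,
               (a * x + c) % 4,          /* before doubling pg_num */
               (a * x + c) % 8);         /* after doubling to 8 */
    return 0;
}

Here three of the four original pgids change their pps, so roughly all (not half) of the data would move.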

> 2.2 New bucket type, Linear
>     We introduce a new bucket type called "linear" and apply a new modulo based hash algorithm to it. The pps values assigned to each host are a pseudo-random subset of the original permutation and may be non-uniform, in which case applying the modulo operation directly to the integers in the subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before conducting the modulo method.
>     For osd failure and recovery, it assumes that items nested in this kind of bucket will not be removed, nor new items added, the same as the UNIFORM bucket. A linear host will not introduce more data movement than the uniform bucket.
> 2.3 Adaptive Strategy
>     Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make it an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, in order to obtain the most uniform distribution.
>     1) Try different balance_param values when preparing a new pool
>        a. Iteratively call CRUSH(map, ruleno, x, balance_param) to get the corresponding PG distribution for each balance_param
>        b. Calculate the stdev of PG# among all osds (a helper sketch follows the pseudocode below)
>        c. Choose the balance_param with the minimal stdev
>     2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
>     The adaptive procedure can be described as follows:
>     Input: cluster map, total PG number m, adaptive retry times n
>     Output: local optimal balance parameter balance_param
>     min_pg_stdev = MAX;
>     balance_param = a; // initial value
>     for trial from 0 to n {
>         for pgid from 0 to m {
>             calculate pps using the new generator in 2.1;
>             for bucket b in cluster map // apply CRUSH algorithm
>                 apply the corresponding bucket hashing algorithm and get an osd list for pgid
>         }
>         calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
>         if pg_stdev_a < min_pg_stdev {
>             min_pg_stdev = pg_stdev_a;
>             balance_param = a;
>         }
>         adjust a to a new value;
>     }
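For step b above, a small self-contained helper could score each candidate; this is a sketch, assuming the per-osd PG counts have already been gathered from the osd lists:

#include <math.h>

/* Sketch: stdev of PG counts across OSDs, used to score each
 * balance_param candidate in the adaptive search. */
static double pg_stdev(const int *pg_per_osd, int num_osd)
{
    double mean = 0.0, var = 0.0;
    int i;

    for (i = 0; i < num_osd; i++)
        mean += pg_per_osd[i];
    mean /= num_osd;
    for (i = 0; i < num_osd; i++)
        var += (pg_per_osd[i] - mean) * (pg_per_osd[i] - mean);
    return sqrt(var / num_osd);
}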

I see the core placement is basically just x % n.  But there is the balance_param value (which is an integer value in the range 1..5?).  I don't really understand intuitively what this is accomplishing.  Is the goal just to have a different permutation and pick the best of 5?  Or is it specifically dividing the raw x so that it is squished into a narrower range, accomplishing a more balanced distribution?  I'm hoping the goal is just another permutation, because then we can modify x in some other way *prior* to feeding it into CRUSH and we can avoid duplicating half of the code in mapper.c just to pass down the extra argument.
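For reference, from the description in 2.2 the linear selection presumably reduces to something like this sketch (names and the exact scaling are assumptions):

#include <stdint.h>

/* Sketch of the proposed 'linear' bucket choose step: scale the key
 * down by balance_param (>= 1) to smooth the pseudo-random subset of
 * pps values a host sees, then take a plain modulo over the bucket
 * size. */
static int linear_choose(uint32_t pps, uint32_t balance_param, int num_items)
{
    return (int)((pps / balance_param) % (uint32_t)num_items);
}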

Thanks!
sage


> Evaluation:
>     We evaluated the performance of the optimized and the current CRUSH in a cluster consisting of 4 hosts, each with 10x1TB disks. We designed two test cases: the first creates a pool with 2048 pgs and 2 replicas, prepares 100 million 128KB objects, and then evaluates read performance on these objects; the second creates a pool with 2048 pgs and 3 replicas, prepares 1 million 10MB objects, and then evaluates read performance on these objects.
>     We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
> 1.PG and data distribution is more balanced using the optimized CRUSH algorithm
>     a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
>     b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
> 2.Large-scale performance is improved since data distribution is more balanced
>     a) More than 10% performance improvement for 128K and 10M read
>     b) Write performance not impacted
> Detailed performance data can be found at http://tracker.ceph.com/attachments/download/1385/crush_optimization.pdf .
> 
> We also created a pull request: https://github.com/ceph/ceph/pull/2402 
> 
> 
> Thanks
> Jian
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com] 
> Sent: Wednesday, September 10, 2014 9:06 AM
> To: Zhang, Jian
> Cc: Loic Dachary; ceph-devel@vger.kernel.org; He, Yujie
> Subject: RE: FW: CURSH optimization for unbalanced pg distribution
> 
> The lists are rejecting the email because of the big attachments.  Send with links instead?
> 
> On Tue, 9 Sep 2014, Zhang, Jian wrote:
> 
> > Yujie sent out the following email yesterday, but it seems it was missed. Resending it. 
> > 
> > =============
> > Hi all,
> >     Several months ago we hit a read performance issue (17% degradation) while evaluating ceph object storage performance with 10M objects (scaling from 10K objects to 1 million objects), and found the root cause to be unbalanced pg distribution among the osd disks, leading to unbalanced data distribution. We did some further investigation and identified that CRUSH failed to map pgs evenly to each osd. Please refer to the attached pdf (crush_proposals) for details.
> > 
> > Key Message:
> >     As mentioned in the attached pdf, we described possible optimization proposals for CRUSH and got some feedback from the community (http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/18979). Sage suggested we take the idea of "Change placement strategy only for the step of selecting devices from hosts", by adding a new bucket type called "linear" and applying a modulo-like hash function to this kind of bucket to achieve balanced distribution. We followed this suggestion and designed an optimized CRUSH algorithm, with new hash methods and an adaptive module; please refer to the Design and Implementation part for details. We also wrote a POC for it (see the attached patch), and as a result we got more than 10% read performance improvement using the optimized CRUSH algorithm.
> > 
> > Design and Implementation:
> > 1.   Problem Identification
> > 1.1  Input key (pps) space of CRUSH is not uniform
> >     Since PG# on the nested devices of a host is not uniform even if we select the device using a simple modulo operation, we decided to change the algorithm that hashes raw pg to pps.
> > 1.2  Algorithm of selecting items from buckets is not uniform
> >     After we get a uniform input key space, we should make the procedure of selecting devices from a host uniform as well. Since the current CRUSH algorithm uses Jenkins-hash-based strategies and fails to reach that goal, we decided to add a new bucket type and apply a new (modulo based) hash algorithm to it.
> > 2.   Design
> > 2.1  New pps hash algorithm
> >     We design the new pps hash algorithm based on the "Congruential pseudo-random number generator" (http://comjnl.oxfordjournals.org/content/10/1/74.full.pdf). It defines a bijection between the original sequence {0, ..., 2^n-1} and some permutation of it. In other words, given different keys between 0 and 2^n-1, the generator will produce different integers, but within the same range {0, ..., 2^n-1}.
> >     Assume there are np PGs in a pool; we can regard pgid (0 ≤ pgid < 2^n, np ≤ 2^n < 2*np) as the key, and it will be hashed into a pps value between 0 and 2^n-1. Since PG# in a pool is usually a power of two, the generator just shuffles the original pgid sequence in this case, making the key space a permutation of {0, ..., 2^n-1}, which achieves the best uniformity. Moreover, poolid can be regarded as a seed in the generator, producing different pps values even for the same pgid in different pools. Therefore, the pgid sequences of various pools are mapped into distinct pps sequences, getting rid of PG overlapping.
> > 2.2  New bucket type, Linear
> >     We introduce a new bucket type called "linear" and apply a new modulo based hash algorithm to it. The pps values assigned to each host are a pseudo-random subset of the original permutation and may be non-uniform, in which case applying the modulo operation directly to the integers in the subset cannot produce a balanced distribution among the disks in the host. To decrease the deviation of the subset, we apply a balance parameter 1/balance_param to the key before conducting the modulo method.
> >     For osd failure and recovery, it assumes that items nested in this kind of bucket will not be removed, nor new items added, the same as the UNIFORM bucket. A linear host will not introduce more data movement than the uniform bucket.
> > 2.3  Adaptive Strategy
> >     Since there is no single constant balance parameter that yields the best PG distribution in all cases, we make it an adaptive procedure: the balance parameter is adjusted automatically while preparing to create a new pool, according to the cluster topology, PG# and replica#, in order to obtain the most uniform distribution.
> >     1) Try different balance_param values when preparing a new pool
> >        - Iteratively call CRUSH(map, ruleno, x, balance_param) to get the corresponding PG distribution for each balance_param
> >        - Calculate the stdev of PG# among all osds
> >        - Choose the balance_param with the minimal stdev
> >     2) Add a member variable to the pool struct pg_pool_t to save the best balance_param value
> >     The adaptive procedure can be described as follows:
> >     Input: cluster map, total PG number m, adaptive retry times n
> >     Output: local optimal balance parameter balance_param
> >     min_pg_stdev = MAX;
> >     balance_param = a; // initial value
> >     for trial from 0 to n {
> >         for pgid from 0 to m {
> >             calculate pps using the new generator in 2.1;
> >             for bucket b in cluster map // apply CRUSH algorithm
> >                 apply the corresponding bucket hashing algorithm and get an osd list for pgid
> >         }
> >         calculate pg_stdev_a by gathering all osd lists; // stdev of PG distribution among all osds
> >         if pg_stdev_a < min_pg_stdev {
> >             min_pg_stdev = pg_stdev_a;
> >             balance_param = a;
> >         }
> >         adjust a to a new value;
> >     }
> > 
> > 
> > Evaluation:
> >     We evaluated the performance of the optimized and the current CRUSH in a cluster consisting of 4 hosts, each with 10x1TB disks. We designed two test cases: the first creates a pool with 2048 pgs and 2 replicas, prepares 100 million 128KB objects, and then evaluates read performance on these objects; the second creates a pool with 2048 pgs and 3 replicas, prepares 1 million 10MB objects, and then evaluates read performance on these objects.
> >     We compared the PG & data distribution and read performance of the two CRUSH algorithms, and got results as follows:
> > 1.   PG and data distribution is more balanced using the optimized CRUSH algorithm
> >      a) For 2048 PGs with 3 replicas, stdev of PG# on 40 osds decreases from 12.13 to 5.17; for 2048 PGs with 2 replicas, stdev of PG# on 40 osds decreases from 10.09 to 6.50
> >      b) For 1 million 10MB objects with 3 replicas, stdev of disk use% on 40 osds decreases from 0.068 to 0.026; for 100 million 128KB objects with 2 replicas, stdev of disk use% on 40 osds decreases from 0.067 to 0.042
> > 2.   Large-scale performance is improved since data distribution is more balanced
> >      a) More than 10% performance improvement for 128K and 10M read
> >      b) Write performance not impacted
> > Detailed performance data can be found in the attached pdf (crush_optimization).
> > 
> > We also created a pull request: https://github.com/ceph/ceph/pull/2402
> > 
> > Thanks
> > Jian
> > 
> > 
> > -----Original Message-----
> > From: Loic Dachary [mailto:loic@dachary.org]
> > Sent: Tuesday, September 09, 2014 9:36 PM
> > To: Zhang, Jian; ceph-devel@vger.kernel.org
> > Subject: Re: FW: CURSH optimization for unbalanced pg distribution
> > 
> > 
> > 
> > On 20/03/2014 04:54, Zhang, Jian wrote:
> > > Forwarding per Sage's suggestion. 
> > 
> > Very interesting discussion :-) For the record the corresponding pull 
> > request is https://github.com/ceph/ceph/pull/2402
> > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@inktank.com]
> > > Sent: Wednesday, March 19, 2014 11:29 PM
> > > To: Mark Nelson
> > > Cc: Zhang, Jian; Duan, Jiangang; He, Yujie
> > > Subject: Re: CURSH optimization for unbalanced pg distribution
> > > 
> > > On Wed, 19 Mar 2014, Mark Nelson wrote:
> > >> On 03/19/2014 03:24 AM, Zhang, Jian wrote:
> > >>> For more detail data, please refer to the *Testing results* part.
> > >>>
> > >>> *Optimization proposals: *
> > >>>
> > >>> After we dived into the source code of CRUSH and related papers, 
> > >>> we proposed two possible optimizations:
> > >>>
> > >>> 1.Add different hash algorithms, as an alternative to the 
> > >>> Jenkins hash, e.g. an algorithm that will produce even values when 
> > >>> the range of input values (pg#) is relatively small. Or add a new bucket 
> > >>> type at the same time if necessary.
> > > 
> > > This *might* work, but I don't have a strong intuition about it.  The modeling we've done now has essentially assumed a statistically uniform distribution, which has some inherent imbalance for low values of n (num pgs in our case).  I have generally assumed we can't do better than "random", and still have the other properties we want (independent, deterministic placement), but it may be possible.
> > > 
> > >>>
> > >>> 2.Find a better replica placement strategy instead of current 
> > >>> retry logic of crush_choose_firstn, which may cause CRUSH to behave badly.
> > >>>
> > >>> We find there are several thresholds of retry times by referring to 
> > >>> the code: choose_total_tries, choose_local_tries and choose_local_fallback_tries.
> > >>> They are used to decide whether to do a retry_bucket, 
> > >>> retry_descent or use permutation to do an exhaustive bucket 
> > >>> search. We are wondering if there is another retry strategy:
> > >>>
> > >>> a)Backtracking retry. Now the logic of crush_choose_firstn can 
> > >>> only issue a retry either from the initial bucket (retry_descent) 
> > >>> or from the current bucket (retry_bucket); how about retrying the intervening buckets?
> > >>>
> > >>> b)Adjust threshold of retry times by other values. We did notice 
> > >>> that the 'optimal' crush tunable could be used to make it, but we 
> > >>> still encounter unbalanced pg distribution by using the optimal strategy.
> > >>> Please refer to 4 of the Testing results part.
> > >>>
> > >>> c)Add a mechanism that can adjust the above mentioned thresholds 
> > >>> adaptively. Maybe we can record the retry times of the previous 
> > >>> call for CRUSH, and adjust retry thresholds automatically according to the record.
> > > 
> > > I suggest ignoring all of this retry logic.  The original version of 
> > > CRUSH has the local retries to try to make data move "less far", but 
> > > when we went back a year ago and did a statistical analysis of the 
> > > distribution we found that *all* of these hacks degraded the quality 
> > > of the placement, and by turning them all off (setting the 'optimal'
> > > values which zero them all out except for total_retries) we got 
> > > something that was indistinguishable from a uniform distribution.
> > > 
> > >>> 3.Add soft link for pg directories. During pg creation, we can 
> > >>> create soft links for the pgs if pg# on the selected osd is more 
> > >>> than some threshold, say 10% more than desired average number, to 
> > >>> move objects that will be stored in this pg to another osd. 
> > >>> Balanced disk utilization may be gained in this way.
> > > 
> > > I think you need to be careful, but yes, this is an option.  There 
> > > is a similar exception mechanism in place that is used for other 
> > > purposes and something similar could be done here.  The main 
> > > challenge will be in ensuring that the soft links/exceptions follow 
> > > the same overall policy that CRUSH does after the raw mapping is 
> > > performed.  This is an option, but I would put it toward the bottom of the list...
> > > 
> > >>> 4.Change placement strategy only for step of selecting devices 
> > >>> from hosts. We found in our testing results that pg distribution 
> > >>> was balanced among hosts, which is reasonable since pg# of each 
> > >>> host is above 1K (according to the current BKM that pg# per osd 
> > >>> should be about 100). So how about we apply CRUSH only on the 
> > >>> interval buckets and find another simple but more balanced method to choose osd from host?
> > > 
> > > This idea has a lot of potential.  For example:
> > > 
> > > If you know the chassis can hold 12 disks, you can force the bucket 
> > > size to twelve and somehow prevent users from adjusting the 
> > > structure of the tree.  Then you can use a simple mapping that is 
> > > truly flat (like a linear mapping, disk = x % num_disks) for that bucket/subtree.
> > > The downside of course is that if you remove a disk *everything* 
> > > reshuffles, hence some sort of guardrails to prevent a user from 
> > > inadvertently doing that.  If a disk *does* fail, you just need to 
> > > make sure the disk is marked "out" but not removed from the CRUSH hierarchy and the normal retry will kick in.
> > > 
> > > Note that all this is really doing is increasing the size of the "buckets" 
> > > that we are (pseudo)randomly distributing over.  It is still a 
> > > random/uniform distribution, but the N value is 12 times bigger (for 
> > > a 12 disk chassis) and as a result the variance is substantially lower.
> > > 
> > > I would suggest making a new bucket type that is called 'linear' and 
> > > does a simple modulo and trying this out.  We will need a bunch of 
> > > additional safety checks to help users avoid doing silly things 
> > > (like adjusting the number of items in the linear buckets, which 
> > > reshuffles everything) but that wouldn't be needed for an initial analysis of the performance impact.
> > > 
> > > Do you mind if we shift this thread over to ceph-devel?  I think 
> > > there are lots of people who would be interested in this discussion.  
> > > We can of course leave off your attachment if you prefer.
> > > 
> > > Thanks!
> > > sage
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > in the body of a message to majordomo@vger.kernel.org More majordomo 
> > > info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > 
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2014-09-12  8:21 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <51FC7A40FB29414D88A121A7FFEF9A4710DC2188@SHSMSX104.ccr.corp.intel.com>
     [not found] ` <53299E33.5060809@inktank.com>
     [not found]   ` <alpine.DEB.2.00.1403190814520.316@cobra.newdream.net>
2014-03-20  3:54     ` FW: CURSH optimization for unbalanced pg distribution Zhang, Jian
2014-09-09 13:36       ` Loic Dachary
2014-09-10  0:56         ` FW: " Zhang, Jian
     [not found]         ` <51FC7A40FB29414D88A121A7FFEF9A4710F570D6@SHSMSX104.ccr.corp.intel.com>
2014-09-10  1:06           ` Sage Weil
2014-09-10  1:32             ` Zhang, Jian
2014-09-10  2:16               ` Mark Nelson
2014-09-10  2:56                 ` He, Yujie
2014-09-12  4:41               ` Sage Weil
2014-09-12  8:20                 ` He, Yujie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.