From: Sage Weil
Subject: crush multipick anomaly
Date: Thu, 26 Jan 2017 03:05:54 +0000 (UTC)
To: ceph-devel@vger.kernel.org

This is a longstanding bug,

	http://tracker.ceph.com/issues/15653

that causes low-weighted devices to get more data than they should. Loic's recent activity resurrected discussion on the original PR

	https://github.com/ceph/ceph/pull/10218

but since it's closed and almost nobody will see it I'm moving the discussion here.

The main news is that I have a simple adjustment for the weights that works (almost perfectly) for the 2nd round of placements.  The solution is pretty simple, although as with most probabilities it tends to make my brain hurt.

The idea is that, on the second round, the original weight for the small OSD (call it P(pick small)) isn't what we should use.  Instead, we want P(pick small | first pick not small).  Since P(a|b) (the probability of a given b) is P(a && b) / P(b),

 P(pick small | first pick not small)
 = P(pick small && first pick not small) / P(first pick not small)

The last term is easy to calculate,

 P(first pick not small) = (total_weight - small_weight) / total_weight

and the && term is the distribution we're trying to produce.  For example, if small has 1/10 the weight, then we should see 1/10th of the PGs have their second replica be the small OSD.  So

 P(pick small && first pick not small) = small_weight / total_weight

Putting those together,

 P(pick small | first pick not small)
 = P(pick small && first pick not small) / P(first pick not small)
 = (small_weight / total_weight) / ((total_weight - small_weight) / total_weight)
 = small_weight / (total_weight - small_weight)

That is, on the second round, we should adjust the weights by the above so that we get the right distribution of second choices.  It turns out it works to adjust *all* weights like this to get the conditional probability that they weren't already chosen.

I have a branch that hacks this into straw2 and it appears to work properly for num_rep = 2.  With a test bucket of [99 99 99 99 4], and the current code, you get

 $ bin/crushtool -c cm.txt --test --show-utilization --min-x 0 --max-x 40000000 --num-rep 2
 rule 0 (data), x = 0..40000000, numrep = 2..2
 rule 0 (data) num_rep 2 result size == 2:  40000001/40000001
   device 0:  19765965  [9899364,9866601]
   device 1:  19768033  [9899444,9868589]
   device 2:  19769938  [9901770,9868168]
   device 3:  19766918  [9898851,9868067]
   device 6:  929148    [400572,528576]

which is very close for the first replica (primary), but way off for the second.
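In case it helps to see the mechanics, here is a toy Monte Carlo of that reweighting -- not the actual straw2 branch, just a sketch I'm making up for illustration, with made-up names.  The first pick draws from the raw weights; for the second pick every device i is reweighted to w_i / (total - w_i), and a draw that simply excludes the first device stands in for straw2's retry-on-collision (conditioning a draw on "not the first pick" is the same as excluding it and renormalizing):

/*
 * Toy Monte Carlo of the round-2 reweighting described above.
 * Not the straw2 patch itself; an exclusion draw replaces the
 * retry-on-collision loop, which gives the same distribution.
 */
#include <stdio.h>
#include <stdlib.h>

#define NDEV 5

/* weighted pick among devices, skipping 'skip' (-1 to skip nothing) */
static int pick(const double *w, int skip)
{
	double total = 0.0, r;
	int i;

	for (i = 0; i < NDEV; i++)
		if (i != skip)
			total += w[i];
	r = drand48() * total;
	for (i = 0; i < NDEV; i++) {
		if (i == skip)
			continue;
		if (r < w[i])
			return i;
		r -= w[i];
	}
	return NDEV - 1;  /* fp edge case; effectively unreachable */
}

int main(void)
{
	const double w[NDEV] = { 99, 99, 99, 99, 4 };
	double total = 0.0, adj[NDEV];
	long second[NDEV] = { 0 }, trials = 10000000, t;
	int i, first;

	srand48(12345);
	for (i = 0; i < NDEV; i++)
		total += w[i];
	/* conditional weights for round 2: w_i / (total - w_i) */
	for (i = 0; i < NDEV; i++)
		adj[i] = w[i] / (total - w[i]);

	for (t = 0; t < trials; t++) {
		first = pick(w, -1);            /* round 1: raw weights */
		second[pick(adj, first)]++;     /* round 2: adjusted, first excluded */
	}
	for (i = 0; i < NDEV; i++)
		printf("device %d: second-replica share %.5f (want %.5f)\n",
		       i, (double)second[i] / trials, w[i] / total);
	return 0;
}

Running something like this should land each device within about a percent of its target second-replica share, including the same small residual skew on the low-weighted device that shows up in the crushtool run below.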
With my hacky change,

 rule 0 (data), x = 0..40000000, numrep = 2..2
 rule 0 (data) num_rep 2 result size == 2:  40000001/40000001
   device 0:  19797315  [9899364,9897951]
   device 1:  19799199  [9899444,9899755]
   device 2:  19801016  [9901770,9899246]
   device 3:  19797906  [9898851,9899055]
   device 6:  804566    [400572,403994]

which is quite close, but still skewing slightly high (by a bit less than 1%).

Next steps:

1- generalize this for >2 replicas
2- figure out why it skews high
3- make this work for multi-level hierarchical descent

sage