* [PATCH 0/5] libceph: crush tunables5
@ 2016-02-03 14:23 Ilya Dryomov
2016-02-03 14:23 ` [PATCH 1/5] crush: ensure bucket id is valid before indexing buckets array Ilya Dryomov
From: Ilya Dryomov @ 2016-02-03 14:23 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
Hello,
This series adds support for the new chooseleaf_stable CRUSH tunable
and enables the TUNABLES5 feature bit. It's likely that the TUNABLES5 bit
will be shared and will also cover the request_redirect_t encoding in
MOSDOpReply and possibly the new file layouts, but those aren't in master yet.
Thanks,
Ilya
Ilya Dryomov (5):
crush: ensure bucket id is valid before indexing buckets array
crush: ensure take bucket value is valid
crush: add chooseleaf_stable tunable
crush: decode and initialize chooseleaf_stable
libceph: advertise support for TUNABLES5
include/linux/ceph/ceph_features.h | 13 ++++++++++++-
include/linux/crush/crush.h | 8 +++++++-
net/ceph/crush/mapper.c | 33 ++++++++++++++++++++++++++-------
net/ceph/osdmap.c | 19 ++++++++++++++-----
4 files changed, 59 insertions(+), 14 deletions(-)
--
2.4.3
* [PATCH 1/5] crush: ensure bucket id is valid before indexing buckets array
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
@ 2016-02-03 14:23 ` Ilya Dryomov
2016-02-03 14:23 ` [PATCH 2/5] crush: ensure take bucket value is valid Ilya Dryomov
From: Ilya Dryomov @ 2016-02-03 14:23 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
We were indexing the buckets array without verifying that the index was
within the [0,max_buckets) range. This could happen when a multistep
rule does not have enough buckets and produces CRUSH_ITEM_NONE as an
intermediate result, which would then be fed into the next step as a
bucket id and make us crash.
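A minimal standalone sketch of the check, for illustration only. bucket_index() is a hypothetical helper, not a function in mapper.c, and CRUSH_ITEM_NONE is redefined locally with the 0x7fffffff value from include/linux/crush/crush.h so the snippet compiles on its own. Bucket ids are negative (id -1 maps to buckets[0], -2 to buckets[1], and so on), and -1 - id is simply ~id, so it cannot overflow; that is why the index can be computed first and range-checked afterwards, as the patch does with bno.

```c
/* Local copy of the value defined in include/linux/crush/crush.h. */
#define CRUSH_ITEM_NONE 0x7fffffff

/*
 * Hypothetical helper: map a CRUSH item id to its index in the
 * buckets array, or return -1 if the id cannot name a bucket in a
 * map with @max_buckets slots.  -1 - id equals ~id, so no overflow.
 */
static int bucket_index(int id, int max_buckets)
{
	int bno = -1 - id;

	if (bno < 0 || bno >= max_buckets) {
		/* id is a device, CRUSH_ITEM_NONE, or past the end */
		return -1;
	}
	return bno;
}
```

Fed CRUSH_ITEM_NONE, the old expression -1-w[i] evaluates to INT_MIN and was used directly as an index; with the check, bucket_index(CRUSH_ITEM_NONE, n) returns -1 and the step is skipped.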
Reflects ceph.git commit 976a24a326da8931e689ee22fce35feab5b67b76.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
net/ceph/crush/mapper.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 393bfb22d5bb..97ecf6f262aa 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -888,6 +888,7 @@ int crush_do_rule(const struct crush_map *map,
osize = 0;
for (i = 0; i < wsize; i++) {
+ int bno;
/*
* see CRUSH_N, CRUSH_N_MINUS macros.
* basically, numrep <= 0 means relative to
@@ -900,6 +901,13 @@ int crush_do_rule(const struct crush_map *map,
continue;
}
j = 0;
+ /* make sure bucket id is valid */
+ bno = -1 - w[i];
+ if (bno < 0 || bno >= map->max_buckets) {
+ /* w[i] is probably CRUSH_ITEM_NONE */
+ dprintk(" bad w[i] %d\n", w[i]);
+ continue;
+ }
if (firstn) {
int recurse_tries;
if (choose_leaf_tries)
@@ -911,7 +919,7 @@ int crush_do_rule(const struct crush_map *map,
recurse_tries = choose_tries;
osize += crush_choose_firstn(
map,
- map->buckets[-1-w[i]],
+ map->buckets[bno],
weight, weight_max,
x, numrep,
curstep->arg2,
@@ -930,7 +938,7 @@ int crush_do_rule(const struct crush_map *map,
numrep : (result_max-osize));
crush_choose_indep(
map,
- map->buckets[-1-w[i]],
+ map->buckets[bno],
weight, weight_max,
x, out_size, numrep,
curstep->arg2,
--
2.4.3
* [PATCH 2/5] crush: ensure take bucket value is valid
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
2016-02-03 14:23 ` [PATCH 1/5] crush: ensure bucket id is valid before indexing buckets array Ilya Dryomov
@ 2016-02-03 14:23 ` Ilya Dryomov
2016-02-03 14:23 ` [PATCH 3/5] crush: add chooseleaf_stable tunable Ilya Dryomov
From: Ilya Dryomov @ 2016-02-03 14:23 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
Ensure that the take argument is a valid bucket ID before indexing the
buckets array.
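To see why the extra -1-curstep->arg1 >= 0 condition is needed, take a non-negative arg1 that is not a valid device (arg1 >= max_devices): -1-arg1 is then negative, so the old test -1-arg1 < max_buckets passed and buckets[] was indexed with a negative value. The sketch below restates the fixed condition as a standalone function; take_arg_valid() is hypothetical, and the buckets array is modelled as plain pointers with NULL holes, as in struct crush_map.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical restatement of the CRUSH_RULE_TAKE check: @arg is
 * valid if it names a device in [0, max_devices) or an existing
 * (non-NULL) bucket whose index -1 - arg is in [0, max_buckets).
 */
static bool take_arg_valid(int arg, int max_devices,
			   void *const *buckets, int max_buckets)
{
	if (arg >= 0 && arg < max_devices)
		return true;			/* a device */

	/* the ">= 0" term is what the patch adds */
	return -1 - arg >= 0 && -1 - arg < max_buckets &&
	       buckets[-1 - arg] != NULL;
}
```

With max_devices = 3, take_arg_valid(3, 3, b, 2) now returns false instead of reading b[-4].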
Reflects ceph.git commit 93ec538e8a667699876b72459b8ad78966d89c61.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
net/ceph/crush/mapper.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index 97ecf6f262aa..abb700621e4a 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -835,7 +835,8 @@ int crush_do_rule(const struct crush_map *map,
case CRUSH_RULE_TAKE:
if ((curstep->arg1 >= 0 &&
curstep->arg1 < map->max_devices) ||
- (-1-curstep->arg1 < map->max_buckets &&
+ (-1-curstep->arg1 >= 0 &&
+ -1-curstep->arg1 < map->max_buckets &&
map->buckets[-1-curstep->arg1])) {
w[0] = curstep->arg1;
wsize = 1;
--
2.4.3
* [PATCH 3/5] crush: add chooseleaf_stable tunable
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
2016-02-03 14:23 ` [PATCH 1/5] crush: ensure bucket id is valid before indexing buckets array Ilya Dryomov
2016-02-03 14:23 ` [PATCH 2/5] crush: ensure take bucket value is valid Ilya Dryomov
@ 2016-02-03 14:23 ` Ilya Dryomov
2016-02-03 14:23 ` [PATCH 4/5] crush: decode and initialize chooseleaf_stable Ilya Dryomov
From: Ilya Dryomov @ 2016-02-03 14:23 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
Add a tunable to fix a bug where chooseleaf may cause unnecessary PG
migrations when a device fails.
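A toy model of what stable changes, assuming chooseleaf_vary_r is 0 (so parent_r is 0) and no retries are needed. Under those assumptions the recursive crush_choose_firstn() call makes exactly one pass, and the only input that differs between modes is the starting rep: outpos without stable, 0 with it. toy_hash() is a hypothetical placeholder, not CRUSH's crush_hash32_3(), and choose_leaf() is likewise illustrative rather than code from mapper.c.

```c
#include <stdint.h>

/* Hypothetical placeholder hash; NOT crush_hash32_3(). */
static uint32_t toy_hash(uint32_t x, uint32_t bucket_id, uint32_t r)
{
	uint32_t h = x * 2654435761u;

	h ^= bucket_id * 40503u;
	h ^= r * 2246822519u;
	h ^= h >> 15;
	return h;
}

/*
 * Pick one of @nleaves (> 0) leaves under @bucket_id for input @x when
 * the enclosing chooseleaf step is filling output slot @outpos.
 * Models only the inner loop bounds from the patch:
 *   for (rep = stable ? 0 : outpos; rep < numrep; rep++)
 * with numrep = stable ? 1 : outpos + 1, i.e. a single iteration.
 */
static int choose_leaf(uint32_t x, uint32_t bucket_id, int nleaves,
		       int outpos, int stable)
{
	int rep = stable ? 0 : outpos;

	return (int)(toy_hash(x, bucket_id, (uint32_t)rep) % (uint32_t)nleaves);
}
```

In this model, with stable set the leaf picked under a given bucket no longer depends on outpos, so a host that moves from slot 2 to slot 1 (for example, because the host ahead of it was rejected) keeps the same device. Without stable the hash input changes and a different device may be chosen.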
Reflects ceph.git commit fdb3f664448e80d984470f32f04e2e6f03ab52ec.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
include/linux/crush/crush.h | 8 +++++++-
net/ceph/crush/mapper.c | 18 ++++++++++++++----
2 files changed, 21 insertions(+), 5 deletions(-)
diff --git a/include/linux/crush/crush.h b/include/linux/crush/crush.h
index 48b49305716b..be8f12b8f195 100644
--- a/include/linux/crush/crush.h
+++ b/include/linux/crush/crush.h
@@ -59,7 +59,8 @@ enum {
CRUSH_RULE_SET_CHOOSELEAF_TRIES = 9, /* override chooseleaf_descend_once */
CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES = 10,
CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES = 11,
- CRUSH_RULE_SET_CHOOSELEAF_VARY_R = 12
+ CRUSH_RULE_SET_CHOOSELEAF_VARY_R = 12,
+ CRUSH_RULE_SET_CHOOSELEAF_STABLE = 13
};
/*
@@ -205,6 +206,11 @@ struct crush_map {
* mappings line up a bit better with previous mappings. */
__u8 chooseleaf_vary_r;
+ /* if true, it makes chooseleaf firstn to return stable results (if
+ * no local retry) so that data migrations would be optimal when some
+ * device fails. */
+ __u8 chooseleaf_stable;
+
#ifndef __KERNEL__
/*
* version 0 (original) of straw_calc has various flaws. version 1
diff --git a/net/ceph/crush/mapper.c b/net/ceph/crush/mapper.c
index abb700621e4a..5fcfb98f309e 100644
--- a/net/ceph/crush/mapper.c
+++ b/net/ceph/crush/mapper.c
@@ -403,6 +403,7 @@ static int is_out(const struct crush_map *map,
* @local_retries: localized retries
* @local_fallback_retries: localized fallback retries
* @recurse_to_leaf: true if we want one device under each item of given type (chooseleaf instead of choose)
+ * @stable: stable mode starts rep=0 in the recursive call for all replicas
* @vary_r: pass r to recursive calls
* @out2: second output vector for leaf items (if @recurse_to_leaf)
* @parent_r: r value passed from the parent
@@ -419,6 +420,7 @@ static int crush_choose_firstn(const struct crush_map *map,
unsigned int local_fallback_retries,
int recurse_to_leaf,
unsigned int vary_r,
+ unsigned int stable,
int *out2,
int parent_r)
{
@@ -433,13 +435,13 @@ static int crush_choose_firstn(const struct crush_map *map,
int collide, reject;
int count = out_size;
- dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d tries %d recurse_tries %d local_retries %d local_fallback_retries %d parent_r %d\n",
+ dprintk("CHOOSE%s bucket %d x %d outpos %d numrep %d tries %d recurse_tries %d local_retries %d local_fallback_retries %d parent_r %d stable %d\n",
recurse_to_leaf ? "_LEAF" : "",
bucket->id, x, outpos, numrep,
tries, recurse_tries, local_retries, local_fallback_retries,
- parent_r);
+ parent_r, stable);
- for (rep = outpos; rep < numrep && count > 0 ; rep++) {
+ for (rep = stable ? 0 : outpos; rep < numrep && count > 0 ; rep++) {
/* keep trying until we get a non-out, non-colliding item */
ftotal = 0;
skip_rep = 0;
@@ -512,13 +514,14 @@ static int crush_choose_firstn(const struct crush_map *map,
if (crush_choose_firstn(map,
map->buckets[-1-item],
weight, weight_max,
- x, outpos+1, 0,
+ x, stable ? 1 : outpos+1, 0,
out2, outpos, count,
recurse_tries, 0,
local_retries,
local_fallback_retries,
0,
vary_r,
+ stable,
NULL,
sub_r) <= outpos)
/* didn't get leaf */
@@ -816,6 +819,7 @@ int crush_do_rule(const struct crush_map *map,
int choose_local_fallback_retries = map->choose_local_fallback_tries;
int vary_r = map->chooseleaf_vary_r;
+ int stable = map->chooseleaf_stable;
if ((__u32)ruleno >= map->max_rules) {
dprintk(" bad ruleno %d\n", ruleno);
@@ -870,6 +874,11 @@ int crush_do_rule(const struct crush_map *map,
vary_r = curstep->arg1;
break;
+ case CRUSH_RULE_SET_CHOOSELEAF_STABLE:
+ if (curstep->arg1 >= 0)
+ stable = curstep->arg1;
+ break;
+
case CRUSH_RULE_CHOOSELEAF_FIRSTN:
case CRUSH_RULE_CHOOSE_FIRSTN:
firstn = 1;
@@ -932,6 +941,7 @@ int crush_do_rule(const struct crush_map *map,
choose_local_fallback_retries,
recurse_to_leaf,
vary_r,
+ stable,
c+osize,
0);
} else {
--
2.4.3
* [PATCH 4/5] crush: decode and initialize chooseleaf_stable
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
2016-02-03 14:23 ` [PATCH 3/5] crush: add chooseleaf_stable tunable Ilya Dryomov
@ 2016-02-03 14:23 ` Ilya Dryomov
2016-02-03 14:23 ` [PATCH 5/5] libceph: advertise support for TUNABLES5 Ilya Dryomov
From: Ilya Dryomov @ 2016-02-03 14:23 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
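Decode the chooseleaf_stable tunable, skipping over straw_calc_version
and allowed_bucket_algs, which precede it in the encoding. Also add the
missing \n to the tunable dout()s while at it.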
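A hedged sketch of the decode order the osdmap.c hunk relies on, with the cursor positioned just after chooseleaf_vary_r. decode_chooseleaf_stable() is hypothetical; the real code uses ceph_decode_need(), which jumps to the done label when the buffer is too short, so older CRUSH maps simply stop decoding early. Here that early exit is modelled by returning a caller-supplied default.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical decoder for the tail of the tunables section.
 * @p points just past chooseleaf_vary_r, @end past the buffer.
 * Returns chooseleaf_stable, or @dflt if the encoding ends first.
 */
static uint8_t decode_chooseleaf_stable(const uint8_t *p, const uint8_t *end,
					uint8_t dflt)
{
	/* skip straw_calc_version (u8) and allowed_bucket_algs (u32) */
	if ((size_t)(end - p) < sizeof(uint8_t) + sizeof(uint32_t))
		return dflt;
	p += sizeof(uint8_t) + sizeof(uint32_t);

	/* chooseleaf_stable (u8) */
	if ((size_t)(end - p) < sizeof(uint8_t))
		return dflt;
	return *p;
}
```

Skipping rather than decoding the two intermediate fields matches the patch, which advances *p past them without storing them.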
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
net/ceph/osdmap.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 7d8f581d9f1f..243574c8cf33 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -342,23 +342,32 @@ static struct crush_map *crush_decode(void *pbyval, void *end)
c->choose_local_tries = ceph_decode_32(p);
c->choose_local_fallback_tries = ceph_decode_32(p);
c->choose_total_tries = ceph_decode_32(p);
- dout("crush decode tunable choose_local_tries = %d",
+ dout("crush decode tunable choose_local_tries = %d\n",
c->choose_local_tries);
- dout("crush decode tunable choose_local_fallback_tries = %d",
+ dout("crush decode tunable choose_local_fallback_tries = %d\n",
c->choose_local_fallback_tries);
- dout("crush decode tunable choose_total_tries = %d",
+ dout("crush decode tunable choose_total_tries = %d\n",
c->choose_total_tries);
ceph_decode_need(p, end, sizeof(u32), done);
c->chooseleaf_descend_once = ceph_decode_32(p);
- dout("crush decode tunable chooseleaf_descend_once = %d",
+ dout("crush decode tunable chooseleaf_descend_once = %d\n",
c->chooseleaf_descend_once);
ceph_decode_need(p, end, sizeof(u8), done);
c->chooseleaf_vary_r = ceph_decode_8(p);
- dout("crush decode tunable chooseleaf_vary_r = %d",
+ dout("crush decode tunable chooseleaf_vary_r = %d\n",
c->chooseleaf_vary_r);
+ /* skip straw_calc_version, allowed_bucket_algs */
+ ceph_decode_need(p, end, sizeof(u8) + sizeof(u32), done);
+ *p += sizeof(u8) + sizeof(u32);
+
+ ceph_decode_need(p, end, sizeof(u8), done);
+ c->chooseleaf_stable = ceph_decode_8(p);
+ dout("crush decode tunable chooseleaf_stable = %d\n",
+ c->chooseleaf_stable);
+
done:
dout("crush_decode success\n");
return c;
--
2.4.3
* [PATCH 5/5] libceph: advertise support for TUNABLES5
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
2016-02-03 14:23 ` [PATCH 4/5] crush: decode and initialize chooseleaf_stable Ilya Dryomov
@ 2016-02-03 14:23 ` Ilya Dryomov
2016-02-03 14:28 ` [PATCH 0/5] libceph: crush tunables5 Sage Weil
2016-02-03 14:46 ` Yan, Zheng
From: Ilya Dryomov @ 2016-02-03 14:23 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
Add the TUNABLES5 feature (chooseleaf_stable tunable) to the set of
features supported by default.
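Feature bits are plain u64 masks, so "supported by default" means CEPH_FEATURE_CRUSH_TUNABLES5 is ORed into CEPH_FEATURES_SUPPORTED_DEFAULT. A minimal sketch of the subset test such masks allow; features_cover() is a hypothetical helper, not part of libceph, and only the TUNABLES5 value is taken from the ceph_features.h hunk. The same bit can carry two names, as CEPH_FEATURE_MON_STATEFUL_SUB and CEPH_FEATURE_MON_ROUTE_OSDMAP both do with bit 57.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value from include/linux/ceph/ceph_features.h */
#define CEPH_FEATURE_CRUSH_TUNABLES5 (1ULL<<58)

/* Hypothetical helper: does @supported include every bit in @required? */
static bool features_cover(uint64_t supported, uint64_t required)
{
	return (required & ~supported) == 0;
}
```

A client whose supported mask lacks bit 58 would fail this test against any peer requiring TUNABLES5, which is why the bit has to be advertised once chooseleaf_stable is handled.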
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
include/linux/ceph/ceph_features.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
index fa3f82aae3be..6b4b059c158f 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -63,6 +63,16 @@
#define CEPH_FEATURE_OSD_MIN_SIZE_RECOVERY (1ULL<<49)
// duplicated since it was introduced at the same time as MIN_SIZE_RECOVERY
#define CEPH_FEATURE_OSD_PROXY_FEATURES (1ULL<<49) /* overlap w/ above */
+#define CEPH_FEATURE_MON_METADATA (1ULL<<50)
+#define CEPH_FEATURE_OSD_BITWISE_HOBJ_SORT (1ULL<<51) /* can sort objs bitwise */
+#define CEPH_FEATURE_OSD_PROXY_WRITE_FEATURES (1ULL<<52)
+#define CEPH_FEATURE_ERASURE_CODE_PLUGINS_V3 (1ULL<<53)
+#define CEPH_FEATURE_OSD_HITSET_GMT (1ULL<<54)
+#define CEPH_FEATURE_HAMMER_0_94_4 (1ULL<<55)
+#define CEPH_FEATURE_NEW_OSDOP_ENCODING (1ULL<<56) /* New, v7 encoding */
+#define CEPH_FEATURE_MON_STATEFUL_SUB (1ULL<<57) /* stateful mon subscription */
+#define CEPH_FEATURE_MON_ROUTE_OSDMAP (1ULL<<57) /* peon sends osdmaps */
+#define CEPH_FEATURE_CRUSH_TUNABLES5 (1ULL<<58) /* chooseleaf stable mode */
/*
* The introduction of CEPH_FEATURE_OSD_SNAPMAPPER caused the feature
@@ -109,7 +119,8 @@ static inline u64 ceph_sanitize_features(u64 features)
CEPH_FEATURE_CRUSH_TUNABLES3 | \
CEPH_FEATURE_OSD_PRIMARY_AFFINITY | \
CEPH_FEATURE_MSGR_KEEPALIVE2 | \
- CEPH_FEATURE_CRUSH_V4)
+ CEPH_FEATURE_CRUSH_V4 | \
+ CEPH_FEATURE_CRUSH_TUNABLES5)
#define CEPH_FEATURES_REQUIRED_DEFAULT \
(CEPH_FEATURE_NOSRCADDR | \
--
2.4.3
* Re: [PATCH 0/5] libceph: crush tunables5
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
2016-02-03 14:23 ` [PATCH 5/5] libceph: advertise support for TUNABLES5 Ilya Dryomov
@ 2016-02-03 14:28 ` Sage Weil
2016-02-03 14:46 ` Yan, Zheng
From: Sage Weil @ 2016-02-03 14:28 UTC (permalink / raw)
To: Ilya Dryomov; +Cc: ceph-devel
On Wed, 3 Feb 2016, Ilya Dryomov wrote:
> Hello,
>
> This series adds support for the new chooseleaf_stable CRUSH tunable
> and enables TUNABLES5 feature bit. It's likely that TUNABLES5 bit will
> be shared and cover request_redirect_t encoding in MOSDOpReply and
> possibly new file layouts but those aren't in master yet.
>
> Thanks,
>
> Ilya
>
>
> Ilya Dryomov (5):
> crush: ensure bucket id is valid before indexing buckets array
> crush: ensure take bucket value is valid
> crush: add chooseleaf_stable tunable
> crush: decode and initialize chooseleaf_stable
> libceph: advertise support for TUNABLES5
>
> include/linux/ceph/ceph_features.h | 13 ++++++++++++-
> include/linux/crush/crush.h | 8 +++++++-
> net/ceph/crush/mapper.c | 33 ++++++++++++++++++++++++++-------
> net/ceph/osdmap.c | 19 ++++++++++++++-----
> 4 files changed, 59 insertions(+), 14 deletions(-)
Whole series
Reviewed-by: Sage Weil <sage@redhat.com>
Thanks!
sage
* Re: [PATCH 0/5] libceph: crush tunables5
2016-02-03 14:23 [PATCH 0/5] libceph: crush tunables5 Ilya Dryomov
2016-02-03 14:28 ` [PATCH 0/5] libceph: crush tunables5 Sage Weil
@ 2016-02-03 14:46 ` Yan, Zheng
From: Yan, Zheng @ 2016-02-03 14:46 UTC (permalink / raw)
To: Ilya Dryomov; +Cc: ceph-devel, Sage Weil
On Wed, Feb 3, 2016 at 10:23 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> Hello,
>
> This series adds support for the new chooseleaf_stable CRUSH tunable
> and enables TUNABLES5 feature bit. It's likely that TUNABLES5 bit will
> be shared and cover request_redirect_t encoding in MOSDOpReply and
> possibly new file layouts but those aren't in master yet.
>
The file layout PR hasn't been merged yet. I will push the kernel patch
once it gets merged.
Regards
Yan, Zheng
> Thanks,
>
> Ilya
>
>
> Ilya Dryomov (5):
> crush: ensure bucket id is valid before indexing buckets array
> crush: ensure take bucket value is valid
> crush: add chooseleaf_stable tunable
> crush: decode and initialize chooseleaf_stable
> libceph: advertise support for TUNABLES5
>
> include/linux/ceph/ceph_features.h | 13 ++++++++++++-
> include/linux/crush/crush.h | 8 +++++++-
> net/ceph/crush/mapper.c | 33 ++++++++++++++++++++++++++-------
> net/ceph/osdmap.c | 19 ++++++++++++++-----
> 4 files changed, 59 insertions(+), 14 deletions(-)
>
> --
> 2.4.3
>