* [PATCH 0/6] rbd: version 2 parent probing
@ 2012-10-31  1:41 Alex Elder
  2012-10-31  1:49 ` [PATCH 1/6] rbd: skip getting image id if known Alex Elder
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:41 UTC (permalink / raw)
  To: ceph-devel

This series puts in place a few remaining pieces before finally
implementing the call to rbd_dev_probe() for the parent of a
layered rbd image if present.

					-Alex

[PATCH 1/6] rbd: skip getting image id if known
[PATCH 2/6] rbd: allow null image name
    These two take care of two issues that will arise once
    we have activated probing parents.
[PATCH 3/6] rbd: get parent spec for version 2 images
    This fetches the identities for a parent image.
[PATCH 4/6] libceph: define ceph_pg_pool_name_by_id()
[PATCH 5/6] rbd: get additional info in parent spec
    This populates the parent spec with the names corresponding
    to the ids for a parent image.
[PATCH 6/6] rbd: probe the parent of an image if present
    This finally adds a call to the probe routine, and
    handles teardown of layered images as well.


* [PATCH 1/6] rbd: skip getting image id if known
  2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
@ 2012-10-31  1:49 ` Alex Elder
  2012-10-31 21:05   ` Josh Durgin
  2012-10-31  1:49 ` [PATCH 2/6] rbd: allow null image name Alex Elder
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:49 UTC (permalink / raw)
  To: ceph-devel

We will know the image id for a format 2 parent image, but we won't
initially know its image name.  Avoid querying for the image id in
rbd_dev_image_id() if it is already known.

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8d26c0f..a852133 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3068,6 +3068,14 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
 	void *p;

 	/*
+	 * When probing a parent image, the image id is already
+	 * known (and the image name likely is not).  There's no
+	 * need to fetch the image id again in this case.
+	 */
+	if (rbd_dev->spec->image_id)
+		return 0;
+
+	/*
 	 * First, see if the format 2 image id file exists, and if
 	 * so, get the image's persistent id from it.
 	 */
-- 
1.7.9.5



* [PATCH 2/6] rbd: allow null image name
  2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
  2012-10-31  1:49 ` [PATCH 1/6] rbd: skip getting image id if known Alex Elder
@ 2012-10-31  1:49 ` Alex Elder
  2012-10-31 21:07   ` Josh Durgin
  2012-10-31  1:49 ` [PATCH 3/6] rbd: get parent spec for version 2 images Alex Elder
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:49 UTC (permalink / raw)
  To: ceph-devel

Format 2 parent images are partially identified by their image id,
but it may not be possible to determine their image name.  The name
is not strictly needed for correct operation, so we won't be
treating it as an error if we don't know it.  Handle this case
gracefully in rbd_name_show().

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a852133..28052ff 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1982,7 +1982,10 @@ static ssize_t rbd_name_show(struct device *dev,
 {
 	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);

-	return sprintf(buf, "%s\n", rbd_dev->spec->image_name);
+	if (rbd_dev->spec->image_name)
+		return sprintf(buf, "%s\n", rbd_dev->spec->image_name);
+
+	return sprintf(buf, "(unknown)\n");
 }

 static ssize_t rbd_image_id_show(struct device *dev,
-- 
1.7.9.5



* [PATCH 3/6] rbd: get parent spec for version 2 images
  2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
  2012-10-31  1:49 ` [PATCH 1/6] rbd: skip getting image id if known Alex Elder
  2012-10-31  1:49 ` [PATCH 2/6] rbd: allow null image name Alex Elder
@ 2012-10-31  1:49 ` Alex Elder
  2012-11-01  1:33   ` Josh Durgin
  2012-10-31  1:49 ` [PATCH 4/6] libceph: define ceph_pg_pool_name_by_id() Alex Elder
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:49 UTC (permalink / raw)
  To: ceph-devel

Add support for getting the information identifying the parent
image for rbd images that have one.  The child image holds a
reference to its parent image specification structure.  Create a new
entry "parent" in /sys/bus/rbd/devices/<N>/ to report the identifying
information for the parent image, if any.
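
For illustration, reading the new attribute on a mapped clone would produce
output shaped roughly like this (the device number and all of the values
below are invented, not taken from a real system):

	$ cat /sys/bus/rbd/devices/2/parent
	pool_id 3
	pool_name rbd
	image_id 1014b76b8b4567
	image_name parent-image
	snap_id 4
	snap_name snap1
	overlap 10737418240

If the image has no parent, the attribute simply reports "(no parent image)".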

Signed-off-by: Alex Elder <elder@inktank.com>
---
 Documentation/ABI/testing/sysfs-bus-rbd |    4 +
 drivers/block/rbd.c                     |  131 +++++++++++++++++++++++++++++++
 include/linux/ceph/rados.h              |    2 +
 3 files changed, 137 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-rbd b/Documentation/ABI/testing/sysfs-bus-rbd
index 1cf2adf..cd9213c 100644
--- a/Documentation/ABI/testing/sysfs-bus-rbd
+++ b/Documentation/ABI/testing/sysfs-bus-rbd
@@ -70,6 +70,10 @@ snap_*

 	A directory per each snapshot

+parent
+
+	Information identifying the pool, image, and snapshot id for
+	the parent image in a layered rbd image (format 2 only).

 Entries under /sys/bus/rbd/devices/<dev-id>/snap_<snap-name>
 -------------------------------------------------------------
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 28052ff..bce1fcf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -217,6 +217,9 @@ struct rbd_device {
 	struct ceph_osd_event   *watch_event;
 	struct ceph_osd_request *watch_request;

+	struct rbd_spec		*parent_spec;
+	u64			parent_overlap;
+
 	/* protects updating the header */
 	struct rw_semaphore     header_rwsem;

@@ -2009,6 +2012,49 @@ static ssize_t rbd_snap_show(struct device *dev,
 	return sprintf(buf, "%s\n", rbd_dev->spec->snap_name);
 }

+/*
+ * For an rbd v2 image, shows the pool id, image id, and snapshot id
+ * for the parent image.  If there is no parent, simply shows
+ * "(no parent image)".
+ */
+static ssize_t rbd_parent_show(struct device *dev,
+			     struct device_attribute *attr,
+			     char *buf)
+{
+	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
+	struct rbd_spec *spec = rbd_dev->parent_spec;
+	int count;
+	char *bufp = buf;
+
+	if (!spec)
+		return sprintf(buf, "(no parent image)\n");
+
+	count = sprintf(bufp, "pool_id %llu\npool_name %s\n",
+			(unsigned long long) spec->pool_id, spec->pool_name);
+	if (count < 0)
+		return count;
+	bufp += count;
+
+	count = sprintf(bufp, "image_id %s\nimage_name %s\n", spec->image_id,
+			spec->image_name ? spec->image_name : "(unknown)");
+	if (count < 0)
+		return count;
+	bufp += count;
+
+	count = sprintf(bufp, "snap_id %llu\nsnap_name %s\n",
+			(unsigned long long) spec->snap_id, spec->snap_name);
+	if (count < 0)
+		return count;
+	bufp += count;
+
+	count = sprintf(bufp, "overlap %llu\n", rbd_dev->parent_overlap);
+	if (count < 0)
+		return count;
+	bufp += count;
+
+	return (ssize_t) (bufp - buf);
+}
+
 static ssize_t rbd_image_refresh(struct device *dev,
 				 struct device_attribute *attr,
 				 const char *buf,
@@ -2032,6 +2078,7 @@ static DEVICE_ATTR(name, S_IRUGO, rbd_name_show, NULL);
 static DEVICE_ATTR(image_id, S_IRUGO, rbd_image_id_show, NULL);
 static DEVICE_ATTR(refresh, S_IWUSR, NULL, rbd_image_refresh);
 static DEVICE_ATTR(current_snap, S_IRUGO, rbd_snap_show, NULL);
+static DEVICE_ATTR(parent, S_IRUGO, rbd_parent_show, NULL);

 static struct attribute *rbd_attrs[] = {
 	&dev_attr_size.attr,
@@ -2043,6 +2090,7 @@ static struct attribute *rbd_attrs[] = {
 	&dev_attr_name.attr,
 	&dev_attr_image_id.attr,
 	&dev_attr_current_snap.attr,
+	&dev_attr_parent.attr,
 	&dev_attr_refresh.attr,
 	NULL
 };
@@ -2192,6 +2240,7 @@ struct rbd_device *rbd_dev_create(struct rbd_client *rbdc,

 static void rbd_dev_destroy(struct rbd_device *rbd_dev)
 {
+	rbd_spec_put(rbd_dev->parent_spec);
 	kfree(rbd_dev->header_name);
 	rbd_put_client(rbd_dev->rbd_client);
 	rbd_spec_put(rbd_dev->spec);
@@ -2400,6 +2449,71 @@ static int rbd_dev_v2_features(struct rbd_device *rbd_dev)
 						&rbd_dev->header.features);
 }

+static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
+{
+	struct rbd_spec *parent_spec;
+	size_t size;
+	void *reply_buf = NULL;
+	__le64 snapid;
+	void *p;
+	void *end;
+	char *image_id;
+	u64 overlap;
+	size_t len = 0;
+	int ret;
+
+	parent_spec = rbd_spec_alloc();
+	if (!parent_spec)
+		return -ENOMEM;
+
+	size = sizeof (__le64) +				/* pool_id */
+		sizeof (__le32) + RBD_IMAGE_ID_LEN_MAX +	/* image_id */
+		sizeof (__le64) +				/* snap_id */
+		sizeof (__le64);				/* overlap */
+	reply_buf = kmalloc(size, GFP_KERNEL);
+	if (!reply_buf) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	snapid = cpu_to_le64(CEPH_NOSNAP);
+	ret = rbd_req_sync_exec(rbd_dev, rbd_dev->header_name,
+				"rbd", "get_parent",
+				(char *) &snapid, sizeof (snapid),
+				(char *) reply_buf, size,
+				CEPH_OSD_FLAG_READ, NULL);
+	dout("%s: rbd_req_sync_exec returned %d\n", __func__, ret);
+	if (ret < 0)
+		goto out_err;
+
+	ret = -ERANGE;
+	p = reply_buf;
+	end = (char *) reply_buf + size;
+	ceph_decode_64_safe(&p, end, parent_spec->pool_id, out_err);
+	if (parent_spec->pool_id == CEPH_NOPOOL)
+		goto out;	/* No parent?  No problem. */
+
+	image_id = ceph_extract_encoded_string(&p, end, &len, GFP_KERNEL);
+	if (IS_ERR(image_id)) {
+		ret = PTR_ERR(image_id);
+		goto out_err;
+	}
+	parent_spec->image_id = image_id;
+	ceph_decode_64_safe(&p, end, parent_spec->snap_id, out_err);
+	ceph_decode_64_safe(&p, end, overlap, out_err);
+
+	rbd_dev->parent_overlap = overlap;
+	rbd_dev->parent_spec = parent_spec;
+	parent_spec = NULL;	/* rbd_dev now owns this */
+out:
+	ret = 0;
+out_err:
+	kfree(reply_buf);
+	rbd_spec_put(parent_spec);
+
+	return ret;
+}
+
 static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev, u64 *ver)
 {
 	size_t size;
@@ -3154,6 +3268,12 @@ static int rbd_dev_v1_probe(struct rbd_device *rbd_dev)
 	ret = rbd_read_header(rbd_dev, &rbd_dev->header);
 	if (ret < 0)
 		goto out_err;
+
+	/* Version 1 images have no parent (no layering) */
+
+	rbd_dev->parent_spec = NULL;
+	rbd_dev->parent_overlap = 0;
+
 	rbd_dev->image_format = 1;

 	dout("discovered version 1 image, header name is %s\n",
@@ -3205,6 +3325,14 @@ static int rbd_dev_v2_probe(struct rbd_device *rbd_dev)
 	if (ret < 0)
 		goto out_err;

+	/* If the image supports layering, get the parent info */
+
+	if (rbd_dev->header.features & RBD_FEATURE_LAYERING) {
+		ret = rbd_dev_v2_parent_info(rbd_dev);
+		if (ret < 0)
+			goto out_err;
+	}
+
 	/* crypto and compression type aren't (yet) supported for v2 images */

 	rbd_dev->header.crypt_type = 0;
@@ -3224,6 +3352,9 @@ static int rbd_dev_v2_probe(struct rbd_device *rbd_dev)

 	return 0;
 out_err:
+	rbd_dev->parent_overlap = 0;
+	rbd_spec_put(rbd_dev->parent_spec);
+	rbd_dev->parent_spec = NULL;
 	kfree(rbd_dev->header_name);
 	rbd_dev->header_name = NULL;
 	kfree(rbd_dev->header.object_prefix);
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 0a99099..15077db 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -87,6 +87,8 @@ struct ceph_pg {
  *
  *  lpgp_num -- as above.
  */
+#define CEPH_NOPOOL  ((__u64) (-1))  /* pool id not defined */
+
 #define CEPH_PG_TYPE_REP     1
 #define CEPH_PG_TYPE_RAID4   2
 #define CEPH_PG_POOL_VERSION 2
-- 
1.7.9.5



* [PATCH 4/6] libceph: define ceph_pg_pool_name_by_id()
  2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
                   ` (2 preceding siblings ...)
  2012-10-31  1:49 ` [PATCH 3/6] rbd: get parent spec for version 2 images Alex Elder
@ 2012-10-31  1:49 ` Alex Elder
  2012-11-01  1:34   ` Josh Durgin
  2012-10-31  1:49 ` [PATCH 5/6] rbd: get additional info in parent spec Alex Elder
  2012-10-31  1:50 ` [PATCH 6/6] rbd: probe the parent of an image if present Alex Elder
  5 siblings, 1 reply; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:49 UTC (permalink / raw)
  To: ceph-devel

Define and export function ceph_pg_pool_name_by_id() to supply
the name of a pg pool whose id is given.  This will be used by
the next patch.
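
As a rough caller sketch (this mirrors how the next patch ends up using it;
the osdc pointer and pool_id value are assumed to be supplied by the caller):

	struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
	const char *name;

	name = ceph_pg_pool_name_by_id(osdc->osdmap, pool_id);
	if (!name)
		return -EIO;	/* unknown pool id, or id out of range */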

Signed-off-by: Alex Elder <elder@inktank.com>
---
 include/linux/ceph/osdmap.h |    1 +
 net/ceph/osdmap.c           |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
index e88a620..5ea57ba 100644
--- a/include/linux/ceph/osdmap.h
+++ b/include/linux/ceph/osdmap.h
@@ -123,6 +123,7 @@ extern int ceph_calc_pg_acting(struct ceph_osdmap *osdmap, struct ceph_pg pgid,
 extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
 				struct ceph_pg pgid);

+extern const char *ceph_pg_pool_name_by_id(struct ceph_osdmap *map, u64 id);
 extern int ceph_pg_poolid_by_name(struct ceph_osdmap *map, const char *name);

 #endif
diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index f552aa4..de73214 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -469,6 +469,22 @@ static struct ceph_pg_pool_info *__lookup_pg_pool(struct rb_root *root, int id)
 	return NULL;
 }

+const char *ceph_pg_pool_name_by_id(struct ceph_osdmap *map, u64 id)
+{
+	struct ceph_pg_pool_info *pi;
+
+	if (id == CEPH_NOPOOL)
+		return NULL;
+
+	if (WARN_ON_ONCE(id > (u64) INT_MAX))
+		return NULL;
+
+	pi = __lookup_pg_pool(&map->pg_pools, (int) id);
+
+	return pi ? pi->name : NULL;
+}
+EXPORT_SYMBOL(ceph_pg_pool_name_by_id);
+
 int ceph_pg_poolid_by_name(struct ceph_osdmap *map, const char *name)
 {
 	struct rb_node *rbp;
-- 
1.7.9.5



* [PATCH 5/6] rbd: get additional info in parent spec
  2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
                   ` (3 preceding siblings ...)
  2012-10-31  1:49 ` [PATCH 4/6] libceph: define ceph_pg_pool_name_by_id() Alex Elder
@ 2012-10-31  1:49 ` Alex Elder
  2012-10-31 14:11   ` Alex Elder
  2012-11-01  1:49   ` Josh Durgin
  2012-10-31  1:50 ` [PATCH 6/6] rbd: probe the parent of an image if present Alex Elder
  5 siblings, 2 replies; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:49 UTC (permalink / raw)
  To: ceph-devel

When a layered rbd image has a parent, that parent is identified
only by its pool id, image id, and snapshot id.  Images that have
been mapped also record *names* for those three id's.

Add code to look up these names for parent images so they match
mapped images more closely.  Skip doing this for an image if it
already has its pool name defined (this will be the case for images
mapped by the user).

It is possible that the name of a parent image can't be
determined, even if the image id is valid.  If this occurs it
does not preclude correct operation, so don't treat this as
an error.

On the other hand, defined pools will always have both an id and a
name.   And any snapshot of an image identified as a parent for a
clone image will exist, and will have a name (if not it indicates
some other internal error).  So treat failure to get these bits
of information as errors.
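
To summarize the fill-in order implemented below (just a restatement of the
code, not new behavior):

	pool name:  ceph_pg_pool_name_by_id() on the current osdmap; failure is an error
	image name: "dir_get_name" class method on the RBD_DIRECTORY object; failure is tolerated
	snap name:  looked up in the device's snapshot list; failure is an error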

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |  131 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 131 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index bce1fcf..04062c1 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -70,7 +70,10 @@

 #define RBD_SNAP_HEAD_NAME	"-"

+/* This allows a single page to hold an image name sent by OSD */
+#define RBD_IMAGE_NAME_LEN_MAX	(PAGE_SIZE - sizeof (__le32) - 1)
 #define RBD_IMAGE_ID_LEN_MAX	64
+
 #define RBD_OBJ_PREFIX_LEN_MAX	64

 /* Feature bits */
@@ -658,6 +661,20 @@ out_err:
 	return -ENOMEM;
 }

+static const char *rbd_snap_name(struct rbd_device *rbd_dev, u64 snap_id)
+{
+	struct rbd_snap *snap;
+
+	if (snap_id == CEPH_NOSNAP)
+		return RBD_SNAP_HEAD_NAME;
+
+	list_for_each_entry(snap, &rbd_dev->snaps, node)
+		if (snap_id == snap->id)
+			return snap->name;
+
+	return NULL;
+}
+
 static int snap_by_name(struct rbd_device *rbd_dev, const char *snap_name)
 {

@@ -2499,6 +2516,7 @@ static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
 		goto out_err;
 	}
 	parent_spec->image_id = image_id;
+	parent_spec->image_id_len = len;
 	ceph_decode_64_safe(&p, end, parent_spec->snap_id, out_err);
 	ceph_decode_64_safe(&p, end, overlap, out_err);

@@ -2514,6 +2532,115 @@ out_err:
 	return ret;
 }

+static char *rbd_dev_image_name(struct rbd_device *rbd_dev)
+{
+	size_t image_id_size;
+	char *image_id;
+	void *p;
+	void *end;
+	size_t size;
+	void *reply_buf = NULL;
+	size_t len = 0;
+	char *image_name = NULL;
+	int ret;
+
+	rbd_assert(!rbd_dev->spec->image_name);
+
+	image_id_size = sizeof (__le32) + rbd_dev->spec->image_id_len;
+	image_id = kmalloc(image_id_size, GFP_KERNEL);
+	if (!image_id)
+		return NULL;
+
+	p = image_id;
+	end = (char *) image_id + image_id_size;
+	ceph_encode_string(&p, end, rbd_dev->spec->image_id,
+				(u32) rbd_dev->spec->image_id_len);
+
+	size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX;
+	reply_buf = kmalloc(size, GFP_KERNEL);
+	if (!reply_buf)
+		goto out;
+
+	ret = rbd_req_sync_exec(rbd_dev, RBD_DIRECTORY,
+				"rbd", "dir_get_name",
+				image_id, image_id_size,
+				(char *) reply_buf, size,
+				CEPH_OSD_FLAG_READ, NULL);
+	if (ret < 0)
+		goto out;
+	p = reply_buf;
+	end = (char *) reply_buf + size;
+	image_name = ceph_extract_encoded_string(&p, end, &len, GFP_KERNEL);
+	if (image_name)
+		dout("%s: name is %s len is %zd\n", __func__, image_name, len);
+out:
+	kfree(reply_buf);
+	kfree(image_id);
+
+	return image_name;
+}
+
+/*
+ * When a parent image gets probed, we only have the pool, image,
+ * and snapshot ids but not the names of any of them.  This call
+ * is made later to fill in those names.  It has to be done after
+ * rbd_dev_snaps_update() has completed because some of the
+ * information (in particular, snapshot name) is not available
+ * until then.
+ */
+static int rbd_dev_probe_update_spec(struct rbd_device *rbd_dev)
+{
+	struct ceph_osd_client *osdc;
+	const char *name;
+	void *reply_buf = NULL;
+	int ret;
+
+	if (rbd_dev->spec->pool_name)
+		return 0;	/* Already have the names */
+
+	/* Look up the pool name */
+
+	osdc = &rbd_dev->rbd_client->client->osdc;
+	name = ceph_pg_pool_name_by_id(osdc->osdmap, rbd_dev->spec->pool_id);
+	if (!name)
+		return -EIO;	/* pool id too large (>= 2^31) */
+
+	rbd_dev->spec->pool_name = kstrdup(name, GFP_KERNEL);
+	if (!rbd_dev->spec->pool_name)
+		return -ENOMEM;
+
+	/* Fetch the image name; tolerate failure here */
+
+	name = rbd_dev_image_name(rbd_dev);
+	if (name) {
+		rbd_dev->spec->image_name_len = strlen(name);
+		rbd_dev->spec->image_name = (char *) name;
+	} else {
+		pr_warning(RBD_DRV_NAME "%d "
+			"unable to get image name for image id %s\n",
+			rbd_dev->major, rbd_dev->spec->image_id);
+	}
+
+	/* Look up the snapshot name. */
+
+	name = rbd_snap_name(rbd_dev, rbd_dev->spec->snap_id);
+	if (!name) {
+		ret = -EIO;
+		goto out_err;
+	}
+	rbd_dev->spec->snap_name = kstrdup(name, GFP_KERNEL);
+	if (!rbd_dev->spec->snap_name) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	return 0;
+out_err:
+	kfree(reply_buf);
+	kfree(rbd_dev->spec->pool_name);
+	rbd_dev->spec->pool_name = NULL;
+
+	return ret;
+}
+
 static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev, u64 *ver)
 {
 	size_t size;
@@ -3372,6 +3499,10 @@ static int rbd_dev_probe_finish(struct rbd_device *rbd_dev)
 	if (ret)
 		return ret;

+	ret = rbd_dev_probe_update_spec(rbd_dev);
+	if (ret)
+		goto err_out_snaps;
+
 	ret = rbd_dev_set_mapping(rbd_dev);
 	if (ret)
 		goto err_out_snaps;
-- 
1.7.9.5



* [PATCH 6/6] rbd: probe the parent of an image if present
  2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
                   ` (4 preceding siblings ...)
  2012-10-31  1:49 ` [PATCH 5/6] rbd: get additional info in parent spec Alex Elder
@ 2012-10-31  1:50 ` Alex Elder
  2012-10-31 11:59   ` slow fio random read benchmark, need help Alexandre DERUMIER
  2012-11-01  2:07   ` [PATCH 6/6] rbd: probe the parent of an image if present Josh Durgin
  5 siblings, 2 replies; 41+ messages in thread
From: Alex Elder @ 2012-10-31  1:50 UTC (permalink / raw)
  To: ceph-devel

Call the probe function for the parent device.
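
To make the teardown added to rbd_remove() below concrete, here is the order
for a hypothetical chain with a parent and a grandparent (an illustration,
not taken from the patch):

	pass 1:  walk down to the grandparent (it has no parent of its own),
	         remove it, and clear the parent's parent_spec/parent_overlap/
	         parent fields
	pass 2:  the parent now has no grandparent, so remove it and clear the
	         same fields in the child
	finally: remove the child device that was named in the "remove" request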

Signed-off-by: Alex Elder <elder@inktank.com>
---
 drivers/block/rbd.c |   79 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 76 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 04062c1..8ef13f72 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -222,6 +222,7 @@ struct rbd_device {

 	struct rbd_spec		*parent_spec;
 	u64			parent_overlap;
+	struct rbd_device	*parent;

 	/* protects updating the header */
 	struct rw_semaphore     header_rwsem;
@@ -255,6 +256,7 @@ static ssize_t rbd_add(struct bus_type *bus, const char *buf,
 		       size_t count);
 static ssize_t rbd_remove(struct bus_type *bus, const char *buf,
 			  size_t count);
+static int rbd_dev_probe(struct rbd_device *rbd_dev);

 static struct bus_attribute rbd_bus_attrs[] = {
 	__ATTR(add, S_IWUSR, NULL, rbd_add),
@@ -378,6 +380,13 @@ out_opt:
 	return ERR_PTR(ret);
 }

+static struct rbd_client *__rbd_get_client(struct rbd_client *rbdc)
+{
+	kref_get(&rbdc->kref);
+
+	return rbdc;
+}
+
 /*
  * Find a ceph client with specific addr and configuration.  If
  * found, bump its reference count.
@@ -393,7 +402,8 @@ static struct rbd_client *rbd_client_find(struct ceph_options *ceph_opts)
 	spin_lock(&rbd_client_list_lock);
 	list_for_each_entry(client_node, &rbd_client_list, node) {
 		if (!ceph_compare_options(ceph_opts, client_node->client)) {
-			kref_get(&client_node->kref);
+			__rbd_get_client(client_node);
+
 			found = true;
 			break;
 		}
@@ -3311,6 +3321,11 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
 	void *response;
 	void *p;

+	/* If we already have it we don't need to look it up */
+
+	if (rbd_dev->spec->image_id)
+		return 0;
+
 	/*
 	 * When probing a parent image, the image id is already
 	 * known (and the image name likely is not).  There's no
@@ -3492,6 +3507,9 @@ out_err:

 static int rbd_dev_probe_finish(struct rbd_device *rbd_dev)
 {
+	struct rbd_device *parent = NULL;
+	struct rbd_spec *parent_spec = NULL;
+	struct rbd_client *rbdc = NULL;
 	int ret;

 	/* no need to lock here, as rbd_dev is not registered yet */
@@ -3536,6 +3554,31 @@ static int rbd_dev_probe_finish(struct rbd_device *rbd_dev)
 	 * At this point cleanup in the event of an error is the job
 	 * of the sysfs code (initiated by rbd_bus_del_dev()).
 	 */
+	/* Probe the parent if there is one */
+
+	if (rbd_dev->parent_spec) {
+		/*
+		 * We need to pass a reference to the client and the
+		 * parent spec when creating the parent rbd_dev.
+		 * Images related by parent/child relationships
+		 * always share both.
+		 */
+		parent_spec = rbd_spec_get(rbd_dev->parent_spec);
+		rbdc = __rbd_get_client(rbd_dev->rbd_client);
+
+		parent = rbd_dev_create(rbdc, parent_spec);
+		if (!parent) {
+			ret = -ENOMEM;
+			goto err_out_spec;
+		}
+		rbdc = NULL;		/* parent now owns reference */
+		parent_spec = NULL;	/* parent now owns reference */
+		ret = rbd_dev_probe(parent);
+		if (ret < 0)
+			goto err_out_parent;
+		rbd_dev->parent = parent;
+	}
+
 	down_write(&rbd_dev->header_rwsem);
 	ret = rbd_dev_snaps_register(rbd_dev);
 	up_write(&rbd_dev->header_rwsem);
@@ -3554,6 +3597,12 @@ static int rbd_dev_probe_finish(struct rbd_device *rbd_dev)
 		(unsigned long long) rbd_dev->mapping.size);

 	return ret;
+
+err_out_parent:
+	rbd_dev_destroy(parent);
+err_out_spec:
+	rbd_spec_put(parent_spec);
+	rbd_put_client(rbdc);
 err_out_bus:
 	/* this will also clean up rest of rbd_dev stuff */

@@ -3717,6 +3766,12 @@ static void rbd_dev_release(struct device *dev)
 	module_put(THIS_MODULE);
 }

+static void __rbd_remove(struct rbd_device *rbd_dev)
+{
+	rbd_remove_all_snaps(rbd_dev);
+	rbd_bus_del_dev(rbd_dev);
+}
+
 static ssize_t rbd_remove(struct bus_type *bus,
 			  const char *buf,
 			  size_t count)
@@ -3743,8 +3798,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
 		goto done;
 	}

-	rbd_remove_all_snaps(rbd_dev);
-	rbd_bus_del_dev(rbd_dev);
+	while (rbd_dev->parent_spec) {
+		struct rbd_device *first = rbd_dev;
+		struct rbd_device *second = first->parent;
+		struct rbd_device *third;
+
+		/*
+		 * Follow to the parent with no grandparent and
+		 * remove it.
+		 */
+		while (second && (third = second->parent)) {
+			first = second;
+			second = third;
+		}
+		__rbd_remove(second);
+		rbd_spec_put(first->parent_spec);
+		first->parent_spec = NULL;
+		first->parent_overlap = 0;
+		first->parent = NULL;
+	}
+	__rbd_remove(rbd_dev);

 done:
 	mutex_unlock(&ctl_mutex);
-- 
1.7.9.5



* slow fio random read benchmark, need help
  2012-10-31  1:50 ` [PATCH 6/6] rbd: probe the parent of an image if present Alex Elder
@ 2012-10-31 11:59   ` Alexandre DERUMIER
  2012-10-31 15:57     ` Sage Weil
  2012-11-01  2:07   ` [PATCH 6/6] rbd: probe the parent of an image if present Josh Durgin
  1 sibling, 1 reply; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-10-31 11:59 UTC (permalink / raw)
  To: ceph-devel

Hello,

I'm doing some tests with fio from a qemu 1.2 guest (virtio disk, cache=none), randread, with 4K block size over a small 1G range (so it can be handled by the buffer cache on the ceph cluster)


fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40  --group_reporting --name=file1 --ioengine=libaio --direct=1


I can't get more than 5000 iops.


RBD cluster is :
---------------
3 nodes,with each node : 
-6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
-cpu: 2x 4 cores intel xeon E5420@2.5GHZ
rbd 0.53

ceph.conf

        journal dio = false
        filestore fiemap = false
        filestore flusher = false
        osd op threads = 24
        osd disk threads = 24
        filestore op threads = 6

kvm host is : 4 x 12 cores opteron
------------


During the bench:

on ceph nodes:
- cpu is around 10% used
- iostat shows no disk activity on the osds (so I think the 1G working set is handled in the linux buffer cache)


on kvm host:

-cpu is around 20% used


I really don't see where the bottleneck is....

Any Ideas, hints ?


Regards,

Alexandre


* Re: [PATCH 5/6] rbd: get additional info in parent spec
  2012-10-31  1:49 ` [PATCH 5/6] rbd: get additional info in parent spec Alex Elder
@ 2012-10-31 14:11   ` Alex Elder
  2012-11-01  1:49   ` Josh Durgin
  1 sibling, 0 replies; 41+ messages in thread
From: Alex Elder @ 2012-10-31 14:11 UTC (permalink / raw)
  To: ceph-devel

On 10/30/2012 08:49 PM, Alex Elder wrote:
> When a layered rbd image has a parent, that parent is identified
> only by its pool id, image id, and snapshot id.  Images that have
> been mapped also record *names* for those three id's.
> 
> Add code to look up these names for parent images so they match
> mapped images more closely.  Skip doing this for an image if it
> already has its pool name defined (this will be the case for images
> mapped by the user).
> 
> It is possible that the name of a parent image can't be
> determined, even if the image id is valid.  If this occurs it
> does not preclude correct operation, so don't treat this as
> an error.
> 
> On the other hand, defined pools will always have both an id and a
> name.   And any snapshot of an image identified as a parent for a
> clone image will exist, and will have a name (if not it indicates
> some other internal error).  So treat failure to get these bits
> of information as errors.
> 
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>  drivers/block/rbd.c |  131
> +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 131 insertions(+)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index bce1fcf..04062c1 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c

. . .

> @@ -2514,6 +2532,115 @@ out_err:
>  	return ret;
>  }
> 
> +static char *rbd_dev_image_name(struct rbd_device *rbd_dev)
> +{
> +	size_t image_id_size;
> +	char *image_id;
> +	void *p;
> +	void *end;
> +	size_t size;
> +	void *reply_buf = NULL;
> +	size_t len = 0;
> +	char *image_name = NULL;
> +	int ret;
> +
> +	rbd_assert(!rbd_dev->spec->image_name);
> +
> +	image_id_size = sizeof (__le32) + rbd_dev->spec->image_id_len;
> +	image_id = kmalloc(image_id_size, GFP_KERNEL);
> +	if (!image_id)
> +		return NULL;
> +
> +	p = image_id;
> +	end = (char *) image_id + image_id_size;
> +	ceph_encode_string(&p, end, rbd_dev->spec->image_id,
> +				(u32) rbd_dev->spec->image_id_len);
> +
> +	size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX;
> +	reply_buf = kmalloc(size, GFP_KERNEL);
> +	if (!reply_buf)
> +		goto out;
> +
> +	ret = rbd_req_sync_exec(rbd_dev, RBD_DIRECTORY,
> +				"rbd", "dir_get_name",
> +				image_id, image_id_size,
> +				(char *) reply_buf, size,
> +				CEPH_OSD_FLAG_READ, NULL);
> +	if (ret < 0)
> +		goto out;
> +	p = reply_buf;
> +	end = (char *) reply_buf + size;
> +	image_name = ceph_extract_encoded_string(&p, end, &len, GFP_KERNEL);

The next line will need to be changed to:

	if (IS_ERR(image_name))
		image_name = NULL;
	else

> +	if (image_name)
> +		dout("%s: name is %s len is %zd\n", __func__, image_name, len);
> +out:
> +	kfree(reply_buf);
> +	kfree(image_id);
> +
> +	return image_name;
> +}
> +

. . .



* Re: slow fio random read benchmark, need help
  2012-10-31 11:59   ` slow fio random read benchmark, need help Alexandre DERUMIER
@ 2012-10-31 15:57     ` Sage Weil
  2012-10-31 16:29       ` Alexandre DERUMIER
  0 siblings, 1 reply; 41+ messages in thread
From: Sage Weil @ 2012-10-31 15:57 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel

On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
> Hello,
> 
> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
> 
> 
> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40  --group_reporting --name=file1 --ioengine=libaio --direct=1
> 
> 
> I can't get more than 5000 iops.

Have you tried increasing the iodepth?

sage

> 
> 
> RBD cluster is :
> ---------------
> 3 nodes,with each node : 
> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
> rbd 0.53
> 
> ceph.conf
> 
>         journal dio = false
>         filestore fiemap = false
>         filestore flusher = false
>         osd op threads = 24
>         osd disk threads = 24
>         filestore op threads = 6
> 
> kvm host is : 4 x 12 cores opteron
> ------------
> 
> 
> During the bench:
> 
> on ceph nodes:
> - cpu  is around 10% used
> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
> 
> 
> on kvm host:
> 
> -cpu is around 20% used
> 
> 
> I really don't see where is the bottleneck....
> 
> Any Ideas, hints ?
> 
> 
> Regards,
> 
> Alexandre
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: slow fio random read benchmark, need help
  2012-10-31 15:57     ` Sage Weil
@ 2012-10-31 16:29       ` Alexandre DERUMIER
  2012-10-31 16:50         ` Alexandre DERUMIER
  2012-10-31 17:08         ` Marcus Sorensen
  0 siblings, 2 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-10-31 16:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

>>Have you tried increasing the iodepth? 
Yes, I have tried with 100 and 200, same results.

I have also tried directly from the host, with /dev/rbd1, and I get the same result.
I have also tried with 3 different hosts, with different cpu models.

(note: I can reach around 40,000 iops with the same fio config on a zfs iscsi array)

My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.


Do you have an idea if I can trace something ?

Thanks,

Alexandre

----- Mail original ----- 

De: "Sage Weil" <sage@inktank.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 31 Octobre 2012 16:57:05 
Objet: Re: slow fio random read benchmark, need help 

On Wed, 31 Oct 2012, Alexandre DERUMIER wrote: 
> Hello, 
> 
> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster) 
> 
> 
> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1 
> 
> 
> I can't get more than 5000 iops. 

Have you tried increasing the iodepth? 

sage 

> 
> 
> RBD cluster is : 
> --------------- 
> 3 nodes,with each node : 
> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ 
> rbd 0.53 
> 
> ceph.conf 
> 
> journal dio = false 
> filestore fiemap = false 
> filestore flusher = false 
> osd op threads = 24 
> osd disk threads = 24 
> filestore op threads = 6 
> 
> kvm host is : 4 x 12 cores opteron 
> ------------ 
> 
> 
> During the bench: 
> 
> on ceph nodes: 
> - cpu is around 10% used 
> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer) 
> 
> 
> on kvm host: 
> 
> -cpu is around 20% used 
> 
> 
> I really don't see where is the bottleneck.... 
> 
> Any Ideas, hints ? 
> 
> 
> Regards, 
> 
> Alexandre 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
> 


* Re: slow fio random read benchmark, need help
  2012-10-31 16:29       ` Alexandre DERUMIER
@ 2012-10-31 16:50         ` Alexandre DERUMIER
  2012-10-31 17:08         ` Marcus Sorensen
  1 sibling, 0 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-10-31 16:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Also, I have the same results with 8K or 16K block sizes....

Don't know if it helps, but here is an extract of the perf dump of 1 mon and 1 osd

ceph --admin-daemon ceph-mon.a.asok perf dump
{"cluster":{"num_mon":3,"num_mon_quorum":3,"num_osd":15,"num_osd_up":15,"num_osd_in":15,"osd_epoch":54,"osd_kb":2140015680,"osd_kb_used":627624,"osd_kb_avail":2139388056,"num_pool":3,"num_pg":3072,"num_pg_active_clean":3072,"num_pg_active":3072,"num_pg_peering":0,"num_object":3,"num_object_degraded":0,"num_object_unfound":0,"num_bytes":274,"num_mds_up":0,"num_mds_in":0,"num_mds_failed":0,"mds_epoch":1},"mon":{},"throttle-mon_client_bytes":{"val":0,"max":104857600,"get":8773,"get_sum":556770,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":8773,"put_sum":556770,"wait":{"avgcount":0,"sum":0}},"throttle-mon_daemon_bytes":{"val":0,"max":419430400,"get":1308,"get_sum":1859977,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":1308,"put_sum":1859977,"wait":{"avgcount":0,"sum":0}},"throttle-msgr_dispatch_throttler-mon":{"val":0,"max":104857600,"get":76565,"get_sum":14066376,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":76565,"put_sum":14066376,"wait":{"avgcount":0,"sum":0}}}


 ceph --admin-daemon ceph-osd.1.asok perf dump
{"filestore":{"journal_queue_max_ops":500,"journal_queue_ops":0,"journal_ops":2847,"journal_queue_max_bytes":104857600,"journal_queue_bytes":0,"journal_bytes":10502288,"journal_latency":{"avgcount":2847,"sum":3.553},"journal_wr":1523,"journal_wr_bytes":{"avgcount":1523,"sum":31055872},"op_queue_max_ops":500,"op_queue_ops":0,"ops":2847,"op_queue_max_bytes":104857600,"op_queue_bytes":0,"bytes":10487898,"apply_latency":{"avgcount":2847,"sum":114.43},"committing":0,"commitcycle":12,"commitcycle_interval":{"avgcount":12,"sum":60.1172},"commitcycle_latency":{"avgcount":12,"sum":0.116291},"journal_full":0},"osd":{"opq":0,"op_wip":0,"op":48366,"op_in_bytes":3168,"op_out_bytes":198000640,"op_latency":{"avgcount":48366,"sum":71.4412},"op_r":48340,"op_r_out_bytes":198000640,"op_r_latency":{"avgcount":48340,"sum":71.1109},"op_w":26,"op_w_in_bytes":3168,"op_w_rlat":{"avgcount":26,"sum":0.034785},"op_w_latency":{"avgcount":26,"sum":0.3303},"op_rw":0,"op_rw_in_bytes":0,"op_rw_out_bytes":0,"op_rw_rlat":{"avgcount":0,"sum":0},"op_rw_latency":{"avgcount":0,"sum":0},"subop":18,"subop_in_bytes":2281,"subop_latency":{"avgcount":18,"sum":0.011883},"subop_w":0,"subop_w_in_bytes":2281,"subop_w_latency":{"avgcount":18,"sum":0.011883},"subop_pull":0,"subop_pull_latency":{"avgcount":0,"sum":0},"subop_push":0,"subop_push_in_bytes":0,"subop_push_latency":{"avgcount":0,"sum":0},"pull":0,"push":0,"push_out_bytes":0,"push_in":0,"push_in_bytes":0,"recovery_ops":0,"loadavg":0.1,"buffer_bytes":0,"numpg":408,"numpg_primary":189,"numpg_replica":219,"numpg_stray":0,"heartbeat_to_peers":10,"heartbeat_from_peers":0,"map_messages":195,"map_message_epochs":231,"map_message_epoch_dups":194},"throttle-filestore_bytes":{"val":0,"max":104857600,"get":0,"get_sum":0,"get_or_fail_fail":0,"get_or_fail_success":0,"take":2847,"take_sum":10502288,"put":1523,"put_sum":10502288,"wait":{"avgcount":0,"sum":0}},"throttle-filestore_ops":{"val":0,"max":500,"get":0,"get_sum":0,"get_or_fail_fail":0,"get_or_fail_success":0,"take":2847,"take_sum":2847,"put":1523,"put_sum":2847,"wait":{"avgcount":0,"sum":0}},"throttle-msgr_dispatch_throttler-client":{"val":0,"max":104857600,"get":67047,"get_sum":10334526,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":67047,"put_sum":10334526,"wait":{"avgcount":0,"sum":0}},"throttle-msgr_dispatch_throttler-cluster":{"val":0,"max":104857600,"get":1880,"get_sum":1556536,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":1880,"put_sum":1556536,"wait":{"avgcount":0,"sum":0}},"throttle-msgr_dispatch_throttler-hbclient":{"val":0,"max":104857600,"get":49046,"get_sum":2305162,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":49046,"put_sum":2305162,"wait":{"avgcount":0,"sum":0}},"throttle-msgr_dispatch_throttler-hbserver":{"val":0,"max":104857600,"get":48858,"get_sum":2296326,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":48858,"put_sum":2296326,"wait":{"avgcount":0,"sum":0}},"throttle-osd_client_bytes":{"val":0,"max":524288000,"get":66603,"get_sum":10236339,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":66605,"put_sum":10236339,"wait":{"avgcount":0,"sum":0}}}


----- Mail original ----- 

De: "Alexandre DERUMIER" <aderumier@odiso.com> 
À: "Sage Weil" <sage@inktank.com> 
Cc: "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 31 Octobre 2012 17:29:28 
Objet: Re: slow fio random read benchmark, need help 

>>Have you tried increasing the iodepth? 
Yes, I have try with 100 and 200, same results. 

I have also try directly from the host, with /dev/rbd1, and I have same result. 
I have also try with 3 differents hosts, with differents cpus models. 

(note: I can reach around 40.000 iops with same fio config on a zfs iscsi array) 

My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok. 


Do you have an idea if I can trace something ? 

Thanks, 

Alexandre 

----- Mail original ----- 

De: "Sage Weil" <sage@inktank.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 31 Octobre 2012 16:57:05 
Objet: Re: slow fio random read benchmark, need help 

On Wed, 31 Oct 2012, Alexandre DERUMIER wrote: 
> Hello, 
> 
> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster) 
> 
> 
> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1 
> 
> 
> I can't get more than 5000 iops. 

Have you tried increasing the iodepth? 

sage 

> 
> 
> RBD cluster is : 
> --------------- 
> 3 nodes,with each node : 
> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ 
> rbd 0.53 
> 
> ceph.conf 
> 
> journal dio = false 
> filestore fiemap = false 
> filestore flusher = false 
> osd op threads = 24 
> osd disk threads = 24 
> filestore op threads = 6 
> 
> kvm host is : 4 x 12 cores opteron 
> ------------ 
> 
> 
> During the bench: 
> 
> on ceph nodes: 
> - cpu is around 10% used 
> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer) 
> 
> 
> on kvm host: 
> 
> -cpu is around 20% used 
> 
> 
> I really don't see where is the bottleneck.... 
> 
> Any Ideas, hints ? 
> 
> 
> Regards, 
> 
> Alexandre 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
> 


* Re: slow fio random read benchmark, need help
  2012-10-31 16:29       ` Alexandre DERUMIER
  2012-10-31 16:50         ` Alexandre DERUMIER
@ 2012-10-31 17:08         ` Marcus Sorensen
  2012-10-31 17:27           ` Alexandre DERUMIER
  1 sibling, 1 reply; 41+ messages in thread
From: Marcus Sorensen @ 2012-10-31 17:08 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Sage Weil, ceph-devel

5000 is actually really good, if you ask me. Assuming everything is
connected via gigabit. If you get 40k iops locally, you add the
latency of tcp, as well as that of the ceph services and VM layer, and
that's what you get. On my network I get about a .1ms round trip on
gigabit over the same switch, which by definition can only do 10,000
iops. Then if you have storage on the other end capable of 40k iops,
you add the latencies together (.1ms + .025ms) and you're at 8k iops.
Then add the small latency of the application servicing the io (NFS,
Ceph, etc), and the latency introduced by your VM layer, and 5k sounds
about right.
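
Spelling that arithmetic out (rough numbers, assuming one outstanding request
at a time per stream):

	effective latency ~= 0.100 ms network round trip + 0.025 ms storage service time
	                  ~= 0.125 ms per io
	max iops          ~= 1 / 0.000125 s ~= 8,000

In principle more parallelism (higher iodepth, more jobs, more clients) is
what lets you climb back above that per-stream limit.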

The good news is that you probably aren't taxing the storage, you can
likely do many simultaneous tests from several VMs and get the same
results.

You can try adding --numjobs to your fio to parallelize the specific
test you're doing, or launching a second VM and doing the same test at
the same time. This would be a good indicator if it's latency.

On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
>>>Have you tried increasing the iodepth?
> Yes, I have try with 100 and 200, same results.
>
> I have also try directly from the host, with /dev/rbd1, and I have same result.
> I have also try with 3 differents hosts, with differents cpus models.
>
> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array)
>
> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.
>
>
> Do you have an idea if I can trace something ?
>
> Thanks,
>
> Alexandre
>
> ----- Mail original -----
>
> De: "Sage Weil" <sage@inktank.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 16:57:05
> Objet: Re: slow fio random read benchmark, need help
>
> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>> Hello,
>>
>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
>>
>>
>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>
>>
>> I can't get more than 5000 iops.
>
> Have you tried increasing the iodepth?
>
> sage
>
>>
>>
>> RBD cluster is :
>> ---------------
>> 3 nodes,with each node :
>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
>> rbd 0.53
>>
>> ceph.conf
>>
>> journal dio = false
>> filestore fiemap = false
>> filestore flusher = false
>> osd op threads = 24
>> osd disk threads = 24
>> filestore op threads = 6
>>
>> kvm host is : 4 x 12 cores opteron
>> ------------
>>
>>
>> During the bench:
>>
>> on ceph nodes:
>> - cpu is around 10% used
>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
>>
>>
>> on kvm host:
>>
>> -cpu is around 20% used
>>
>>
>> I really don't see where is the bottleneck....
>>
>> Any Ideas, hints ?
>>
>>
>> Regards,
>>
>> Alexandre
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: slow fio random read benchmark, need help
  2012-10-31 17:08         ` Marcus Sorensen
@ 2012-10-31 17:27           ` Alexandre DERUMIER
  2012-10-31 17:38             ` Marcus Sorensen
  2012-11-01  7:38             ` Dietmar Maurer
  0 siblings, 2 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-10-31 17:27 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Sage Weil, ceph-devel

Thanks Marcus, 

indeed gigabit ethernet.

note that my iscsi results (40k) were with multipath, so multiple gigabit links.

I have also done tests with a netapp array, with nfs over a single link, where I'm around 13000 iops

I will do more tests with multiples vms, from differents hosts, and with --numjobs.

I'll keep you in touch,

Thanks for help,

Regards,

Alexandre


----- Mail original ----- 

De: "Marcus Sorensen" <shadowsor@gmail.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 31 Octobre 2012 18:08:11 
Objet: Re: slow fio random read benchmark, need help 

5000 is actually really good, if you ask me. Assuming everything is 
connected via gigabit. If you get 40k iops locally, you add the 
latency of tcp, as well as that of the ceph services and VM layer, and 
that's what you get. On my network I get about a .1ms round trip on 
gigabit over the same switch, which by definition can only do 10,000 
iops. Then if you have storage on the other end capable of 40k iops, 
you add the latencies together (.1ms + .025ms) and you're at 8k iops. 
Then add the small latency of the application servicing the io (NFS, 
Ceph, etc), and the latency introduced by your VM layer, and 5k sounds 
about right. 

The good news is that you probably aren't taxing the storage, you can 
likely do many simultaneous tests from several VMs and get the same 
results. 

You can try adding --numjobs to your fio to parallelize the specific 
test you're doing, or launching a second VM and doing the same test at 
the same time. This would be a good indicator if it's latency. 

On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
>>>Have you tried increasing the iodepth? 
> Yes, I have try with 100 and 200, same results. 
> 
> I have also try directly from the host, with /dev/rbd1, and I have same result. 
> I have also try with 3 differents hosts, with differents cpus models. 
> 
> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array) 
> 
> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok. 
> 
> 
> Do you have an idea if I can trace something ? 
> 
> Thanks, 
> 
> Alexandre 
> 
> ----- Mail original ----- 
> 
> De: "Sage Weil" <sage@inktank.com> 
> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mercredi 31 Octobre 2012 16:57:05 
> Objet: Re: slow fio random read benchmark, need help 
> 
> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote: 
>> Hello, 
>> 
>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster) 
>> 
>> 
>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1 
>> 
>> 
>> I can't get more than 5000 iops. 
> 
> Have you tried increasing the iodepth? 
> 
> sage 
> 
>> 
>> 
>> RBD cluster is : 
>> --------------- 
>> 3 nodes,with each node : 
>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ 
>> rbd 0.53 
>> 
>> ceph.conf 
>> 
>> journal dio = false 
>> filestore fiemap = false 
>> filestore flusher = false 
>> osd op threads = 24 
>> osd disk threads = 24 
>> filestore op threads = 6 
>> 
>> kvm host is : 4 x 12 cores opteron 
>> ------------ 
>> 
>> 
>> During the bench: 
>> 
>> on ceph nodes: 
>> - cpu is around 10% used 
>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer) 
>> 
>> 
>> on kvm host: 
>> 
>> -cpu is around 20% used 
>> 
>> 
>> I really don't see where is the bottleneck.... 
>> 
>> Any Ideas, hints ? 
>> 
>> 
>> Regards, 
>> 
>> Alexandre 
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
>> the body of a message to majordomo@vger.kernel.org 
>> More majordomo info at http://vger.kernel.org/majordomo-info.html 
>> 
>> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 


* Re: slow fio random read benchmark, need help
  2012-10-31 17:27           ` Alexandre DERUMIER
@ 2012-10-31 17:38             ` Marcus Sorensen
  2012-10-31 18:56               ` Alexandre DERUMIER
  2012-11-01  7:38             ` Dietmar Maurer
  1 sibling, 1 reply; 41+ messages in thread
From: Marcus Sorensen @ 2012-10-31 17:38 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Sage Weil, ceph-devel

Yes, I was going to say that the most I've ever seen out of gigabit is
about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
may not really parallelize the io for you. It can send an io down one
path, then move to the next path and send the next io without
necessarily waiting for the previous one to respond, but it only
shaves a slight amount from your latency under some scenarios as
opposed to sending down all paths simultaneously. I have seen it help
with high latency links.

I don't remember the Ceph design that well, but with distributed
storage systems you're going to pay a penalty. If you can do 10-15k
with one TCP round trip, you'll get half that with the round trip to
talk to the metadata server to find your blocks and then to fetch
them. Like I said, that might not be exactly what Ceph does, but
you're going to have more traffic than just a straight single attached
NFS or iscsi server.

On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> Thanks Marcus,
>
> indeed gigabit ethernet.
>
> note that my iscsi results  (40k)was with multipath, so multiple gigabit links.
>
> I have also done tests with a netapp array, with nfs, single link, I'm around 13000 iops
>
> I will do more tests with multiples vms, from differents hosts, and with --numjobs.
>
> I'll keep you in touch,
>
> Thanks for help,
>
> Regards,
>
> Alexandre
>
>
> ----- Mail original -----
>
> De: "Marcus Sorensen" <shadowsor@gmail.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 18:08:11
> Objet: Re: slow fio random read benchmark, need help
>
> 5000 is actually really good, if you ask me. Assuming everything is
> connected via gigabit. If you get 40k iops locally, you add the
> latency of tcp, as well as that of the ceph services and VM layer, and
> that's what you get. On my network I get about a .1ms round trip on
> gigabit over the same switch, which by definition can only do 10,000
> iops. Then if you have storage on the other end capable of 40k iops,
> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
> Then add the small latency of the application servicing the io (NFS,
> Ceph, etc), and the latency introduced by your VM layer, and 5k sounds
> about right.
>
> The good news is that you probably aren't taxing the storage, you can
> likely do many simultaneous tests from several VMs and get the same
> results.
>
> You can try adding --numjobs to your fio to parallelize the specific
> test you're doing, or launching a second VM and doing the same test at
> the same time. This would be a good indicator if it's latency.
>
> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>>>>Have you tried increasing the iodepth?
>> Yes, I have try with 100 and 200, same results.
>>
>> I have also try directly from the host, with /dev/rbd1, and I have same result.
>> I have also try with 3 differents hosts, with differents cpus models.
>>
>> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array)
>>
>> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.
>>
>>
>> Do you have an idea if I can trace something ?
>>
>> Thanks,
>>
>> Alexandre
>>
>> ----- Mail original -----
>>
>> De: "Sage Weil" <sage@inktank.com>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Mercredi 31 Octobre 2012 16:57:05
>> Objet: Re: slow fio random read benchmark, need help
>>
>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>> Hello,
>>>
>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
>>>
>>>
>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>
>>>
>>> I can't get more than 5000 iops.
>>
>> Have you tried increasing the iodepth?
>>
>> sage
>>
>>>
>>>
>>> RBD cluster is :
>>> ---------------
>>> 3 nodes,with each node :
>>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
>>> rbd 0.53
>>>
>>> ceph.conf
>>>
>>> journal dio = false
>>> filestore fiemap = false
>>> filestore flusher = false
>>> osd op threads = 24
>>> osd disk threads = 24
>>> filestore op threads = 6
>>>
>>> kvm host is : 4 x 12 cores opteron
>>> ------------
>>>
>>>
>>> During the bench:
>>>
>>> on ceph nodes:
>>> - cpu is around 10% used
>>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
>>>
>>>
>>> on kvm host:
>>>
>>> -cpu is around 20% used
>>>
>>>
>>> I really don't see where is the bottleneck....
>>>
>>> Any Ideas, hints ?
>>>
>>>
>>> Regards,
>>>
>>> Alexandre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-10-31 17:38             ` Marcus Sorensen
@ 2012-10-31 18:56               ` Alexandre DERUMIER
  2012-10-31 19:50                 ` Marcus Sorensen
  2012-10-31 20:22                 ` Josh Durgin
  0 siblings, 2 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-10-31 18:56 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Sage Weil, ceph-devel

Yes, I think you are right; the round trip with the mon must cut the performance by half.

I have just done a test with 2 parallel fio benchmarks, from 2 different hosts,
and I get 2 x 5000 iops

so it must be related to network latency.

I have also done tests with --numjobs 1000; it doesn't help, same results.


Do you have an idea how I can get more IO from 1 host ?
Would doing LACP with multiple links help ?

I think that 10-gigabit latency is almost the same, so I'm not sure it will improve iops much.
Maybe InfiniBand can help?
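
For reference, a parallelized variant of the fio command used in this
thread would look something like the following (the job count and
per-job queue depth here are arbitrary placeholders, not a tested
configuration):

    fio --filename=/dev/vdb --rw=randread --bs=4k --size=1000M \
        --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
        --group_reporting --name=parallel-randread

Each job keeps its own queue of outstanding requests, so the total
concurrency is roughly numjobs * iodepth.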

----- Mail original ----- 

De: "Marcus Sorensen" <shadowsor@gmail.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 31 Octobre 2012 18:38:46 
Objet: Re: slow fio random read benchmark, need help 

Yes, I was going to say that the most I've ever seen out of gigabit is 
about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing 
may not really parallelize the io for you. It can send an io down one 
path, then move to the next path and send the next io without 
necessarily waiting for the previous one to respond, but it only 
shaves a slight amount from your latency under some scenarios as 
opposed to sending down all paths simultaneously. I have seen it help 
with high latency links. 

I don't remember the Ceph design that well, but with distributed 
storage systems you're going to pay a penalty. If you can do 10-15k 
with one TCP round trip, you'll get half that with the round trip to 
talk to the metadata server to find your blocks and then to fetch 
them. Like I said, that might not be exactly what Ceph does, but 
you're going to have more traffic than just a straight single attached 
NFS or iscsi server. 

On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
> Thanks Marcus, 
> 
> indeed gigabit ethernet. 
> 
> note that my iscsi results (40k)was with multipath, so multiple gigabit links. 
> 
> I have also done tests with a netapp array, with nfs, single link, I'm around 13000 iops 
> 
> I will do more tests with multiples vms, from differents hosts, and with --numjobs. 
> 
> I'll keep you in touch, 
> 
> Thanks for help, 
> 
> Regards, 
> 
> Alexandre 
> 
> 
> ----- Mail original ----- 
> 
> De: "Marcus Sorensen" <shadowsor@gmail.com> 
> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
> Envoyé: Mercredi 31 Octobre 2012 18:08:11 
> Objet: Re: slow fio random read benchmark, need help 
> 
> 5000 is actually really good, if you ask me. Assuming everything is 
> connected via gigabit. If you get 40k iops locally, you add the 
> latency of tcp, as well as that of the ceph services and VM layer, and 
> that's what you get. On my network I get about a .1ms round trip on 
> gigabit over the same switch, which by definition can only do 10,000 
> iops. Then if you have storage on the other end capable of 40k iops, 
> you add the latencies together (.1ms + .025ms) and you're at 8k iops. 
> Then add the small latency of the application servicing the io (NFS, 
> Ceph, etc), and the latency introduced by your VM layer, and 5k sounds 
> about right. 
> 
> The good news is that you probably aren't taxing the storage, you can 
> likely do many simultaneous tests from several VMs and get the same 
> results. 
> 
> You can try adding --numjobs to your fio to parallelize the specific 
> test you're doing, or launching a second VM and doing the same test at 
> the same time. This would be a good indicator if it's latency. 
> 
> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER 
> <aderumier@odiso.com> wrote: 
>>>>Have you tried increasing the iodepth? 
>> Yes, I have try with 100 and 200, same results. 
>> 
>> I have also try directly from the host, with /dev/rbd1, and I have same result. 
>> I have also try with 3 differents hosts, with differents cpus models. 
>> 
>> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array) 
>> 
>> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok. 
>> 
>> 
>> Do you have an idea if I can trace something ? 
>> 
>> Thanks, 
>> 
>> Alexandre 
>> 
>> ----- Mail original ----- 
>> 
>> De: "Sage Weil" <sage@inktank.com> 
>> À: "Alexandre DERUMIER" <aderumier@odiso.com> 
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org> 
>> Envoyé: Mercredi 31 Octobre 2012 16:57:05 
>> Objet: Re: slow fio random read benchmark, need help 
>> 
>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote: 
>>> Hello, 
>>> 
>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster) 
>>> 
>>> 
>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1 
>>> 
>>> 
>>> I can't get more than 5000 iops. 
>> 
>> Have you tried increasing the iodepth? 
>> 
>> sage 
>> 
>>> 
>>> 
>>> RBD cluster is : 
>>> --------------- 
>>> 3 nodes,with each node : 
>>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
>>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ 
>>> rbd 0.53 
>>> 
>>> ceph.conf 
>>> 
>>> journal dio = false 
>>> filestore fiemap = false 
>>> filestore flusher = false 
>>> osd op threads = 24 
>>> osd disk threads = 24 
>>> filestore op threads = 6 
>>> 
>>> kvm host is : 4 x 12 cores opteron 
>>> ------------ 
>>> 
>>> 
>>> During the bench: 
>>> 
>>> on ceph nodes: 
>>> - cpu is around 10% used 
>>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer) 
>>> 
>>> 
>>> on kvm host: 
>>> 
>>> -cpu is around 20% used 
>>> 
>>> 
>>> I really don't see where is the bottleneck.... 
>>> 
>>> Any Ideas, hints ? 
>>> 
>>> 
>>> Regards, 
>>> 
>>> Alexandre 
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
>>> the body of a message to majordomo@vger.kernel.org 
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html 
>>> 
>>> 
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
>> the body of a message to majordomo@vger.kernel.org 
>> More majordomo info at http://vger.kernel.org/majordomo-info.html 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-10-31 18:56               ` Alexandre DERUMIER
@ 2012-10-31 19:50                 ` Marcus Sorensen
  2012-11-01  5:11                   ` Alexandre DERUMIER
  2012-10-31 20:22                 ` Josh Durgin
  1 sibling, 1 reply; 41+ messages in thread
From: Marcus Sorensen @ 2012-10-31 19:50 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Sage Weil, ceph-devel

Come to think of it that 15k iops I mentioned was on 10G ethernet with
NFS. I have tried infiniband with ipoib and tcp, it's similar to 10G
ethernet.

You will need to get creative. What you're asking for really is to
have local latencies with remote storage. Just off of the top of my
head you may look into some way to do local caching on SSD for your
RBD volume, like bcache or flashcache.

Depending on your application, it may actually be a bonus that no
single server (or handful of servers) can crush your storage's
performance. If you only have one or two clients anyway then that may
not be much consolation, but if you're going to have dozens or more
then there's not much benefit to letting one take all the performance
at the expense of everyone else, except perhaps in bursts.

At any rate, 5000 iops is not as good as a new SSD, but far better
than a normal disk. Is there some specific application requirement, or
is it just that you are feeling like you want the full performance
from the VM?
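
For what it's worth, the latency arithmetic in this thread can be
written out as a tiny model. The 0.05 ms service overhead below is an
assumption for illustration; the other figures are the ones quoted
earlier (gigabit round trip, a backend good for ~40k iops):

    #include <stdio.h>

    /*
     * Per-IO latencies along the path add up, and each serialized IO
     * can complete no faster than their sum.
     */
    int main(void)
    {
            double net_rtt_ms = 0.100;  /* gigabit RTT, same switch */
            double disk_ms    = 0.025;  /* backend good for ~40k iops */
            double extra_ms   = 0.050;  /* assumed OSD + qemu overhead */
            double total_ms   = net_rtt_ms + disk_ms + extra_ms;

            printf("%.3f ms per IO -> ~%.0f serialized iops\n",
                   total_ms, 1000.0 / total_ms);
            return 0;
    }

That prints roughly 5700 serialized iops, which is the same ballpark
as the 5000 being observed.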

On Wed, Oct 31, 2012 at 12:56 PM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> Yes, I think you are right, round trip with mon must cut by half the performance.
>
> I have just done test with 2 parallel fio bench, from 2 differents host,
> I get 2 x 5000 iops
>
> so it must be related to network latency.
>
> I have also done tests with --numjob 1000, it doesn't help, same results.
>
>
> Do you have an idea how I can have more io from 1 host ?
> Doing lacp with multiple links ?
>
> I think that 10gigabit latency is almost same, i'm not sure it will improve iops too much
> Maybe InfiniBand can help?
>
> ----- Mail original -----
>
> De: "Marcus Sorensen" <shadowsor@gmail.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 18:38:46
> Objet: Re: slow fio random read benchmark, need help
>
> Yes, I was going to say that the most I've ever seen out of gigabit is
> about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
> may not really parallelize the io for you. It can send an io down one
> path, then move to the next path and send the next io without
> necessarily waiting for the previous one to respond, but it only
> shaves a slight amount from your latency under some scenarios as
> opposed to sending down all paths simultaneously. I have seen it help
> with high latency links.
>
> I don't remember the Ceph design that well, but with distributed
> storage systems you're going to pay a penalty. If you can do 10-15k
> with one TCP round trip, you'll get half that with the round trip to
> talk to the metadata server to find your blocks and then to fetch
> them. Like I said, that might not be exactly what Ceph does, but
> you're going to have more traffic than just a straight single attached
> NFS or iscsi server.
>
> On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> Thanks Marcus,
>>
>> indeed gigabit ethernet.
>>
>> note that my iscsi results (40k)was with multipath, so multiple gigabit links.
>>
>> I have also done tests with a netapp array, with nfs, single link, I'm around 13000 iops
>>
>> I will do more tests with multiples vms, from differents hosts, and with --numjobs.
>>
>> I'll keep you in touch,
>>
>> Thanks for help,
>>
>> Regards,
>>
>> Alexandre
>>
>>
>> ----- Mail original -----
>>
>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Mercredi 31 Octobre 2012 18:08:11
>> Objet: Re: slow fio random read benchmark, need help
>>
>> 5000 is actually really good, if you ask me. Assuming everything is
>> connected via gigabit. If you get 40k iops locally, you add the
>> latency of tcp, as well as that of the ceph services and VM layer, and
>> that's what you get. On my network I get about a .1ms round trip on
>> gigabit over the same switch, which by definition can only do 10,000
>> iops. Then if you have storage on the other end capable of 40k iops,
>> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
>> Then add the small latency of the application servicing the io (NFS,
>> Ceph, etc), and the latency introduced by your VM layer, and 5k sounds
>> about right.
>>
>> The good news is that you probably aren't taxing the storage, you can
>> likely do many simultaneous tests from several VMs and get the same
>> results.
>>
>> You can try adding --numjobs to your fio to parallelize the specific
>> test you're doing, or launching a second VM and doing the same test at
>> the same time. This would be a good indicator if it's latency.
>>
>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>>>>Have you tried increasing the iodepth?
>>> Yes, I have try with 100 and 200, same results.
>>>
>>> I have also try directly from the host, with /dev/rbd1, and I have same result.
>>> I have also try with 3 differents hosts, with differents cpus models.
>>>
>>> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array)
>>>
>>> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.
>>>
>>>
>>> Do you have an idea if I can trace something ?
>>>
>>> Thanks,
>>>
>>> Alexandre
>>>
>>> ----- Mail original -----
>>>
>>> De: "Sage Weil" <sage@inktank.com>
>>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Envoyé: Mercredi 31 Octobre 2012 16:57:05
>>> Objet: Re: slow fio random read benchmark, need help
>>>
>>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>>> Hello,
>>>>
>>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
>>>>
>>>>
>>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>>
>>>>
>>>> I can't get more than 5000 iops.
>>>
>>> Have you tried increasing the iodepth?
>>>
>>> sage
>>>
>>>>
>>>>
>>>> RBD cluster is :
>>>> ---------------
>>>> 3 nodes,with each node :
>>>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>>>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
>>>> rbd 0.53
>>>>
>>>> ceph.conf
>>>>
>>>> journal dio = false
>>>> filestore fiemap = false
>>>> filestore flusher = false
>>>> osd op threads = 24
>>>> osd disk threads = 24
>>>> filestore op threads = 6
>>>>
>>>> kvm host is : 4 x 12 cores opteron
>>>> ------------
>>>>
>>>>
>>>> During the bench:
>>>>
>>>> on ceph nodes:
>>>> - cpu is around 10% used
>>>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
>>>>
>>>>
>>>> on kvm host:
>>>>
>>>> -cpu is around 20% used
>>>>
>>>>
>>>> I really don't see where is the bottleneck....
>>>>
>>>> Any Ideas, hints ?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Alexandre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-10-31 18:56               ` Alexandre DERUMIER
  2012-10-31 19:50                 ` Marcus Sorensen
@ 2012-10-31 20:22                 ` Josh Durgin
  1 sibling, 0 replies; 41+ messages in thread
From: Josh Durgin @ 2012-10-31 20:22 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Marcus Sorensen, Sage Weil, ceph-devel

On 10/31/2012 11:56 AM, Alexandre DERUMIER wrote:
> Yes, I think you are right, round trip with mon must cut by half the performance.

I just want to note that the monitors aren't in the data path.
The client knows how to reach the osds and which osds to talk to based
on the osdmap. This is updated asynchronously from the client's
perspective.

> I have just done test with 2 parallel fio bench, from 2 differents host,
> I get 2 x 5000 iops

It'd be interesting to try smaller rbd objects (rbd create --order 12
...) to rule out contention in the OSD for particular objects.
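
For reference, a full command along those lines might be (image name
and size made up):

    rbd create --size 1024 --order 12 testimg

where order 12 means 2^12 = 4 KiB objects instead of the default
order 22 (4 MiB), so a 4K random read workload is spread across many
more distinct objects.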

Josh

> so it must be related to network latency.
>
> I have also done tests with --numjob 1000, it doesn't help, same results.
>
>
> Do you have an idea how I can have more io from 1 host ?
> Doing lacp with multiple links ?
>
> I think that 10gigabit latency is almost same, i'm not sure it will improve iops too much
> Maybe InfiniBand can help?
>
> ----- Mail original -----
>
> De: "Marcus Sorensen" <shadowsor@gmail.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 18:38:46
> Objet: Re: slow fio random read benchmark, need help
>
> Yes, I was going to say that the most I've ever seen out of gigabit is
> about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
> may not really parallelize the io for you. It can send an io down one
> path, then move to the next path and send the next io without
> necessarily waiting for the previous one to respond, but it only
> shaves a slight amount from your latency under some scenarios as
> opposed to sending down all paths simultaneously. I have seen it help
> with high latency links.
>
> I don't remember the Ceph design that well, but with distributed
> storage systems you're going to pay a penalty. If you can do 10-15k
> with one TCP round trip, you'll get half that with the round trip to
> talk to the metadata server to find your blocks and then to fetch
> them. Like I said, that might not be exactly what Ceph does, but
> you're going to have more traffic than just a straight single attached
> NFS or iscsi server.
>
> On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> Thanks Marcus,
>>
>> indeed gigabit ethernet.
>>
>> note that my iscsi results (40k)was with multipath, so multiple gigabit links.
>>
>> I have also done tests with a netapp array, with nfs, single link, I'm around 13000 iops
>>
>> I will do more tests with multiples vms, from differents hosts, and with --numjobs.
>>
>> I'll keep you in touch,
>>
>> Thanks for help,
>>
>> Regards,
>>
>> Alexandre
>>
>>
>> ----- Mail original -----
>>
>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Mercredi 31 Octobre 2012 18:08:11
>> Objet: Re: slow fio random read benchmark, need help
>>
>> 5000 is actually really good, if you ask me. Assuming everything is
>> connected via gigabit. If you get 40k iops locally, you add the
>> latency of tcp, as well as that of the ceph services and VM layer, and
>> that's what you get. On my network I get about a .1ms round trip on
>> gigabit over the same switch, which by definition can only do 10,000
>> iops. Then if you have storage on the other end capable of 40k iops,
>> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
>> Then add the small latency of the application servicing the io (NFS,
>> Ceph, etc), and the latency introduced by your VM layer, and 5k sounds
>> about right.
>>
>> The good news is that you probably aren't taxing the storage, you can
>> likely do many simultaneous tests from several VMs and get the same
>> results.
>>
>> You can try adding --numjobs to your fio to parallelize the specific
>> test you're doing, or launching a second VM and doing the same test at
>> the same time. This would be a good indicator if it's latency.
>>
>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>>>> Have you tried increasing the iodepth?
>>> Yes, I have try with 100 and 200, same results.
>>>
>>> I have also try directly from the host, with /dev/rbd1, and I have same result.
>>> I have also try with 3 differents hosts, with differents cpus models.
>>>
>>> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array)
>>>
>>> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.
>>>
>>>
>>> Do you have an idea if I can trace something ?
>>>
>>> Thanks,
>>>
>>> Alexandre
>>>
>>> ----- Mail original -----
>>>
>>> De: "Sage Weil" <sage@inktank.com>
>>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Envoyé: Mercredi 31 Octobre 2012 16:57:05
>>> Objet: Re: slow fio random read benchmark, need help
>>>
>>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>>> Hello,
>>>>
>>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
>>>>
>>>>
>>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>>
>>>>
>>>> I can't get more than 5000 iops.
>>>
>>> Have you tried increasing the iodepth?
>>>
>>> sage
>>>
>>>>
>>>>
>>>> RBD cluster is :
>>>> ---------------
>>>> 3 nodes,with each node :
>>>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>>>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
>>>> rbd 0.53
>>>>
>>>> ceph.conf
>>>>
>>>> journal dio = false
>>>> filestore fiemap = false
>>>> filestore flusher = false
>>>> osd op threads = 24
>>>> osd disk threads = 24
>>>> filestore op threads = 6
>>>>
>>>> kvm host is : 4 x 12 cores opteron
>>>> ------------
>>>>
>>>>
>>>> During the bench:
>>>>
>>>> on ceph nodes:
>>>> - cpu is around 10% used
>>>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
>>>>
>>>>
>>>> on kvm host:
>>>>
>>>> -cpu is around 20% used
>>>>
>>>>
>>>> I really don't see where is the bottleneck....
>>>>
>>>> Any Ideas, hints ?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Alexandre


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] rbd: skip getting image id if known
  2012-10-31  1:49 ` [PATCH 1/6] rbd: skip getting image id if known Alex Elder
@ 2012-10-31 21:05   ` Josh Durgin
  0 siblings, 0 replies; 41+ messages in thread
From: Josh Durgin @ 2012-10-31 21:05 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 10/30/2012 06:49 PM, Alex Elder wrote:
> We will know the image id for format 2 parent images, but won't
> initially know its image name.  Avoid making the query for an image
> id in rbd_dev_image_id() if it's already known.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |    8 ++++++++
>   1 file changed, 8 insertions(+)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 8d26c0f..a852133 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -3068,6 +3068,14 @@ static int rbd_dev_image_id(struct rbd_device
> *rbd_dev)
>   	void *p;
>
>   	/*
> +	 * When probing a parent image, the image id is already
> +	 * known (and the image name likely is not).  There's no
> +	 * need to fetch the image id again in this case.
> +	 */
> +	if (rbd_dev->spec->image_id)
> +		return 0;
> +
> +	/*
>   	 * First, see if the format 2 image id file exists, and if
>   	 * so, get the image's persistent id from it.
>   	 */
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] rbd: allow null image name
  2012-10-31  1:49 ` [PATCH 2/6] rbd: allow null image name Alex Elder
@ 2012-10-31 21:07   ` Josh Durgin
  0 siblings, 0 replies; 41+ messages in thread
From: Josh Durgin @ 2012-10-31 21:07 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 10/30/2012 06:49 PM, Alex Elder wrote:
> Format 2 parent images are partially identified by their image id,
> but it may not be possible to determine their image name.  The name
> is not strictly needed for correct operation, so we won't be
> treating it as an error if we don't know it.  Handle this case
> gracefully in rbd_name_show().
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |    5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index a852133..28052ff 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1982,7 +1982,10 @@ static ssize_t rbd_name_show(struct device *dev,
>   {
>   	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
>
> -	return sprintf(buf, "%s\n", rbd_dev->spec->image_name);
> +	if (rbd_dev->spec->image_name)
> +		return sprintf(buf, "%s\n", rbd_dev->spec->image_name);
> +
> +	return sprintf(buf, "(unknown)\n");
>   }
>
>   static ssize_t rbd_image_id_show(struct device *dev,
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] rbd: get parent spec for version 2 images
  2012-10-31  1:49 ` [PATCH 3/6] rbd: get parent spec for version 2 images Alex Elder
@ 2012-11-01  1:33   ` Josh Durgin
  0 siblings, 0 replies; 41+ messages in thread
From: Josh Durgin @ 2012-11-01  1:33 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 10/30/2012 06:49 PM, Alex Elder wrote:
> Add support for getting the the information identifying the parent
> image for rbd images that have them.  The child image holds a
> reference to its parent image specification structure.  Create a new
> entry "parent" in /sys/bus/rbd/image/N/ to report the identifying
> information for the parent image, if any.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   Documentation/ABI/testing/sysfs-bus-rbd |    4 +
>   drivers/block/rbd.c                     |  131
> +++++++++++++++++++++++++++++++
>   include/linux/ceph/rados.h              |    2 +
>   3 files changed, 137 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-rbd
> b/Documentation/ABI/testing/sysfs-bus-rbd
> index 1cf2adf..cd9213c 100644
> --- a/Documentation/ABI/testing/sysfs-bus-rbd
> +++ b/Documentation/ABI/testing/sysfs-bus-rbd
> @@ -70,6 +70,10 @@ snap_*
>
>   	A directory per each snapshot
>
> +parent
> +
> +	Information identifying the pool, image, and snapshot id for
> +	the parent image in a layered rbd image (format 2 only).
>
>   Entries under /sys/bus/rbd/devices/<dev-id>/snap_<snap-name>
>   -------------------------------------------------------------
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 28052ff..bce1fcf 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -217,6 +217,9 @@ struct rbd_device {
>   	struct ceph_osd_event   *watch_event;
>   	struct ceph_osd_request *watch_request;
>
> +	struct rbd_spec		*parent_spec;
> +	u64			parent_overlap;
> +
>   	/* protects updating the header */
>   	struct rw_semaphore     header_rwsem;
>
> @@ -2009,6 +2012,49 @@ static ssize_t rbd_snap_show(struct device *dev,
>   	return sprintf(buf, "%s\n", rbd_dev->spec->snap_name);
>   }
>
> +/*
> + * For an rbd v2 image, shows the pool id, image id, and snapshot id
> + * for the parent image.  If there is no parent, simply shows
> + * "(no parent image)".
> + */
> +static ssize_t rbd_parent_show(struct device *dev,
> +			     struct device_attribute *attr,
> +			     char *buf)
> +{
> +	struct rbd_device *rbd_dev = dev_to_rbd_dev(dev);
> +	struct rbd_spec *spec = rbd_dev->parent_spec;
> +	int count;
> +	char *bufp = buf;
> +
> +	if (!spec)
> +		return sprintf(buf, "(no parent image)\n");
> +
> +	count = sprintf(bufp, "pool_id %llu\npool_name %s\n",
> +			(unsigned long long) spec->pool_id, spec->pool_name);
> +	if (count < 0)
> +		return count;
> +	bufp += count;
> +
> +	count = sprintf(bufp, "image_id %s\nimage_name %s\n", spec->image_id,
> +			spec->image_name ? spec->image_name : "(unknown)");
> +	if (count < 0)
> +		return count;
> +	bufp += count;
> +
> +	count = sprintf(bufp, "snap_id %llu\nsnap_name %s\n",
> +			(unsigned long long) spec->snap_id, spec->snap_name);
> +	if (count < 0)
> +		return count;
> +	bufp += count;
> +
> +	count = sprintf(bufp, "overlap %llu\n", rbd_dev->parent_overlap);
> +	if (count < 0)
> +		return count;
> +	bufp += count;
> +
> +	return (ssize_t) (bufp - buf);
> +}
> +
>   static ssize_t rbd_image_refresh(struct device *dev,
>   				 struct device_attribute *attr,
>   				 const char *buf,
> @@ -2032,6 +2078,7 @@ static DEVICE_ATTR(name, S_IRUGO, rbd_name_show,
> NULL);
>   static DEVICE_ATTR(image_id, S_IRUGO, rbd_image_id_show, NULL);
>   static DEVICE_ATTR(refresh, S_IWUSR, NULL, rbd_image_refresh);
>   static DEVICE_ATTR(current_snap, S_IRUGO, rbd_snap_show, NULL);
> +static DEVICE_ATTR(parent, S_IRUGO, rbd_parent_show, NULL);
>
>   static struct attribute *rbd_attrs[] = {
>   	&dev_attr_size.attr,
> @@ -2043,6 +2090,7 @@ static struct attribute *rbd_attrs[] = {
>   	&dev_attr_name.attr,
>   	&dev_attr_image_id.attr,
>   	&dev_attr_current_snap.attr,
> +	&dev_attr_parent.attr,
>   	&dev_attr_refresh.attr,
>   	NULL
>   };
> @@ -2192,6 +2240,7 @@ struct rbd_device *rbd_dev_create(struct
> rbd_client *rbdc,
>
>   static void rbd_dev_destroy(struct rbd_device *rbd_dev)
>   {
> +	rbd_spec_put(rbd_dev->parent_spec);
>   	kfree(rbd_dev->header_name);
>   	rbd_put_client(rbd_dev->rbd_client);
>   	rbd_spec_put(rbd_dev->spec);
> @@ -2400,6 +2449,71 @@ static int rbd_dev_v2_features(struct rbd_device
> *rbd_dev)
>   						&rbd_dev->header.features);
>   }
>
> +static int rbd_dev_v2_parent_info(struct rbd_device *rbd_dev)
> +{
> +	struct rbd_spec *parent_spec;
> +	size_t size;
> +	void *reply_buf = NULL;
> +	__le64 snapid;
> +	void *p;
> +	void *end;
> +	char *image_id;
> +	u64 overlap;
> +	size_t len = 0;
> +	int ret;
> +
> +	parent_spec = rbd_spec_alloc();
> +	if (!parent_spec)
> +		return -ENOMEM;
> +
> +	size = sizeof (__le64) +				/* pool_id */
> +		sizeof (__le32) + RBD_IMAGE_ID_LEN_MAX +	/* image_id */
> +		sizeof (__le64) +				/* snap_id */
> +		sizeof (__le64);				/* overlap */
> +	reply_buf = kmalloc(size, GFP_KERNEL);
> +	if (!reply_buf) {
> +		ret = -ENOMEM;
> +		goto out_err;
> +	}
> +
> +	snapid = cpu_to_le64(CEPH_NOSNAP);
> +	ret = rbd_req_sync_exec(rbd_dev, rbd_dev->header_name,
> +				"rbd", "get_parent",
> +				(char *) &snapid, sizeof (snapid),
> +				(char *) reply_buf, size,
> +				CEPH_OSD_FLAG_READ, NULL);
> +	dout("%s: rbd_req_sync_exec returned %d\n", __func__, ret);
> +	if (ret < 0)
> +		goto out_err;
> +
> +	ret = -ERANGE;
> +	p = reply_buf;
> +	end = (char *) reply_buf + size;
> +	ceph_decode_64_safe(&p, end, parent_spec->pool_id, out_err);
> +	if (parent_spec->pool_id == CEPH_NOPOOL)
> +		goto out;	/* No parent?  No problem. */
> +
> +	image_id = ceph_extract_encoded_string(&p, end, &len, GFP_KERNEL);
> +	if (IS_ERR(image_id)) {
> +		ret = PTR_ERR(image_id);
> +		goto out_err;
> +	}
> +	parent_spec->image_id = image_id;
> +	ceph_decode_64_safe(&p, end, parent_spec->snap_id, out_err);
> +	ceph_decode_64_safe(&p, end, overlap, out_err);
> +
> +	rbd_dev->parent_overlap = overlap;
> +	rbd_dev->parent_spec = parent_spec;
> +	parent_spec = NULL;	/* rbd_dev now owns this */
> +out:
> +	ret = 0;
> +out_err:
> +	kfree(reply_buf);
> +	rbd_spec_put(parent_spec);
> +
> +	return ret;
> +}
> +
>   static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev, u64 *ver)
>   {
>   	size_t size;
> @@ -3154,6 +3268,12 @@ static int rbd_dev_v1_probe(struct rbd_device
> *rbd_dev)
>   	ret = rbd_read_header(rbd_dev, &rbd_dev->header);
>   	if (ret < 0)
>   		goto out_err;
> +
> +	/* Version 1 images have no parent (no layering) */
> +
> +	rbd_dev->parent_spec = NULL;
> +	rbd_dev->parent_overlap = 0;
> +
>   	rbd_dev->image_format = 1;
>
>   	dout("discovered version 1 image, header name is %s\n",
> @@ -3205,6 +3325,14 @@ static int rbd_dev_v2_probe(struct rbd_device
> *rbd_dev)
>   	if (ret < 0)
>   		goto out_err;
>
> +	/* If the image supports layering, get the parent info */
> +
> +	if (rbd_dev->header.features & RBD_FEATURE_LAYERING) {
> +		ret = rbd_dev_v2_parent_info(rbd_dev);
> +		if (ret < 0)
> +			goto out_err;
> +	}
> +
>   	/* crypto and compression type aren't (yet) supported for v2 images */
>
>   	rbd_dev->header.crypt_type = 0;
> @@ -3224,6 +3352,9 @@ static int rbd_dev_v2_probe(struct rbd_device
> *rbd_dev)
>
>   	return 0;
>   out_err:
> +	rbd_dev->parent_overlap = 0;
> +	rbd_spec_put(rbd_dev->parent_spec);
> +	rbd_dev->parent_spec = NULL;
>   	kfree(rbd_dev->header_name);
>   	rbd_dev->header_name = NULL;
>   	kfree(rbd_dev->header.object_prefix);
> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
> index 0a99099..15077db 100644
> --- a/include/linux/ceph/rados.h
> +++ b/include/linux/ceph/rados.h
> @@ -87,6 +87,8 @@ struct ceph_pg {
>    *
>    *  lpgp_num -- as above.
>    */
> +#define CEPH_NOPOOL  ((__u64) (-1))  /* pool id not defined */
> +
>   #define CEPH_PG_TYPE_REP     1
>   #define CEPH_PG_TYPE_RAID4   2
>   #define CEPH_PG_POOL_VERSION 2
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] libceph: define ceph_pg_pool_name_by_id()
  2012-10-31  1:49 ` [PATCH 4/6] libceph: define ceph_pg_pool_name_by_id() Alex Elder
@ 2012-11-01  1:34   ` Josh Durgin
  0 siblings, 0 replies; 41+ messages in thread
From: Josh Durgin @ 2012-11-01  1:34 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 10/30/2012 06:49 PM, Alex Elder wrote:
> Define and export function ceph_pg_pool_name_by_id() to supply
> the name of a pg pool whose id is given.  This will be used by
> the next patch.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   include/linux/ceph/osdmap.h |    1 +
>   net/ceph/osdmap.c           |   16 ++++++++++++++++
>   2 files changed, 17 insertions(+)
>
> diff --git a/include/linux/ceph/osdmap.h b/include/linux/ceph/osdmap.h
> index e88a620..5ea57ba 100644
> --- a/include/linux/ceph/osdmap.h
> +++ b/include/linux/ceph/osdmap.h
> @@ -123,6 +123,7 @@ extern int ceph_calc_pg_acting(struct ceph_osdmap
> *osdmap, struct ceph_pg pgid,
>   extern int ceph_calc_pg_primary(struct ceph_osdmap *osdmap,
>   				struct ceph_pg pgid);
>
> +extern const char *ceph_pg_pool_name_by_id(struct ceph_osdmap *map, u64
> id);
>   extern int ceph_pg_poolid_by_name(struct ceph_osdmap *map, const char
> *name);
>
>   #endif
> diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
> index f552aa4..de73214 100644
> --- a/net/ceph/osdmap.c
> +++ b/net/ceph/osdmap.c
> @@ -469,6 +469,22 @@ static struct ceph_pg_pool_info
> *__lookup_pg_pool(struct rb_root *root, int id)
>   	return NULL;
>   }
>
> +const char *ceph_pg_pool_name_by_id(struct ceph_osdmap *map, u64 id)
> +{
> +	struct ceph_pg_pool_info *pi;
> +
> +	if (id == CEPH_NOPOOL)
> +		return NULL;
> +
> +	if (WARN_ON_ONCE(id > (u64) INT_MAX))
> +		return NULL;
> +
> +	pi = __lookup_pg_pool(&map->pg_pools, (int) id);
> +
> +	return pi ? pi->name : NULL;
> +}
> +EXPORT_SYMBOL(ceph_pg_pool_name_by_id);
> +
>   int ceph_pg_poolid_by_name(struct ceph_osdmap *map, const char *name)
>   {
>   	struct rb_node *rbp;
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] rbd: get additional info in parent spec
  2012-10-31  1:49 ` [PATCH 5/6] rbd: get additional info in parent spec Alex Elder
  2012-10-31 14:11   ` Alex Elder
@ 2012-11-01  1:49   ` Josh Durgin
  2012-11-01 12:18     ` Alex Elder
  1 sibling, 1 reply; 41+ messages in thread
From: Josh Durgin @ 2012-11-01  1:49 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

I know you've got a queue of these already, but here's another:
rbd_dev_probe_update_spec() could definitely use some warnings
to distinguish its error cases.
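
As a hypothetical sketch (not an actual patch), something along these
lines in rbd_dev_probe_update_spec() would make it obvious which
lookup produced a given -EIO:

	name = ceph_pg_pool_name_by_id(osdc->osdmap, rbd_dev->spec->pool_id);
	if (!name) {
		pr_warning(RBD_DRV_NAME "%d no pool with id %llu\n",
			rbd_dev->major,
			(unsigned long long) rbd_dev->spec->pool_id);
		return -EIO;
	}
	...
	name = rbd_snap_name(rbd_dev, rbd_dev->spec->snap_id);
	if (!name) {
		pr_warning(RBD_DRV_NAME "%d no snapshot with id %llu\n",
			rbd_dev->major,
			(unsigned long long) rbd_dev->spec->snap_id);
		ret = -EIO;
		goto out_err;
	}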

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

On 10/30/2012 06:49 PM, Alex Elder wrote:
> When a layered rbd image has a parent, that parent is identified
> only by its pool id, image id, and snapshot id.  Images that have
> been mapped also record *names* for those three id's.
>
> Add code to look up these names for parent images so they match
> mapped images more closely.  Skip doing this for an image if it
> already has its pool name defined (this will be the case for images
> mapped by the user).
>
> It is possible that an the name of a parent image can't be
> determined, even if the image id is valid.  If this occurs it
> does not preclude correct operation, so don't treat this as
> an error.
>
> On the other hand, defined pools will always have both an id and a
> name.   And any snapshot of an image identified as a parent for a
> clone image will exist, and will have a name (if not it indicates
> some other internal error).  So treat failure to get these bits
> of information as errors.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |  131
> +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 131 insertions(+)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index bce1fcf..04062c1 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -70,7 +70,10 @@
>
>   #define RBD_SNAP_HEAD_NAME	"-"
>
> +/* This allows a single page to hold an image name sent by OSD */
> +#define RBD_IMAGE_NAME_LEN_MAX	(PAGE_SIZE - sizeof (__le32) - 1)
>   #define RBD_IMAGE_ID_LEN_MAX	64
> +
>   #define RBD_OBJ_PREFIX_LEN_MAX	64
>
>   /* Feature bits */
> @@ -658,6 +661,20 @@ out_err:
>   	return -ENOMEM;
>   }
>
> +static const char *rbd_snap_name(struct rbd_device *rbd_dev, u64 snap_id)
> +{
> +	struct rbd_snap *snap;
> +
> +	if (snap_id == CEPH_NOSNAP)
> +		return RBD_SNAP_HEAD_NAME;
> +
> +	list_for_each_entry(snap, &rbd_dev->snaps, node)
> +		if (snap_id == snap->id)
> +			return snap->name;
> +
> +	return NULL;
> +}
> +
>   static int snap_by_name(struct rbd_device *rbd_dev, const char *snap_name)
>   {
>
> @@ -2499,6 +2516,7 @@ static int rbd_dev_v2_parent_info(struct
> rbd_device *rbd_dev)
>   		goto out_err;
>   	}
>   	parent_spec->image_id = image_id;
> +	parent_spec->image_id_len = len;
>   	ceph_decode_64_safe(&p, end, parent_spec->snap_id, out_err);
>   	ceph_decode_64_safe(&p, end, overlap, out_err);
>
> @@ -2514,6 +2532,115 @@ out_err:
>   	return ret;
>   }
>
> +static char *rbd_dev_image_name(struct rbd_device *rbd_dev)
> +{
> +	size_t image_id_size;
> +	char *image_id;
> +	void *p;
> +	void *end;
> +	size_t size;
> +	void *reply_buf = NULL;
> +	size_t len = 0;
> +	char *image_name = NULL;
> +	int ret;
> +
> +	rbd_assert(!rbd_dev->spec->image_name);
> +
> +	image_id_size = sizeof (__le32) + rbd_dev->spec->image_id_len;
> +	image_id = kmalloc(image_id_size, GFP_KERNEL);
> +	if (!image_id)
> +		return NULL;
> +
> +	p = image_id;
> +	end = (char *) image_id + image_id_size;
> +	ceph_encode_string(&p, end, rbd_dev->spec->image_id,
> +				(u32) rbd_dev->spec->image_id_len);
> +
> +	size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX;
> +	reply_buf = kmalloc(size, GFP_KERNEL);
> +	if (!reply_buf)
> +		goto out;
> +
> +	ret = rbd_req_sync_exec(rbd_dev, RBD_DIRECTORY,
> +				"rbd", "dir_get_name",
> +				image_id, image_id_size,
> +				(char *) reply_buf, size,
> +				CEPH_OSD_FLAG_READ, NULL);
> +	if (ret < 0)
> +		goto out;
> +	p = reply_buf;
> +	end = (char *) reply_buf + size;
> +	image_name = ceph_extract_encoded_string(&p, end, &len, GFP_KERNEL);
> +	if (image_name)
> +		dout("%s: name is %s len is %zd\n", __func__, image_name, len);
> +out:
> +	kfree(reply_buf);
> +	kfree(image_id);
> +
> +	return image_name;
> +}
> +
> +/*
> + * When a parent image gets probed, we only have the pool, image,
> + * and snapshot ids but not the names of any of them.  This call
> + * is made later to fill in those names.  It has to be done after
> + * rbd_dev_snaps_update() has completed because some of the
> + * information (in particular, snapshot name) is not available
> + * until then.
> + */
> +static int rbd_dev_probe_update_spec(struct rbd_device *rbd_dev)
> +{
> +	struct ceph_osd_client *osdc;
> +	const char *name;
> +	void *reply_buf = NULL;
> +	int ret;
> +
> +	if (rbd_dev->spec->pool_name)
> +		return 0;	/* Already have the names */
> +
> +	/* Look up the pool name */
> +
> +	osdc = &rbd_dev->rbd_client->client->osdc;
> +	name = ceph_pg_pool_name_by_id(osdc->osdmap, rbd_dev->spec->pool_id);
> +	if (!name)
> +		return -EIO;	/* pool id too large (>= 2^31) */
> +
> +	rbd_dev->spec->pool_name = kstrdup(name, GFP_KERNEL);
> +	if (!rbd_dev->spec->pool_name)
> +		return -ENOMEM;
> +
> +	/* Fetch the image name; tolerate failure here */
> +
> +	name = rbd_dev_image_name(rbd_dev);
> +	if (name) {
> +		rbd_dev->spec->image_name_len = strlen(name);
> +		rbd_dev->spec->image_name = (char *) name;
> +	} else {
> +		pr_warning(RBD_DRV_NAME "%d "
> +			"unable to get image name for image id %s\n",
> +			rbd_dev->major, rbd_dev->spec->image_id);
> +	}
> +
> +	/* Look up the snapshot name. */
> +
> +	name = rbd_snap_name(rbd_dev, rbd_dev->spec->snap_id);
> +	if (!name) {
> +		ret = -EIO;
> +		goto out_err;
> +	}
> +	rbd_dev->spec->snap_name = kstrdup(name, GFP_KERNEL);
> +	if(!rbd_dev->spec->snap_name)
> +		goto out_err;
> +
> +	return 0;
> +out_err:
> +	kfree(reply_buf);
> +	kfree(rbd_dev->spec->pool_name);
> +	rbd_dev->spec->pool_name = NULL;
> +
> +	return ret;
> +}
> +
>   static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev, u64 *ver)
>   {
>   	size_t size;
> @@ -3372,6 +3499,10 @@ static int rbd_dev_probe_finish(struct rbd_device
> *rbd_dev)
>   	if (ret)
>   		return ret;
>
> +	ret = rbd_dev_probe_update_spec(rbd_dev);
> +	if (ret)
> +		goto err_out_snaps;
> +
>   	ret = rbd_dev_set_mapping(rbd_dev);
>   	if (ret)
>   		goto err_out_snaps;
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] rbd: probe the parent of an image if present
  2012-10-31  1:50 ` [PATCH 6/6] rbd: probe the parent of an image if present Alex Elder
  2012-10-31 11:59   ` slow fio random read benchmark, need help Alexandre DERUMIER
@ 2012-11-01  2:07   ` Josh Durgin
  2012-11-01 12:26     ` Alex Elder
  1 sibling, 1 reply; 41+ messages in thread
From: Josh Durgin @ 2012-11-01  2:07 UTC (permalink / raw)
  To: Alex Elder; +Cc: ceph-devel

This all makes sense, but it reminds me of another issue we'll need to
address:

http://www.tracker.newdream.net/issues/2533

We don't need to watch the header of a parent snapshot, since it's
immutable and guaranteed not to be deleted out from under us.
This avoids the bug referenced above. So I guess rbd_dev_probe{_finish}
can take a parameter telling them whether to watch the header or not.

We should check whether multiple mapped rbds (without layering) hit
this issue as well, and if so, default to not sharing the ceph_client
until the bug is fixed.
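
A rough sketch of that parameter (hypothetical, not an actual patch):

	static int rbd_dev_probe(struct rbd_device *rbd_dev, bool watch);

	/* the mapped image itself: watch its header object for changes */
	ret = rbd_dev_probe(rbd_dev, true);
	...
	/* a parent is always a snapshot, hence immutable: skip the watch */
	ret = rbd_dev_probe(parent, false);

with rbd_dev_probe_finish() only registering the watch when asked to.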

On 10/30/2012 06:50 PM, Alex Elder wrote:
> Call the probe function for the parent device.
>
> Signed-off-by: Alex Elder <elder@inktank.com>
> ---
>   drivers/block/rbd.c |   79
> +++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 76 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 04062c1..8ef13f72 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -222,6 +222,7 @@ struct rbd_device {
>
>   	struct rbd_spec		*parent_spec;
>   	u64			parent_overlap;
> +	struct rbd_device	*parent;
>
>   	/* protects updating the header */
>   	struct rw_semaphore     header_rwsem;
> @@ -255,6 +256,7 @@ static ssize_t rbd_add(struct bus_type *bus, const
> char *buf,
>   		       size_t count);
>   static ssize_t rbd_remove(struct bus_type *bus, const char *buf,
>   			  size_t count);
> +static int rbd_dev_probe(struct rbd_device *rbd_dev);
>
>   static struct bus_attribute rbd_bus_attrs[] = {
>   	__ATTR(add, S_IWUSR, NULL, rbd_add),
> @@ -378,6 +380,13 @@ out_opt:
>   	return ERR_PTR(ret);
>   }
>
> +static struct rbd_client *__rbd_get_client(struct rbd_client *rbdc)
> +{
> +	kref_get(&rbdc->kref);
> +
> +	return rbdc;
> +}
> +
>   /*
>    * Find a ceph client with specific addr and configuration.  If
>    * found, bump its reference count.
> @@ -393,7 +402,8 @@ static struct rbd_client *rbd_client_find(struct
> ceph_options *ceph_opts)
>   	spin_lock(&rbd_client_list_lock);
>   	list_for_each_entry(client_node, &rbd_client_list, node) {
>   		if (!ceph_compare_options(ceph_opts, client_node->client)) {
> -			kref_get(&client_node->kref);
> +			__rbd_get_client(client_node);
> +
>   			found = true;
>   			break;
>   		}
> @@ -3311,6 +3321,11 @@ static int rbd_dev_image_id(struct rbd_device
> *rbd_dev)
>   	void *response;
>   	void *p;
>
> +	/* If we already have it we don't need to look it up */
> +
> +	if (rbd_dev->spec->image_id)
> +		return 0;
> +
>   	/*
>   	 * When probing a parent image, the image id is already
>   	 * known (and the image name likely is not).  There's no
> @@ -3492,6 +3507,9 @@ out_err:
>
>   static int rbd_dev_probe_finish(struct rbd_device *rbd_dev)
>   {
> +	struct rbd_device *parent = NULL;
> +	struct rbd_spec *parent_spec = NULL;
> +	struct rbd_client *rbdc = NULL;
>   	int ret;
>
>   	/* no need to lock here, as rbd_dev is not registered yet */
> @@ -3536,6 +3554,31 @@ static int rbd_dev_probe_finish(struct rbd_device
> *rbd_dev)
>   	 * At this point cleanup in the event of an error is the job
>   	 * of the sysfs code (initiated by rbd_bus_del_dev()).
>   	 */
> +	/* Probe the parent if there is one */
> +
> +	if (rbd_dev->parent_spec) {
> +		/*
> +		 * We need to pass a reference to the client and the
> +		 * parent spec when creating the parent rbd_dev.
> +		 * Images related by parent/child relationships
> +		 * always share both.
> +		 */
> +		parent_spec = rbd_spec_get(rbd_dev->parent_spec);
> +		rbdc = __rbd_get_client(rbd_dev->rbd_client);
> +
> +		parent = rbd_dev_create(rbdc, parent_spec);
> +		if (!parent) {
> +			ret = -ENOMEM;
> +			goto err_out_spec;
> +		}
> +		rbdc = NULL;		/* parent now owns reference */
> +		parent_spec = NULL;	/* parent now owns reference */
> +		ret = rbd_dev_probe(parent);
> +		if (ret < 0)
> +			goto err_out_parent;
> +		rbd_dev->parent = parent;
> +	}
> +
>   	down_write(&rbd_dev->header_rwsem);
>   	ret = rbd_dev_snaps_register(rbd_dev);
>   	up_write(&rbd_dev->header_rwsem);
> @@ -3554,6 +3597,12 @@ static int rbd_dev_probe_finish(struct rbd_device
> *rbd_dev)
>   		(unsigned long long) rbd_dev->mapping.size);
>
>   	return ret;
> +
> +err_out_parent:
> +	rbd_dev_destroy(parent);
> +err_out_spec:
> +	rbd_spec_put(parent_spec);
> +	rbd_put_client(rbdc);
>   err_out_bus:
>   	/* this will also clean up rest of rbd_dev stuff */
>
> @@ -3717,6 +3766,12 @@ static void rbd_dev_release(struct device *dev)
>   	module_put(THIS_MODULE);
>   }
>
> +static void __rbd_remove(struct rbd_device *rbd_dev)
> +{
> +	rbd_remove_all_snaps(rbd_dev);
> +	rbd_bus_del_dev(rbd_dev);
> +}
> +
>   static ssize_t rbd_remove(struct bus_type *bus,
>   			  const char *buf,
>   			  size_t count)
> @@ -3743,8 +3798,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>   		goto done;
>   	}
>
> -	rbd_remove_all_snaps(rbd_dev);
> -	rbd_bus_del_dev(rbd_dev);
> +	while (rbd_dev->parent_spec) {
> +		struct rbd_device *first = rbd_dev;
> +		struct rbd_device *second = first->parent;
> +		struct rbd_device *third;
> +
> +		/*
> +		 * Follow to the parent with no grandparent and
> +		 * remove it.
> +		 */
> +		while (second && (third = second->parent)) {
> +			first = second;
> +			second = third;
> +		}
> +		__rbd_remove(second);
> +		rbd_spec_put(first->parent_spec);
> +		first->parent_spec = NULL;
> +		first->parent_overlap = 0;
> +		first->parent = NULL;
> +	}
> +	__rbd_remove(rbd_dev);
>
>   done:
>   	mutex_unlock(&ctl_mutex);
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-10-31 19:50                 ` Marcus Sorensen
@ 2012-11-01  5:11                   ` Alexandre DERUMIER
  2012-11-01  5:41                     ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-11-01  5:11 UTC (permalink / raw)
  To: Marcus Sorensen; +Cc: Sage Weil, ceph-devel

>>Come to think of it that 15k iops I mentioned was on 10G ethernet with
>>NFS. I have tried infiniband with ipoib and tcp, it's similar to 10G
>>ethernet.

I have seen new Arista 10GbE switches with latency around 1 microsecond; that seems pretty good for the job.



>>You will need to get creative. What you're asking for really is to
>>have local latencies with remote storage. Just off of the top of my
>>head you may look into some way to do local caching on SSD for your
>>RBD volume, like bcache or flashcache.
I have already thought about it. (But I would like to use qemu-rbd if possible.)


>>At any rate, 5000 iops is not as good as a new SSD, but far better
>>than a normal disk. Is there some specific application requirement, or
>>is it just that you are feeling like you want the full performance
>>from the VM?

I have some customers with huge databases (too big to be handled in the buffer cache) that require a lot of IOs (around 10K).

I have redone the tests with 4 guests in parallel and I get 4 x 5000 iops, so it seems to scale! (And CPU usage is very low on the ceph cluster.)


So I'll try some tricks, like RAID over multiple rbd devices; maybe it'll help.
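
A purely hypothetical illustration of that trick, striping across
several mapped images with md (device names, count and chunk size are
made up):

    mdadm --create /dev/md0 --level=0 --chunk=64 --raid-devices=4 \
          /dev/rbd1 /dev/rbd2 /dev/rbd3 /dev/rbd4

so that a single client keeps requests in flight against several rbd
queues at once.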

Thanks again for the help, Marcus; I was not aware of these latency problems.

Regards,

Alexandre


----- Mail original -----

De: "Marcus Sorensen" <shadowsor@gmail.com>
À: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Envoyé: Mercredi 31 Octobre 2012 20:50:36
Objet: Re: slow fio random read benchmark, need help

Come to think of it that 15k iops I mentioned was on 10G ethernet with
NFS. I have tried infiniband with ipoib and tcp, it's similar to 10G
ethernet.

You will need to get creative. What you're asking for really is to
have local latencies with remote storage. Just off of the top of my
head you may look into some way to do local caching on SSD for your
RBD volume, like bcache or flashcache.

Depending on your application, it may actually be a bonus that no
single server (or handful of servers) can crush your storage's
performance. if you only have one or two clients anyway then that may
not be much consolation, but if you're going to have dozens or more
then there's not much benefit to having one take all performance at
the expense of everyone, except for perhaps in bursts.

At any rate, 5000 iops is not as good as a new SSD, but far better
than a normal disk. Is there some specific application requirement, or
is it just that you are feeling like you want the full performance
from the VM?

On Wed, Oct 31, 2012 at 12:56 PM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> Yes, I think you are right, round trip with mon must cut by half the performance.
>
> I have just done test with 2 parallel fio bench, from 2 differents host, 
> I get 2 x 5000 iops
>
> so it must be related to network latency.
>
> I have also done tests with --numjob 1000, it doesn't help, same results.
>
>
> Do you have an idea how I can have more io from 1 host ?
> Doing lacp with multiple links ?
>
> I think that 10gigabit latency is almost same, i'm not sure it will improve iops too much
> Maybe InfiniBand can help?
>
> ----- Mail original -----
>
> De: "Marcus Sorensen" <shadowsor@gmail.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 18:38:46
> Objet: Re: slow fio random read benchmark, need help
>
> Yes, I was going to say that the most I've ever seen out of gigabit is
> about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
> may not really parallelize the io for you. It can send an io down one
> path, then move to the next path and send the next io without
> necessarily waiting for the previous one to respond, but it only
> shaves a slight amount from your latency under some scenarios as
> opposed to sending down all paths simultaneously. I have seen it help
> with high latency links.
>
> I don't remember the Ceph design that well, but with distributed
> storage systems you're going to pay a penalty. If you can do 10-15k
> with one TCP round trip, you'll get half that with the round trip to
> talk to the metadata server to find your blocks and then to fetch
> them. Like I said, that might not be exactly what Ceph does, but
> you're going to have more traffic than just a straight single attached
> NFS or iscsi server.
>
> On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> Thanks Marcus,
>>
>> indeed gigabit ethernet.
>>
>> note that my iscsi results (40k)was with multipath, so multiple gigabit links.
>>
>> I have also done tests with a netapp array, with nfs, single link, I'm around 13000 iops
>>
>> I will do more tests with multiples vms, from differents hosts, and with --numjobs.
>>
>> I'll keep you in touch,
>>
>> Thanks for help,
>>
>> Regards,
>>
>> Alexandre
>>
>>
>> ----- Mail original -----
>>
>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Mercredi 31 Octobre 2012 18:08:11
>> Objet: Re: slow fio random read benchmark, need help
>>
>> 5000 is actually really good, if you ask me. Assuming everything is
>> connected via gigabit. If you get 40k iops locally, you add the
>> latency of tcp, as well as that of the ceph services and VM layer, and
>> that's what you get. On my network I get about a .1ms round trip on
>> gigabit over the same switch, which by definition can only do 10,000
>> iops. Then if you have storage on the other end capable of 40k iops,
>> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
>> Then add the small latency of the application servicing the io (NFS,
>> Ceph, etc), and the latency introduced by your VM layer, and 5k sounds
>> about right.
>>
>> The good news is that you probably aren't taxing the storage, you can
>> likely do many simultaneous tests from several VMs and get the same
>> results.
>>
>> You can try adding --numjobs to your fio to parallelize the specific
>> test you're doing, or launching a second VM and doing the same test at
>> the same time. This would be a good indicator if it's latency.
>>
>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>>>>Have you tried increasing the iodepth?
>>> Yes, I have try with 100 and 200, same results.
>>>
>>> I have also try directly from the host, with /dev/rbd1, and I have same result.
>>> I have also try with 3 differents hosts, with differents cpus models.
>>>
>>> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array)
>>>
>>> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.
>>>
>>>
>>> Do you have an idea if I can trace something ?
>>>
>>> Thanks,
>>>
>>> Alexandre
>>>
>>> ----- Mail original -----
>>>
>>> De: "Sage Weil" <sage@inktank.com>
>>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Envoyé: Mercredi 31 Octobre 2012 16:57:05
>>> Objet: Re: slow fio random read benchmark, need help
>>>
>>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>>> Hello,
>>>>
>>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
>>>>
>>>>
>>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>>
>>>>
>>>> I can't get more than 5000 iops.
>>>
>>> Have you tried increasing the iodepth?
>>>
>>> sage
>>>
>>>>
>>>>
>>>> RBD cluster is :
>>>> ---------------
>>>> 3 nodes,with each node :
>>>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>>>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
>>>> rbd 0.53
>>>>
>>>> ceph.conf
>>>>
>>>> journal dio = false
>>>> filestore fiemap = false
>>>> filestore flusher = false
>>>> osd op threads = 24
>>>> osd disk threads = 24
>>>> filestore op threads = 6
>>>>
>>>> kvm host is : 4 x 12 cores opteron
>>>> ------------
>>>>
>>>>
>>>> During the bench:
>>>>
>>>> on ceph nodes:
>>>> - cpu is around 10% used
>>>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
>>>>
>>>>
>>>> on kvm host:
>>>>
>>>> -cpu is around 20% used
>>>>
>>>>
>>>> I really don't see where is the bottleneck....
>>>>
>>>> Any Ideas, hints ?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Alexandre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01  5:11                   ` Alexandre DERUMIER
@ 2012-11-01  5:41                     ` Stefan Priebe - Profihost AG
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-01  5:41 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Marcus Sorensen, Sage Weil, ceph-devel

On 01.11.2012 at 06:11, Alexandre DERUMIER <aderumier@odiso.com> wrote:

>>> Come to think of it that 15k iops I mentioned was on 10G ethernet with
>>> NFS. I have tried infiniband with ipoib and tcp, it's similar to 10G
>>> ethernet.
> 
> I have seen new Arista 10GbE switches with latency around 1 microsecond, which seems pretty good for the job.

Pretty interesting. How can I measure switch / network latency?

Stefan
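
(For a rough number, a plain ping between two hosts on the same switch already shows the round-trip time each synchronous IO has to pay; the target address below is just a placeholder:)

  # 100 echo requests, summary only; the "rtt min/avg/max/mdev" line gives the RTT
  ping -c 100 -q 10.0.0.2

(This only measures ICMP round trips, not the full request/response path through the storage stack, so treat it as a lower bound on per-IO latency.)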

> 
> 
> 
>>> You will need to get creative. What you're asking for really is to
>>> have local latencies with remote storage. Just off of the top of my
>>> head you may look into some way to do local caching on SSD for your
>>> RBD volume, like bcache or flashcache.
> I have already thought about it. (But I would like to use qemu-rbd if possible.)
> 
> 
>>> At any rate, 5000 iops is not as good as a new SSD, but far better
>>> than a normal disk. Is there some specific application requirement, or
>>> is it just that you are feeling like you want the full performance
>>> from the VM?
> 
> I have some customers with huge databases (too big to be handled in the buffer cache) which require a lot of ios (around 10K).
> 
> I have redone the tests with 4 guests in parallel and I get 4 x 5000 iops, so it seems to scale! (And cpu usage is very low on the ceph cluster.)
> 
> 
> So I'll try some tricks, like raid over multiple rbd devices; maybe it'll help.
> 
> Thanks again for the help Marcus, I was not aware of these latency problems.
> 
> Regards,
> 
> Alexandre
> 
> 
> ----- Mail original -----
> 
> De: "Marcus Sorensen" <shadowsor@gmail.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 20:50:36
> Objet: Re: slow fio random read benchmark, need help
> 
> Come to think of it that 15k iops I mentioned was on 10G ethernet with
> NFS. I have tried infiniband with ipoib and tcp, it's similar to 10G
> ethernet.
> 
> You will need to get creative. What you're asking for really is to
> have local latencies with remote storage. Just off of the top of my
> head you may look into some way to do local caching on SSD for your
> RBD volume, like bcache or flashcache.
> 
> Depending on your application, it may actually be a bonus that no
> single server (or handful of servers) can crush your storage's
> performance. if you only have one or two clients anyway then that may
> not be much consolation, but if you're going to have dozens or more
> then there's not much benefit to having one take all performance at
> the expense of everyone, except for perhaps in bursts.
> 
> At any rate, 5000 iops is not as good as a new SSD, but far better
> than a normal disk. Is there some specific application requirement, or
> is it just that you are feeling like you want the full performance
> from the VM?
> 
> On Wed, Oct 31, 2012 at 12:56 PM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> Yes, I think you are right, round trip with mon must cut by half the performance.
>> 
>> I have just done test with 2 parallel fio bench, from 2 differents host, 
>> I get 2 x 5000 iops
>> 
>> so it must be related to network latency.
>> 
>> I have also done tests with --numjob 1000, it doesn't help, same results.
>> 
>> 
>> Do you have an idea how I can have more io from 1 host ?
>> Doing lacp with multiple links ?
>> 
>> I think that 10gigabit latency is almost same, i'm not sure it will improve iops too much
>> Maybe InfiniBand can help?
>> 
>> ----- Mail original -----
>> 
>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>> Envoyé: Mercredi 31 Octobre 2012 18:38:46
>> Objet: Re: slow fio random read benchmark, need help
>> 
>> Yes, I was going to say that the most I've ever seen out of gigabit is
>> about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
>> may not really parallelize the io for you. It can send an io down one
>> path, then move to the next path and send the next io without
>> necessarily waiting for the previous one to respond, but it only
>> shaves a slight amount from your latency under some scenarios as
>> opposed to sending down all paths simultaneously. I have seen it help
>> with high latency links.
>> 
>> I don't remember the Ceph design that well, but with distributed
>> storage systems you're going to pay a penalty. If you can do 10-15k
>> with one TCP round trip, you'll get half that with the round trip to
>> talk to the metadata server to find your blocks and then to fetch
>> them. Like I said, that might not be exactly what Ceph does, but
>> you're going to have more traffic than just a straight single attached
>> NFS or iscsi server.
>> 
>> On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>> Thanks Marcus,
>>> 
>>> indeed gigabit ethernet.
>>> 
>>> note that my iscsi results (40k)was with multipath, so multiple gigabit links.
>>> 
>>> I have also done tests with a netapp array, with nfs, single link, I'm around 13000 iops
>>> 
>>> I will do more tests with multiples vms, from differents hosts, and with --numjobs.
>>> 
>>> I'll keep you in touch,
>>> 
>>> Thanks for help,
>>> 
>>> Regards,
>>> 
>>> Alexandre
>>> 
>>> 
>>> ----- Mail original -----
>>> 
>>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
>>> Envoyé: Mercredi 31 Octobre 2012 18:08:11
>>> Objet: Re: slow fio random read benchmark, need help
>>> 
>>> 5000 is actually really good, if you ask me. Assuming everything is
>>> connected via gigabit. If you get 40k iops locally, you add the
>>> latency of tcp, as well as that of the ceph services and VM layer, and
>>> that's what you get. On my network I get about a .1ms round trip on
>>> gigabit over the same switch, which by definition can only do 10,000
>>> iops. Then if you have storage on the other end capable of 40k iops,
>>> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
>>> Then add the small latency of the application servicing the io (NFS,
>>> Ceph, etc), and the latency introduced by your VM layer, and 5k sounds
>>> about right.
>>> 
>>> The good news is that you probably aren't taxing the storage, you can
>>> likely do many simultaneous tests from several VMs and get the same
>>> results.
>>> 
>>> You can try adding --numjobs to your fio to parallelize the specific
>>> test you're doing, or launching a second VM and doing the same test at
>>> the same time. This would be a good indicator if it's latency.
>>> 
>>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>>> <aderumier@odiso.com> wrote:
>>>>>> Have you tried increasing the iodepth?
>>>> Yes, I have try with 100 and 200, same results.
>>>> 
>>>> I have also try directly from the host, with /dev/rbd1, and I have same result.
>>>> I have also try with 3 differents hosts, with differents cpus models.
>>>> 
>>>> (note: I can reach around 40.000 iops with same fio config on a zfs iscsi array)
>>>> 
>>>> My test ceph cluster nodes cpus are old (xeon E5420), but they are around 10% usage, so I think it's ok.
>>>> 
>>>> 
>>>> Do you have an idea if I can trace something ?
>>>> 
>>>> Thanks,
>>>> 
>>>> Alexandre
>>>> 
>>>> ----- Mail original -----
>>>> 
>>>> De: "Sage Weil" <sage@inktank.com>
>>>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>>>> Envoyé: Mercredi 31 Octobre 2012 16:57:05
>>>> Objet: Re: slow fio random read benchmark, need help
>>>> 
>>>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>>>> Hello,
>>>>> 
>>>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster)
>>>>> 
>>>>> 
>>>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>>> 
>>>>> 
>>>>> I can't get more than 5000 iops.
>>>> 
>>>> Have you tried increasing the iodepth?
>>>> 
>>>> sage
>>>> 
>>>>> 
>>>>> 
>>>>> RBD cluster is :
>>>>> ---------------
>>>>> 3 nodes,with each node :
>>>>> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>>>>> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ
>>>>> rbd 0.53
>>>>> 
>>>>> ceph.conf
>>>>> 
>>>>> journal dio = false
>>>>> filestore fiemap = false
>>>>> filestore flusher = false
>>>>> osd op threads = 24
>>>>> osd disk threads = 24
>>>>> filestore op threads = 6
>>>>> 
>>>>> kvm host is : 4 x 12 cores opteron
>>>>> ------------
>>>>> 
>>>>> 
>>>>> During the bench:
>>>>> 
>>>>> on ceph nodes:
>>>>> - cpu is around 10% used
>>>>> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer)
>>>>> 
>>>>> 
>>>>> on kvm host:
>>>>> 
>>>>> -cpu is around 20% used
>>>>> 
>>>>> 
>>>>> I really don't see where is the bottleneck....
>>>>> 
>>>>> Any Ideas, hints ?
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Alexandre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> 
>>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: slow fio random read benchmark, need help
  2012-10-31 17:27           ` Alexandre DERUMIER
  2012-10-31 17:38             ` Marcus Sorensen
@ 2012-11-01  7:38             ` Dietmar Maurer
  2012-11-01  8:08               ` Stefan Priebe - Profihost AG
  2012-11-01 10:40               ` Gregory Farnum
  1 sibling, 2 replies; 41+ messages in thread
From: Dietmar Maurer @ 2012-11-01  7:38 UTC (permalink / raw)
  To: Alexandre DERUMIER, Marcus Sorensen; +Cc: Sage Weil, ceph-devel

I do not really understand that network latency argument.

If one can get 40K iops with iSCSI, why can't I get the same with rados/ceph?

Note: network latency is the same in both cases

What do I miss?

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER
> Sent: Mittwoch, 31. Oktober 2012 18:27
> To: Marcus Sorensen
> Cc: Sage Weil; ceph-devel
> Subject: Re: slow fio random read benchmark, need help
> 
> Thanks Marcus,
> 
> indeed gigabit ethernet.
> 
> note that my iscsi results  (40k)was with multipath, so multiple gigabit links.
> 
> I have also done tests with a netapp array, with nfs, single link, I'm around
> 13000 iops
> 
> I will do more tests with multiples vms, from differents hosts, and with --
> numjobs.
> 
> I'll keep you in touch,
> 
> Thanks for help,
> 
> Regards,
> 
> Alexandre
> 
> 
> ----- Mail original -----
> 
> De: "Marcus Sorensen" <shadowsor@gmail.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-
> devel@vger.kernel.org>
> Envoyé: Mercredi 31 Octobre 2012 18:08:11
> Objet: Re: slow fio random read benchmark, need help
> 
> 5000 is actually really good, if you ask me. Assuming everything is connected
> via gigabit. If you get 40k iops locally, you add the latency of tcp, as well as
> that of the ceph services and VM layer, and that's what you get. On my
> network I get about a .1ms round trip on gigabit over the same switch, which
> by definition can only do 10,000 iops. Then if you have storage on the other
> end capable of 40k iops, you add the latencies together (.1ms + .025ms) and
> you're at 8k iops.
> Then add the small latency of the application servicing the io (NFS, Ceph, etc),
> and the latency introduced by your VM layer, and 5k sounds about right.
> 
> The good news is that you probably aren't taxing the storage, you can likely
> do many simultaneous tests from several VMs and get the same results.
> 
> You can try adding --numjobs to your fio to parallelize the specific test you're
> doing, or launching a second VM and doing the same test at the same time.
> This would be a good indicator if it's latency.
> 
> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
> >>>Have you tried increasing the iodepth?
> > Yes, I have try with 100 and 200, same results.
> >
> > I have also try directly from the host, with /dev/rbd1, and I have same
> result.
> > I have also try with 3 differents hosts, with differents cpus models.
> >
> > (note: I can reach around 40.000 iops with same fio config on a zfs
> > iscsi array)
> >
> > My test ceph cluster nodes cpus are old (xeon E5420), but they are around
> 10% usage, so I think it's ok.
> >
> >
> > Do you have an idea if I can trace something ?
> >
> > Thanks,
> >
> > Alexandre
> >
> > ----- Mail original -----
> >
> > De: "Sage Weil" <sage@inktank.com>
> > À: "Alexandre DERUMIER" <aderumier@odiso.com>
> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> > Envoyé: Mercredi 31 Octobre 2012 16:57:05
> > Objet: Re: slow fio random read benchmark, need help
> >
> > On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
> >> Hello,
> >>
> >> I'm doing some tests with fio from a qemu 1.2 guest (virtio
> >> disk,cache=none), randread, with 4K block size on a small size of 1G
> >> (so it can be handle by the buffer cache on ceph cluster)
> >>
> >>
> >> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M
> >> --iodepth=40 --group_reporting --name=file1 --ioengine=libaio
> >> --direct=1
> >>
> >>
> >> I can't get more than 5000 iops.
> >
> > Have you tried increasing the iodepth?
> >
> > sage
> >
> >>
> >>
> >> RBD cluster is :
> >> ---------------
> >> 3 nodes,with each node :
> >> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
> >> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ rbd 0.53
> >>
> >> ceph.conf
> >>
> >> journal dio = false
> >> filestore fiemap = false
> >> filestore flusher = false
> >> osd op threads = 24
> >> osd disk threads = 24
> >> filestore op threads = 6
> >>
> >> kvm host is : 4 x 12 cores opteron
> >> ------------
> >>
> >>
> >> During the bench:
> >>
> >> on ceph nodes:
> >> - cpu is around 10% used
> >> - iostat show no disks activity on osds. (so I think that the 1G file
> >> is handle in the linux buffer)
> >>
> >>
> >> on kvm host:
> >>
> >> -cpu is around 20% used
> >>
> >>
> >> I really don't see where is the bottleneck....
> >>
> >> Any Ideas, hints ?
> >>
> >>
> >> Regards,
> >>
> >> Alexandre
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> >>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01  7:38             ` Dietmar Maurer
@ 2012-11-01  8:08               ` Stefan Priebe - Profihost AG
  2012-11-01 10:40               ` Gregory Farnum
  1 sibling, 0 replies; 41+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-01  8:08 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Alexandre DERUMIER, Marcus Sorensen, Sage Weil, ceph-devel

On 01.11.2012 08:38, Dietmar Maurer wrote:
> I do not really understand that network latency argument.
> If one can get 40K iops with iSCSI, why can't I get the same with rados/ceph?
> Note: network latency is the same in both cases
> What do I miss?

Good question. Also, I've seen 20k iops on Ceph with 10GbE, but I'm able 
to get 100,000 iops via iSCSI.

Stefan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01  7:38             ` Dietmar Maurer
  2012-11-01  8:08               ` Stefan Priebe - Profihost AG
@ 2012-11-01 10:40               ` Gregory Farnum
  2012-11-01 10:54                 ` Stefan Priebe - Profihost AG
  2012-11-01 15:46                 ` slow fio random read benchmark, need help Marcus Sorensen
  1 sibling, 2 replies; 41+ messages in thread
From: Gregory Farnum @ 2012-11-01 10:40 UTC (permalink / raw)
  To: Dietmar Maurer, Josh Durgin
  Cc: Alexandre DERUMIER, Marcus Sorensen, Sage Weil, ceph-devel

I'm not sure that latency addition is quite correct. Most use cases
do multiple IOs at the same time, and good benchmarks tend to
reflect that.

I suspect the IO limitations here are a result of QEMU's storage
handling (or possibly our client layer) more than anything else — Josh
can talk about that more than I can, though!
-Greg
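
(For reference, a fio invocation that actually keeps many IOs in flight at once, along the lines discussed in this thread, might look like the sketch below; the device path, depth and job count are only illustrative:)

  fio --filename=/dev/vdb --rw=randread --bs=4K --size=1000M \
      --ioengine=libaio --direct=1 --iodepth=32 --numjobs=4 \
      --group_reporting --name=parallel-randread

(With the libaio engine and --direct=1, each job can keep up to --iodepth requests outstanding, so the total parallelism is roughly numjobs * iodepth instead of one request per network round trip.)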

On Thu, Nov 1, 2012 at 8:38 AM, Dietmar Maurer <dietmar@proxmox.com> wrote:
> I do not really understand that network latency argument.
>
> If one can get 40K iops with iSCSI, why can't I get the same with rados/ceph?
>
> Note: network latency is the same in both cases
>
> What do I miss?
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER
>> Sent: Mittwoch, 31. Oktober 2012 18:27
>> To: Marcus Sorensen
>> Cc: Sage Weil; ceph-devel
>> Subject: Re: slow fio random read benchmark, need help
>>
>> Thanks Marcus,
>>
>> indeed gigabit ethernet.
>>
>> note that my iscsi results  (40k)was with multipath, so multiple gigabit links.
>>
>> I have also done tests with a netapp array, with nfs, single link, I'm around
>> 13000 iops
>>
>> I will do more tests with multiples vms, from differents hosts, and with --
>> numjobs.
>>
>> I'll keep you in touch,
>>
>> Thanks for help,
>>
>> Regards,
>>
>> Alexandre
>>
>>
>> ----- Mail original -----
>>
>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-
>> devel@vger.kernel.org>
>> Envoyé: Mercredi 31 Octobre 2012 18:08:11
>> Objet: Re: slow fio random read benchmark, need help
>>
>> 5000 is actually really good, if you ask me. Assuming everything is connected
>> via gigabit. If you get 40k iops locally, you add the latency of tcp, as well as
>> that of the ceph services and VM layer, and that's what you get. On my
>> network I get about a .1ms round trip on gigabit over the same switch, which
>> by definition can only do 10,000 iops. Then if you have storage on the other
>> end capable of 40k iops, you add the latencies together (.1ms + .025ms) and
>> you're at 8k iops.
>> Then add the small latency of the application servicing the io (NFS, Ceph, etc),
>> and the latency introduced by your VM layer, and 5k sounds about right.
>>
>> The good news is that you probably aren't taxing the storage, you can likely
>> do many simultaneous tests from several VMs and get the same results.
>>
>> You can try adding --numjobs to your fio to parallelize the specific test you're
>> doing, or launching a second VM and doing the same test at the same time.
>> This would be a good indicator if it's latency.
>>
>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>> >>>Have you tried increasing the iodepth?
>> > Yes, I have try with 100 and 200, same results.
>> >
>> > I have also try directly from the host, with /dev/rbd1, and I have same
>> result.
>> > I have also try with 3 differents hosts, with differents cpus models.
>> >
>> > (note: I can reach around 40.000 iops with same fio config on a zfs
>> > iscsi array)
>> >
>> > My test ceph cluster nodes cpus are old (xeon E5420), but they are around
>> 10% usage, so I think it's ok.
>> >
>> >
>> > Do you have an idea if I can trace something ?
>> >
>> > Thanks,
>> >
>> > Alexandre
>> >
>> > ----- Mail original -----
>> >
>> > De: "Sage Weil" <sage@inktank.com>
>> > À: "Alexandre DERUMIER" <aderumier@odiso.com>
>> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>> > Envoyé: Mercredi 31 Octobre 2012 16:57:05
>> > Objet: Re: slow fio random read benchmark, need help
>> >
>> > On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>> >> Hello,
>> >>
>> >> I'm doing some tests with fio from a qemu 1.2 guest (virtio
>> >> disk,cache=none), randread, with 4K block size on a small size of 1G
>> >> (so it can be handle by the buffer cache on ceph cluster)
>> >>
>> >>
>> >> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M
>> >> --iodepth=40 --group_reporting --name=file1 --ioengine=libaio
>> >> --direct=1
>> >>
>> >>
>> >> I can't get more than 5000 iops.
>> >
>> > Have you tried increasing the iodepth?
>> >
>> > sage
>> >
>> >>
>> >>
>> >> RBD cluster is :
>> >> ---------------
>> >> 3 nodes,with each node :
>> >> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>> >> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ rbd 0.53
>> >>
>> >> ceph.conf
>> >>
>> >> journal dio = false
>> >> filestore fiemap = false
>> >> filestore flusher = false
>> >> osd op threads = 24
>> >> osd disk threads = 24
>> >> filestore op threads = 6
>> >>
>> >> kvm host is : 4 x 12 cores opteron
>> >> ------------
>> >>
>> >>
>> >> During the bench:
>> >>
>> >> on ceph nodes:
>> >> - cpu is around 10% used
>> >> - iostat show no disks activity on osds. (so I think that the 1G file
>> >> is handle in the linux buffer)
>> >>
>> >>
>> >> on kvm host:
>> >>
>> >> -cpu is around 20% used
>> >>
>> >>
>> >> I really don't see where is the bottleneck....
>> >>
>> >> Any Ideas, hints ?
>> >>
>> >>
>> >> Regards,
>> >>
>> >> Alexandre
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>> >> info at http://vger.kernel.org/majordomo-info.html
>> >>
>> >>
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> > in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>> > info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01 10:40               ` Gregory Farnum
@ 2012-11-01 10:54                 ` Stefan Priebe - Profihost AG
  2012-11-02  9:38                   ` Alexandre DERUMIER
  2012-11-01 15:46                 ` slow fio random read benchmark, need help Marcus Sorensen
  1 sibling, 1 reply; 41+ messages in thread
From: Stefan Priebe - Profihost AG @ 2012-11-01 10:54 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Dietmar Maurer, Josh Durgin, Alexandre DERUMIER, Marcus Sorensen,
	Sage Weil, ceph-devel

On 01.11.2012 11:40, Gregory Farnum wrote:
> I'm not sure that latency addition is quite correct. Most use cases
> cases do multiple IOs at the same time, and good benchmarks tend to
> reflect that.
>
> I suspect the IO limitations here are a result of QEMU's storage
> handling (or possibly our client layer) more than anything else — Josh
> can talk about that more than I can, though!
> -Greg

Same results with rbd kernel driver without QEMU involved.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] rbd: get additional info in parent spec
  2012-11-01  1:49   ` Josh Durgin
@ 2012-11-01 12:18     ` Alex Elder
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Elder @ 2012-11-01 12:18 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 10/31/2012 08:49 PM, Josh Durgin wrote:
> I know you've got a queue of these already, but here's another:
> rbd_dev_probe_update_spec() could definitely use some warnings
> to distinguish its error cases.
> 
> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>

Finally!  I was going to accuse you of slacking for
having no comments on all of the other reviews up
to this one...

I've got some of these implemented already. I'll finish this one
off as you suggest, along with a few other categories, this morning,
and will get them posted for review.

					-Alex

> 
> On 10/30/2012 06:49 PM, Alex Elder wrote:
>> When a layered rbd image has a parent, that parent is identified
>> only by its pool id, image id, and snapshot id.  Images that have
>> been mapped also record *names* for those three id's.
>>
>> Add code to look up these names for parent images so they match
>> mapped images more closely.  Skip doing this for an image if it
>> already has its pool name defined (this will be the case for images
>> mapped by the user).
>>
>> It is possible that the name of a parent image can't be
>> determined, even if the image id is valid.  If this occurs it
>> does not preclude correct operation, so don't treat this as
>> an error.
>>
>> On the other hand, defined pools will always have both an id and a
>> name.   And any snapshot of an image identified as a parent for a
>> clone image will exist, and will have a name (if not it indicates
>> some other internal error).  So treat failure to get these bits
>> of information as errors.
>>
>> Signed-off-by: Alex Elder <elder@inktank.com>
>> ---
>>   drivers/block/rbd.c |  131
>> +++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 131 insertions(+)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index bce1fcf..04062c1 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -70,7 +70,10 @@
>>
>>   #define RBD_SNAP_HEAD_NAME    "-"
>>
>> +/* This allows a single page to hold an image name sent by OSD */
>> +#define RBD_IMAGE_NAME_LEN_MAX    (PAGE_SIZE - sizeof (__le32) - 1)
>>   #define RBD_IMAGE_ID_LEN_MAX    64
>> +
>>   #define RBD_OBJ_PREFIX_LEN_MAX    64
>>
>>   /* Feature bits */
>> @@ -658,6 +661,20 @@ out_err:
>>       return -ENOMEM;
>>   }
>>
>> +static const char *rbd_snap_name(struct rbd_device *rbd_dev, u64
>> snap_id)
>> +{
>> +    struct rbd_snap *snap;
>> +
>> +    if (snap_id == CEPH_NOSNAP)
>> +        return RBD_SNAP_HEAD_NAME;
>> +
>> +    list_for_each_entry(snap, &rbd_dev->snaps, node)
>> +        if (snap_id == snap->id)
>> +            return snap->name;
>> +
>> +    return NULL;
>> +}
>> +
>>   static int snap_by_name(struct rbd_device *rbd_dev, const char
>> *snap_name)
>>   {
>>
>> @@ -2499,6 +2516,7 @@ static int rbd_dev_v2_parent_info(struct
>> rbd_device *rbd_dev)
>>           goto out_err;
>>       }
>>       parent_spec->image_id = image_id;
>> +    parent_spec->image_id_len = len;
>>       ceph_decode_64_safe(&p, end, parent_spec->snap_id, out_err);
>>       ceph_decode_64_safe(&p, end, overlap, out_err);
>>
>> @@ -2514,6 +2532,115 @@ out_err:
>>       return ret;
>>   }
>>
>> +static char *rbd_dev_image_name(struct rbd_device *rbd_dev)
>> +{
>> +    size_t image_id_size;
>> +    char *image_id;
>> +    void *p;
>> +    void *end;
>> +    size_t size;
>> +    void *reply_buf = NULL;
>> +    size_t len = 0;
>> +    char *image_name = NULL;
>> +    int ret;
>> +
>> +    rbd_assert(!rbd_dev->spec->image_name);
>> +
>> +    image_id_size = sizeof (__le32) + rbd_dev->spec->image_id_len;
>> +    image_id = kmalloc(image_id_size, GFP_KERNEL);
>> +    if (!image_id)
>> +        return NULL;
>> +
>> +    p = image_id;
>> +    end = (char *) image_id + image_id_size;
>> +    ceph_encode_string(&p, end, rbd_dev->spec->image_id,
>> +                (u32) rbd_dev->spec->image_id_len);
>> +
>> +    size = sizeof (__le32) + RBD_IMAGE_NAME_LEN_MAX;
>> +    reply_buf = kmalloc(size, GFP_KERNEL);
>> +    if (!reply_buf)
>> +        goto out;
>> +
>> +    ret = rbd_req_sync_exec(rbd_dev, RBD_DIRECTORY,
>> +                "rbd", "dir_get_name",
>> +                image_id, image_id_size,
>> +                (char *) reply_buf, size,
>> +                CEPH_OSD_FLAG_READ, NULL);
>> +    if (ret < 0)
>> +        goto out;
>> +    p = reply_buf;
>> +    end = (char *) reply_buf + size;
>> +    image_name = ceph_extract_encoded_string(&p, end, &len, GFP_KERNEL);
>> +    if (image_name)
>> +        dout("%s: name is %s len is %zd\n", __func__, image_name, len);
>> +out:
>> +    kfree(reply_buf);
>> +    kfree(image_id);
>> +
>> +    return image_name;
>> +}
>> +
>> +/*
>> + * When a parent image gets probed, we only have the pool, image,
>> + * and snapshot ids but not the names of any of them.  This call
>> + * is made later to fill in those names.  It has to be done after
>> + * rbd_dev_snaps_update() has completed because some of the
>> + * information (in particular, snapshot name) is not available
>> + * until then.
>> + */
>> +static int rbd_dev_probe_update_spec(struct rbd_device *rbd_dev)
>> +{
>> +    struct ceph_osd_client *osdc;
>> +    const char *name;
>> +    void *reply_buf = NULL;
>> +    int ret;
>> +
>> +    if (rbd_dev->spec->pool_name)
>> +        return 0;    /* Already have the names */
>> +
>> +    /* Look up the pool name */
>> +
>> +    osdc = &rbd_dev->rbd_client->client->osdc;
>> +    name = ceph_pg_pool_name_by_id(osdc->osdmap,
>> rbd_dev->spec->pool_id);
>> +    if (!name)
>> +        return -EIO;    /* pool id too large (>= 2^31) */
>> +
>> +    rbd_dev->spec->pool_name = kstrdup(name, GFP_KERNEL);
>> +    if (!rbd_dev->spec->pool_name)
>> +        return -ENOMEM;
>> +
>> +    /* Fetch the image name; tolerate failure here */
>> +
>> +    name = rbd_dev_image_name(rbd_dev);
>> +    if (name) {
>> +        rbd_dev->spec->image_name_len = strlen(name);
>> +        rbd_dev->spec->image_name = (char *) name;
>> +    } else {
>> +        pr_warning(RBD_DRV_NAME "%d "
>> +            "unable to get image name for image id %s\n",
>> +            rbd_dev->major, rbd_dev->spec->image_id);
>> +    }
>> +
>> +    /* Look up the snapshot name. */
>> +
>> +    name = rbd_snap_name(rbd_dev, rbd_dev->spec->snap_id);
>> +    if (!name) {
>> +        ret = -EIO;
>> +        goto out_err;
>> +    }
>> +    rbd_dev->spec->snap_name = kstrdup(name, GFP_KERNEL);
>> +    if(!rbd_dev->spec->snap_name)
>> +        goto out_err;
>> +
>> +    return 0;
>> +out_err:
>> +    kfree(reply_buf);
>> +    kfree(rbd_dev->spec->pool_name);
>> +    rbd_dev->spec->pool_name = NULL;
>> +
>> +    return ret;
>> +}
>> +
>>   static int rbd_dev_v2_snap_context(struct rbd_device *rbd_dev, u64
>> *ver)
>>   {
>>       size_t size;
>> @@ -3372,6 +3499,10 @@ static int rbd_dev_probe_finish(struct rbd_device
>> *rbd_dev)
>>       if (ret)
>>           return ret;
>>
>> +    ret = rbd_dev_probe_update_spec(rbd_dev);
>> +    if (ret)
>> +        goto err_out_snaps;
>> +
>>       ret = rbd_dev_set_mapping(rbd_dev);
>>       if (ret)
>>           goto err_out_snaps;
>>
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] rbd: probe the parent of an image if present
  2012-11-01  2:07   ` [PATCH 6/6] rbd: probe the parent of an image if present Josh Durgin
@ 2012-11-01 12:26     ` Alex Elder
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Elder @ 2012-11-01 12:26 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 10/31/2012 09:07 PM, Josh Durgin wrote:
> This all makes sense, but it reminds me of another issue we'll need to
> address:
> 
> http://www.tracker.newdream.net/issues/2533

I was not aware of that one.  That's no good.

> We don't need to watch the header of a parent snapshot, since it's
> immutable and guaranteed not to be deleted out from under us.
> This avoids the bug referenced above. So I guess rbd_dev_probe{_finish}
> can take a parameter telling them whether to watch the header or not.

Yes, I've been holding off on fixing this for the time being, keeping
all image types as uniform as possible, and then refining things
once I've got more functionality completed.

I was thinking of having the parent image rbd_dev have a pointer
to the child for this purpose (as well as helping debug in the
event of a crash).  This pointer would become a list (empty for
the initially-mapped image) at the point we implement shared
parent images.

> We should check whether multiple mapped rbds (without layering) hit
> this issue as well, and if so, default to not sharing the ceph_client
> until the bug is fixed.

I'm not sure what precisely a rados_cluster_t represents but if
you can help me get a test defined for this I could check it out.

In the mean time we can hold off on committing this last patch
if you like.

					-Alex

> On 10/30/2012 06:50 PM, Alex Elder wrote:
>> Call the probe function for the parent device.
>>
>> Signed-off-by: Alex Elder <elder@inktank.com>
>> ---
>>   drivers/block/rbd.c |   79
>> +++++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 76 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 04062c1..8ef13f72 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -222,6 +222,7 @@ struct rbd_device {
>>
>>       struct rbd_spec        *parent_spec;
>>       u64            parent_overlap;
>> +    struct rbd_device    *parent;
>>
>>       /* protects updating the header */
>>       struct rw_semaphore     header_rwsem;
>> @@ -255,6 +256,7 @@ static ssize_t rbd_add(struct bus_type *bus, const
>> char *buf,
>>                  size_t count);
>>   static ssize_t rbd_remove(struct bus_type *bus, const char *buf,
>>                 size_t count);
>> +static int rbd_dev_probe(struct rbd_device *rbd_dev);
>>
>>   static struct bus_attribute rbd_bus_attrs[] = {
>>       __ATTR(add, S_IWUSR, NULL, rbd_add),
>> @@ -378,6 +380,13 @@ out_opt:
>>       return ERR_PTR(ret);
>>   }
>>
>> +static struct rbd_client *__rbd_get_client(struct rbd_client *rbdc)
>> +{
>> +    kref_get(&rbdc->kref);
>> +
>> +    return rbdc;
>> +}
>> +
>>   /*
>>    * Find a ceph client with specific addr and configuration.  If
>>    * found, bump its reference count.
>> @@ -393,7 +402,8 @@ static struct rbd_client *rbd_client_find(struct
>> ceph_options *ceph_opts)
>>       spin_lock(&rbd_client_list_lock);
>>       list_for_each_entry(client_node, &rbd_client_list, node) {
>>           if (!ceph_compare_options(ceph_opts, client_node->client)) {
>> -            kref_get(&client_node->kref);
>> +            __rbd_get_client(client_node);
>> +
>>               found = true;
>>               break;
>>           }
>> @@ -3311,6 +3321,11 @@ static int rbd_dev_image_id(struct rbd_device
>> *rbd_dev)
>>       void *response;
>>       void *p;
>>
>> +    /* If we already have it we don't need to look it up */
>> +
>> +    if (rbd_dev->spec->image_id)
>> +        return 0;
>> +
>>       /*
>>        * When probing a parent image, the image id is already
>>        * known (and the image name likely is not).  There's no
>> @@ -3492,6 +3507,9 @@ out_err:
>>
>>   static int rbd_dev_probe_finish(struct rbd_device *rbd_dev)
>>   {
>> +    struct rbd_device *parent = NULL;
>> +    struct rbd_spec *parent_spec = NULL;
>> +    struct rbd_client *rbdc = NULL;
>>       int ret;
>>
>>       /* no need to lock here, as rbd_dev is not registered yet */
>> @@ -3536,6 +3554,31 @@ static int rbd_dev_probe_finish(struct rbd_device
>> *rbd_dev)
>>        * At this point cleanup in the event of an error is the job
>>        * of the sysfs code (initiated by rbd_bus_del_dev()).
>>        */
>> +    /* Probe the parent if there is one */
>> +
>> +    if (rbd_dev->parent_spec) {
>> +        /*
>> +         * We need to pass a reference to the client and the
>> +         * parent spec when creating the parent rbd_dev.
>> +         * Images related by parent/child relationships
>> +         * always share both.
>> +         */
>> +        parent_spec = rbd_spec_get(rbd_dev->parent_spec);
>> +        rbdc = __rbd_get_client(rbd_dev->rbd_client);
>> +
>> +        parent = rbd_dev_create(rbdc, parent_spec);
>> +        if (!parent) {
>> +            ret = -ENOMEM;
>> +            goto err_out_spec;
>> +        }
>> +        rbdc = NULL;        /* parent now owns reference */
>> +        parent_spec = NULL;    /* parent now owns reference */
>> +        ret = rbd_dev_probe(parent);
>> +        if (ret < 0)
>> +            goto err_out_parent;
>> +        rbd_dev->parent = parent;
>> +    }
>> +
>>       down_write(&rbd_dev->header_rwsem);
>>       ret = rbd_dev_snaps_register(rbd_dev);
>>       up_write(&rbd_dev->header_rwsem);
>> @@ -3554,6 +3597,12 @@ static int rbd_dev_probe_finish(struct rbd_device
>> *rbd_dev)
>>           (unsigned long long) rbd_dev->mapping.size);
>>
>>       return ret;
>> +
>> +err_out_parent:
>> +    rbd_dev_destroy(parent);
>> +err_out_spec:
>> +    rbd_spec_put(parent_spec);
>> +    rbd_put_client(rbdc);
>>   err_out_bus:
>>       /* this will also clean up rest of rbd_dev stuff */
>>
>> @@ -3717,6 +3766,12 @@ static void rbd_dev_release(struct device *dev)
>>       module_put(THIS_MODULE);
>>   }
>>
>> +static void __rbd_remove(struct rbd_device *rbd_dev)
>> +{
>> +    rbd_remove_all_snaps(rbd_dev);
>> +    rbd_bus_del_dev(rbd_dev);
>> +}
>> +
>>   static ssize_t rbd_remove(struct bus_type *bus,
>>                 const char *buf,
>>                 size_t count)
>> @@ -3743,8 +3798,26 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>           goto done;
>>       }
>>
>> -    rbd_remove_all_snaps(rbd_dev);
>> -    rbd_bus_del_dev(rbd_dev);
>> +    while (rbd_dev->parent_spec) {
>> +        struct rbd_device *first = rbd_dev;
>> +        struct rbd_device *second = first->parent;
>> +        struct rbd_device *third;
>> +
>> +        /*
>> +         * Follow to the parent with no grandparent and
>> +         * remove it.
>> +         */
>> +        while (second && (third = second->parent)) {
>> +            first = second;
>> +            second = third;
>> +        }
>> +        __rbd_remove(second);
>> +        rbd_spec_put(first->parent_spec);
>> +        first->parent_spec = NULL;
>> +        first->parent_overlap = 0;
>> +        first->parent = NULL;
>> +    }
>> +    __rbd_remove(rbd_dev);
>>
>>   done:
>>       mutex_unlock(&ctl_mutex);
>>
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01 10:40               ` Gregory Farnum
  2012-11-01 10:54                 ` Stefan Priebe - Profihost AG
@ 2012-11-01 15:46                 ` Marcus Sorensen
  2012-11-01 16:28                   ` Marcus Sorensen
  1 sibling, 1 reply; 41+ messages in thread
From: Marcus Sorensen @ 2012-11-01 15:46 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Dietmar Maurer, Josh Durgin, Alexandre DERUMIER, Sage Weil, ceph-devel

In this case he's doing a direct random read, so the ios queue one at
a time on his various multipath channels. He may have defined a depth
that sends a bunch at once, but they still get split up; he could run
a blktrace to verify. If they could merge he could maybe send
multiples, or perhaps he could change his multipathing io grouping or
RR io numbers, but I don't suspect it would help.

Just to take this further, if I do his benchmark locally, I see that
it does a good job of keeping the queue full, but the ios are still
4k. They can't be merged, and they're sync (read), so they're issued
one at a time.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00 6378.00    0.00 25512.00     0.00     8.00    45.49    5.87    5.87    0.00   0.16 100.00

If I do a blktrace (I know you're not interested in pages of output,
but here are a few IOs), I see that the 4k IO (seen as sector number +
8 512b sectors) is issued (make_request = Q), then a new request
descriptor is allocated (G), then the block device queue is
plugged (P), then the request descriptor is inserted into the queue (I),
then the queue is unplugged so it can be processed, then the block
device driver kicks in and pops a single request off the queue, and
tells the disk controller to raise an interrupt whenever that is
completed.

So even though FIO is doing a good job of keeping the queue size high
as seen in iostat, the IOs are not merged and are issued to the device
driver in single file.
In this case, the point between D and whenever the interrupt is raised
and we see a "C" is subject to the latency of whatever is between our
driver and the actual data. You can see at the bottom that each one came
back in about 4-5 ms (4th column is timestamp).

  8,0    0    78905     1.932584450 10413  Q   R 34215048 + 8 [fio]
  8,0    0    78906     1.932586964 10413  G   R 34215048 + 8 [fio]
  8,0    0    78907     1.932589199 10413  P   N [fio]
  8,0    0    78908     1.932591713 10413  I   R 34215048 + 8 [fio]
  8,0    0    78909     1.932593948 10413  U   N [fio] 1
  8,0    0    78910     1.932596183 10413  D   R 34215048 + 8 [fio]

  8,0    0    78911     1.932659879 10413  Q   R 36222288 + 8 [fio]
  8,0    0    78912     1.932662393 10413  G   R 36222288 + 8 [fio]
  8,0    0    78913     1.932664907 10413  P   N [fio]
  8,0    0    78914     1.932667421 10413  I   R 36222288 + 8 [fio]
  8,0    0    78915     1.932669656 10413  U   N [fio] 1
  8,0    0    78916     1.932671891 10413  D   R 36222288 + 8 [fio]

  8,0    0    78918     1.932822469 10413  Q   R 2857800 + 8 [fio]
  8,0    0    78919     1.932827218 10413  G   R 2857800 + 8 [fio]
  8,0    0    78920     1.932829732 10413  P   N [fio]
  8,0    0    78921     1.932832247 10413  I   R 2857800 + 8 [fio]
  8,0    0    78922     1.932834482 10413  U   N [fio] 1
  8,0    0    78923     1.932836717 10413  D   R 2857800 + 8 [fio]

  8,0    0    78924     1.932902926 10413  Q   R 58687488 + 8 [fio]
  8,0    0    78925     1.932905440 10413  G   R 58687488 + 8 [fio]
  8,0    0    78926     1.932907675 10413  P   N [fio]
  8,0    0    78927     1.932910469 10413  I   R 58687488 + 8 [fio]
  8,0    0    78928     1.932912704 10413  U   N [fio] 1
  8,0    0    78929     1.932914939 10413  D   R 58687488 + 8 [fio]

  8,0    0    78930     1.932953212 10413  Q   R 31928168 + 8 [fio]
  8,0    0    78931     1.932956005 10413  G   R 31928168 + 8 [fio]
  8,0    0    78932     1.932958240 10413  P   N [fio]
  8,0    0    78933     1.932960755 10413  I   R 31928168 + 8 [fio]
  8,0    0    78934     1.932962990 10413  U   N [fio] 1
  8,0    0    78935     1.932965225 10413  D   R 31928168 + 8 [fio]

  8,0    0    79101     1.936660108     0  C   R 34215048 + 8 [0]

  8,0    0    79147     1.937862217     0  C   R 36222288 + 8 [0]

  8,0    0    79149     1.937944909     0  C   R 58687488 + 8 [0]

  8,0    0    79105     1.936713466     0  C   R 31928168 + 8 [0]
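
(For anyone who wants to reproduce this kind of trace, a minimal capture pipeline, with the device name as an example only, would be:)

  # trace /dev/sda while the benchmark runs and decode the events live
  blktrace -d /dev/sda -o - | blkparse -i -

(The single-letter actions are the ones walked through above: Q = queued, G = get request, P = plug, I = inserted, U = unplug, D = dispatched to the driver, C = completed; see blkparse(1) for the full list.)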


On Thu, Nov 1, 2012 at 4:40 AM, Gregory Farnum <greg@inktank.com> wrote:
> I'm not sure that latency addition is quite correct. Most use cases
> cases do multiple IOs at the same time, and good benchmarks tend to
> reflect that.
>
> I suspect the IO limitations here are a result of QEMU's storage
> handling (or possibly our client layer) more than anything else — Josh
> can talk about that more than I can, though!
> -Greg
>
> On Thu, Nov 1, 2012 at 8:38 AM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> I do not really understand that network latency argument.
>>
>> If one can get 40K iops with iSCSI, why can't I get the same with rados/ceph?
>>
>> Note: network latency is the same in both cases
>>
>> What do I miss?
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Alexandre DERUMIER
>>> Sent: Mittwoch, 31. Oktober 2012 18:27
>>> To: Marcus Sorensen
>>> Cc: Sage Weil; ceph-devel
>>> Subject: Re: slow fio random read benchmark, need help
>>>
>>> Thanks Marcus,
>>>
>>> indeed gigabit ethernet.
>>>
>>> note that my iscsi results  (40k)was with multipath, so multiple gigabit links.
>>>
>>> I have also done tests with a netapp array, with nfs, single link, I'm around
>>> 13000 iops
>>>
>>> I will do more tests with multiples vms, from differents hosts, and with --
>>> numjobs.
>>>
>>> I'll keep you in touch,
>>>
>>> Thanks for help,
>>>
>>> Regards,
>>>
>>> Alexandre
>>>
>>>
>>> ----- Mail original -----
>>>
>>> De: "Marcus Sorensen" <shadowsor@gmail.com>
>>> À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> Cc: "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-
>>> devel@vger.kernel.org>
>>> Envoyé: Mercredi 31 Octobre 2012 18:08:11
>>> Objet: Re: slow fio random read benchmark, need help
>>>
>>> 5000 is actually really good, if you ask me. Assuming everything is connected
>>> via gigabit. If you get 40k iops locally, you add the latency of tcp, as well as
>>> that of the ceph services and VM layer, and that's what you get. On my
>>> network I get about a .1ms round trip on gigabit over the same switch, which
>>> by definition can only do 10,000 iops. Then if you have storage on the other
>>> end capable of 40k iops, you add the latencies together (.1ms + .025ms) and
>>> you're at 8k iops.
>>> Then add the small latency of the application servicing the io (NFS, Ceph, etc),
>>> and the latency introduced by your VM layer, and 5k sounds about right.
>>>
>>> The good news is that you probably aren't taxing the storage, you can likely
>>> do many simultaneous tests from several VMs and get the same results.
>>>
>>> You can try adding --numjobs to your fio to parallelize the specific test you're
>>> doing, or launching a second VM and doing the same test at the same time.
>>> This would be a good indicator if it's latency.
>>>
>>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>>> <aderumier@odiso.com> wrote:
>>> >>>Have you tried increasing the iodepth?
>>> > Yes, I have try with 100 and 200, same results.
>>> >
>>> > I have also try directly from the host, with /dev/rbd1, and I have same
>>> result.
>>> > I have also try with 3 differents hosts, with differents cpus models.
>>> >
>>> > (note: I can reach around 40.000 iops with same fio config on a zfs
>>> > iscsi array)
>>> >
>>> > My test ceph cluster nodes cpus are old (xeon E5420), but they are around
>>> 10% usage, so I think it's ok.
>>> >
>>> >
>>> > Do you have an idea if I can trace something ?
>>> >
>>> > Thanks,
>>> >
>>> > Alexandre
>>> >
>>> > ----- Mail original -----
>>> >
>>> > De: "Sage Weil" <sage@inktank.com>
>>> > À: "Alexandre DERUMIER" <aderumier@odiso.com>
>>> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
>>> > Envoyé: Mercredi 31 Octobre 2012 16:57:05
>>> > Objet: Re: slow fio random read benchmark, need help
>>> >
>>> > On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>> >> Hello,
>>> >>
>>> >> I'm doing some tests with fio from a qemu 1.2 guest (virtio
>>> >> disk,cache=none), randread, with 4K block size on a small size of 1G
>>> >> (so it can be handle by the buffer cache on ceph cluster)
>>> >>
>>> >>
>>> >> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M
>>> >> --iodepth=40 --group_reporting --name=file1 --ioengine=libaio
>>> >> --direct=1
>>> >>
>>> >>
>>> >> I can't get more than 5000 iops.
>>> >
>>> > Have you tried increasing the iodepth?
>>> >
>>> > sage
>>> >
>>> >>
>>> >>
>>> >> RBD cluster is :
>>> >> ---------------
>>> >> 3 nodes,with each node :
>>> >> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon
>>> >> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ rbd 0.53
>>> >>
>>> >> ceph.conf
>>> >>
>>> >> journal dio = false
>>> >> filestore fiemap = false
>>> >> filestore flusher = false
>>> >> osd op threads = 24
>>> >> osd disk threads = 24
>>> >> filestore op threads = 6
>>> >>
>>> >> kvm host is : 4 x 12 cores opteron
>>> >> ------------
>>> >>
>>> >>
>>> >> During the bench:
>>> >>
>>> >> on ceph nodes:
>>> >> - cpu is around 10% used
>>> >> - iostat show no disks activity on osds. (so I think that the 1G file
>>> >> is handle in the linux buffer)
>>> >>
>>> >>
>>> >> on kvm host:
>>> >>
>>> >> -cpu is around 20% used
>>> >>
>>> >>
>>> >> I really don't see where is the bottleneck....
>>> >>
>>> >> Any Ideas, hints ?
>>> >>
>>> >>
>>> >> Regards,
>>> >>
>>> >> Alexandre
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> >> in the body of a message to majordomo@vger.kernel.org More
>>> majordomo
>>> >> info at http://vger.kernel.org/majordomo-info.html
>>> >>
>>> >>
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> > in the body of a message to majordomo@vger.kernel.org More
>>> majordomo
>>> > info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>>> body of a message to majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01 15:46                 ` slow fio random read benchmark, need help Marcus Sorensen
@ 2012-11-01 16:28                   ` Marcus Sorensen
  2012-11-01 17:00                     ` Dietmar Maurer
  0 siblings, 1 reply; 41+ messages in thread
From: Marcus Sorensen @ 2012-11-01 16:28 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Dietmar Maurer, Josh Durgin, Alexandre DERUMIER, Sage Weil, ceph-devel

Actually that didn't illustrate my point very well, since you see
individual requests being sent to the driver without waiting for
individual completion, but if you look at the full output you can see
that once the queue is full, you're at the mercy of waiting for
individual IOs to complete before sending new ones. Sometimes it's one
at a time, sometimes you get 3-4 completed and can insert a few at
once. I think this is countered by the fact that there's roundtrip
network latency in sending the request and in receiving the result.

For the record, I'm not saying that it's the entire reason why the
performance is lower (obviously since iscsi is better), I'm just
saying that when you're talking about high iops, adding 100us (best
case gigabit) to each request and response is significant. If an io
takes 25us locally (for example an SSD can do 40k iops or more at a
queue depth of 1), and you share that storage over gigabit, you just
increased the latency by an order of magnitude, and as seen there is
only so much simultaneous io going on when the queue depth is raised.
Add to that the fact that multipathing interleaves IOs rather than
sending them down all paths in parallel, plus the extra traffic a
distributed store generates.
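
(As a back-of-envelope check, using the same figures quoted earlier in the thread and assuming the reads are fully serialized at queue depth 1:

  local SSD read   ~0.025 ms  ->  1 / 0.025 ms ~= 40,000 iops
  + gigabit RTT    ~0.100 ms
  ---------------------------
  remote read      ~0.125 ms  ->  1 / 0.125 ms ~=  8,000 iops

Higher queue depths only help to the extent the IOs really stay in flight in parallel, which is exactly the limitation the blktrace output shows.)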

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: slow fio random read benchmark, need help
  2012-11-01 16:28                   ` Marcus Sorensen
@ 2012-11-01 17:00                     ` Dietmar Maurer
  2012-11-03 17:09                       ` Gregory Farnum
  0 siblings, 1 reply; 41+ messages in thread
From: Dietmar Maurer @ 2012-11-01 17:00 UTC (permalink / raw)
  To: Marcus Sorensen, Gregory Farnum
  Cc: Josh Durgin, Alexandre DERUMIER, Sage Weil, ceph-devel


> For the record, I'm not saying that it's the entire reason why the performance
> is lower (obviously since iscsi is better), I'm just saying that when you're
> talking about high iops, adding 100us (best case gigabit) to each request and
> response is significant

iSCSI also uses the network (it also adds 100us to each request), so that simply can't be the reason.

I always thought a distributed block store could do such things
faster than (or at least as fast as) a single centralized store?

- Dietmar


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01 10:54                 ` Stefan Priebe - Profihost AG
@ 2012-11-02  9:38                   ` Alexandre DERUMIER
  2012-11-03 10:01                     ` slow fio random read benchmark: last librbd git : 20000iops ! Alexandre DERUMIER
  0 siblings, 1 reply; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-11-02  9:38 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Dietmar Maurer, Josh Durgin, Marcus Sorensen, Sage Weil,
	ceph-devel, Gregory Farnum

>>Same results with rbd kernel driver without QEMU involved. 

I confirm: I have the same result with the rbd kernel driver, so it's not a qemu problem.


----- Mail original ----- 

De: "Stefan Priebe - Profihost AG" <s.priebe@profihost.ag> 
À: "Gregory Farnum" <greg@inktank.com> 
Cc: "Dietmar Maurer" <dietmar@proxmox.com>, "Josh Durgin" <josh.durgin@inktank.com>, "Alexandre DERUMIER" <aderumier@odiso.com>, "Marcus Sorensen" <shadowsor@gmail.com>, "Sage Weil" <sage@inktank.com>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Jeudi 1 Novembre 2012 11:54:03 
Objet: Re: slow fio random read benchmark, need help 

Am 01.11.2012 11:40, schrieb Gregory Farnum: 
> I'm not sure that latency addition is quite correct. Most use cases 
> cases do multiple IOs at the same time, and good benchmarks tend to 
> reflect that. 
> 
> I suspect the IO limitations here are a result of QEMU's storage 
> handling (or possibly our client layer) more than anything else — Josh 
> can talk about that more than I can, though! 
> -Greg 

Same results with rbd kernel driver without QEMU involved. 

Stefan 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 41+ messages in thread

* slow fio random read benchmark: last librbd git : 20000iops !
  2012-11-02  9:38                   ` Alexandre DERUMIER
@ 2012-11-03 10:01                     ` Alexandre DERUMIER
  2012-11-03 12:09                       ` Alexandre DERUMIER
  0 siblings, 1 reply; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-11-03 10:01 UTC (permalink / raw)
  To: ceph-devel
  Cc: Dietmar Maurer, Josh Durgin, Marcus Sorensen, Sage Weil,
	Gregory Farnum, Stefan Priebe - Profihost AG

Hi Everybody,

I have just recompiled my qemu-kvm package with last librbd git,

my iops have jumped from 5000 to 20000 for a single fio randread benchmark!
The client cpu seems to be the bottleneck; I'll test on a bigger cpu this week.
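
A quick way to confirm that on the client (the qemu process name is just an example, it depends on how the vm is started):

  pidstat -t -p $(pidof kvm) 1

If one qemu/librbd thread sits near 100% of a single core during the randread, the limit is on the client side rather than in the cluster.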


Qemu-kvm was previously compiled with librbd 0.53.


I don't know what has changed. (I haven't used the striping feature, and I get the same result for image format 1 and 2.)


Regards,

Alexandre


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark: last librbd git : 20000iops !
  2012-11-03 10:01                     ` slow fio random read benchmark: last librbd git : 20000iops ! Alexandre DERUMIER
@ 2012-11-03 12:09                       ` Alexandre DERUMIER
  0 siblings, 0 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-11-03 12:09 UTC (permalink / raw)
  To: ceph-devel
  Cc: Dietmar Maurer, Josh Durgin, Marcus Sorensen, Sage Weil,
	Gregory Farnum, Stefan Priebe - Profihost AG

Oh, 

Forget what I said, I think I spoke too fast; it seems to be a bug in git, as I don't see any network traffic coming from the ceph cluster when I reach 20.000 iops :(

After creating some new images, I sometimes get 20.000 iops (with no real traffic) and sometimes 5000 iops like before (and then I do see traffic coming from the ceph cluster).
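
The check is just watching the client NIC while fio runs, for example (the interface name is only an example):

  sar -n DEV 1 | grep eth0

If rxkB/s stays near zero during a 20.000 iops run, the reads are clearly coming from some client-side cache and not from the osds.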

I'll wait for the next 0.54 stable before redoing the tests.

Regards,

Alexandre




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-01 17:00                     ` Dietmar Maurer
@ 2012-11-03 17:09                       ` Gregory Farnum
  2012-11-04 14:54                         ` Alexandre DERUMIER
  0 siblings, 1 reply; 41+ messages in thread
From: Gregory Farnum @ 2012-11-03 17:09 UTC (permalink / raw)
  To: Dietmar Maurer
  Cc: Marcus Sorensen, Josh Durgin, Alexandre DERUMIER, Sage Weil, ceph-devel

On Thu, Nov 1, 2012 at 6:00 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
> I always thought a distributed block storage could do such things
> faster (or at least as fast) than a single centralized store?

That rather depends on what makes up each of them. ;)

On Thu, Nov 1, 2012 at 6:11 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
> I have some customers with some huge databases (too big to be handled in the buffer cache), requiring a lot of ios (around 10K).
>
> I have redone tests with 4 guests in parallel, I get 4 x 5000 iops, so it seems to scale! (and cpu is very low on the ceph cluster).
>
>
> So I'll try some tricks, like raid over multiple rbd devices, maybe it'll help.

Did your RAID setup improve anything? Have you tried scaling past 4
guests in parallel?
I still haven't come up with a good model for what could be causing
these symptoms. :/
-Greg

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
  2012-11-03 17:09                       ` Gregory Farnum
@ 2012-11-04 14:54                         ` Alexandre DERUMIER
  0 siblings, 0 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-11-04 14:54 UTC (permalink / raw)
  To: Gregory Farnum
  Cc: Marcus Sorensen, Josh Durgin, Sage Weil, ceph-devel, Dietmar Maurer

>>Did your RAID setup improve anything?

I have tried launching 2 fio tests in parallel, on 2 disks in the same guest vm; I get 2500 iops for each test.

Running 2 fio tests on 2 different guests gives me 5000 iops for each test.

I really don't understand... Maybe something doesn't use parallelism within 1 kvm process? (monitor access, or something else...)

So raid doesn't help.
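
(The raid test was just an md stripe over mapped rbd images, something along these lines; device names are only examples:)

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/rbd1 /dev/rbd2
  fio --filename=/dev/md0 --rw=randread --bs=4K --size=1000M --iodepth=40 --ioengine=libaio --direct=1 --group_reporting --name=file1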

>> Have you tried scaling past 4 guests in parallel?

Not yet, I'll do more tests this week.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: slow fio random read benchmark, need help
       [not found] <CAMiztYLY364EXVQu6d6+He-FXd_AsDOqxkTO_DzKk24iJjwTcQ@mail.gmail.com>
@ 2012-10-31 17:22 ` Alexandre DERUMIER
  0 siblings, 0 replies; 41+ messages in thread
From: Alexandre DERUMIER @ 2012-10-31 17:22 UTC (permalink / raw)
  To: Mark Kampe; +Cc: ceph-devel

Hi,
I use a small file size (1G) to be sure it can be held in the buffer cache. (I don't see any read access on the disks with iostat during the test.)
But I think the problem is not the disk hardware ios, but a bottleneck somewhere in the ceph protocol.
(All the benchmarks I have seen on the ceph mailing list never reach more than 20.000 iops, even with a full-ssd ceph cluster and bigger cpus.)
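
(For what it's worth, the iostat check on the osd nodes is simply:)

  iostat -x 1

r/s and rkB/s stay at about 0 on every data disk for the whole run, so nothing is actually hitting the platters.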

>>If you can get 40K random read IOPS out of 18 spindles, I 
>>have to ask why you think most of those operations made 
>>it to disk. It sounds to me like they were being satisfied 
>>out of cache. 

(The 40k was on a zfs san, handled in the zfs arc memory buffer, so no read access on disk there either.)


>>Are you sure that fio is doing what you think it is doing? 
Yes, I'm sure; I have already benchmarked a lot of san arrays with fio.
I have also done the same test with a sheepdog cluster (same hardware); I can reach 20.000-30.000 io/s if the buffer cache is big enough.



I would like to know where the bottleneck is before building a ceph cluster with more powerful servers and full-ssd osds.


Does Inktank have some random read/write io benchmarks?
(I see a lot of sequential benchmarks with high bandwidth results, but not so many random io/s results.)


Regards,

Alexandre

----- Mail original ----- 

De: "Mark Kampe" <mark.kampe@inktank.com> 
À: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "ceph-devel" <ceph-devel@vger.kernel.org> 
Envoyé: Mercredi 31 Octobre 2012 17:56:26 
Objet: Re: slow fio random read benchmark, need help 

I'm a little confused by the math here: 

15K RPM = 250 rotations/second 
3 hosts * 6 OSDs/host = 18 spindles 
18 spindles * 250 rotations/second = 4500 tracks/second (sans seeks) 

4K direct random reads against large files should have negligible 
cache hits. But large numbers of parallel operations (greater 
iodepth) may give us multiple (coincidental) reads per track, 
which could push us a little above one read per track, and 
enable some good head scheduling (keeping the seeks small) 
but even so the seeks are probably going to cut that number 
by half or worse. 

If you can get 40K random read IOPS out of 18 spindles, I 
have to ask why you think most of those operations made 
it to disk. It sounds to me like they were being satisfied 
out of cache. 

Are you sure that fio is doing what you think it is doing? 


On Wed, Oct 31, 2012 at 9:29 AM, Alexandre DERUMIER < aderumier@odiso.com > wrote: 



>>Have you tried increasing the iodepth? 
Yes, I have tried 100 and 200, same results. 

I have also tried directly from the host, with /dev/rbd1, and I get the same result. 
I have also tried with 3 different hosts, with different cpu models. 

(note: I can reach around 40.000 iops with same fio config on a zfs iscsi array) 

My test ceph cluster nodes' cpus are old (xeon E5420), but they are at around 10% usage, so I think it's ok. 


Do you have an idea if I can trace something ? 

Thanks, 

Alexandre 

----- Mail original ----- 

De: "Sage Weil" < sage@inktank.com > 
À: "Alexandre DERUMIER" < aderumier@odiso.com > 
Cc: "ceph-devel" < ceph-devel@vger.kernel.org > 
Envoyé: Mercredi 31 Octobre 2012 16:57:05 
Objet: Re: slow fio random read benchmark, need help 



On Wed, 31 Oct 2012, Alexandre DERUMIER wrote: 
> Hello, 
> 
> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk,cache=none), randread, with 4K block size on a small size of 1G (so it can be handle by the buffer cache on ceph cluster) 
> 
> 
> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1 
> 
> 
> I can't get more than 5000 iops. 

Have you tried increasing the iodepth? 

sage 

> 
> 
> RBD cluster is : 
> --------------- 
> 3 nodes,with each node : 
> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ 
> rbd 0.53 
> 
> ceph.conf 
> 
> journal dio = false 
> filestore fiemap = false 
> filestore flusher = false 
> osd op threads = 24 
> osd disk threads = 24 
> filestore op threads = 6 
> 
> kvm host is : 4 x 12 cores opteron 
> ------------ 
> 
> 
> During the bench: 
> 
> on ceph nodes: 
> - cpu is around 10% used 
> - iostat show no disks activity on osds. (so I think that the 1G file is handle in the linux buffer) 
> 
> 
> on kvm host: 
> 
> -cpu is around 20% used 
> 
> 
> I really don't see where is the bottleneck.... 
> 
> Any Ideas, hints ? 
> 
> 
> Regards, 
> 
> Alexandre 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
> 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 





-- 
Mark.Kampe@inktank.com 
VP, Engineering 
Mobile: +1-213-400-8857 
Office: +1-323-375-3863 

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-11-04 14:54 UTC | newest]

Thread overview: 41+ messages
2012-10-31  1:41 [PATCH 0/6] rbd: version 2 parent probing Alex Elder
2012-10-31  1:49 ` [PATCH 1/6] rbd: skip getting image id if known Alex Elder
2012-10-31 21:05   ` Josh Durgin
2012-10-31  1:49 ` [PATCH 2/6] rbd: allow null image name Alex Elder
2012-10-31 21:07   ` Josh Durgin
2012-10-31  1:49 ` [PATCH 3/6] rbd: get parent spec for version 2 images Alex Elder
2012-11-01  1:33   ` Josh Durgin
2012-10-31  1:49 ` [PATCH 4/6] libceph: define ceph_pg_pool_name_by_id() Alex Elder
2012-11-01  1:34   ` Josh Durgin
2012-10-31  1:49 ` [PATCH 5/6] rbd: get additional info in parent spec Alex Elder
2012-10-31 14:11   ` Alex Elder
2012-11-01  1:49   ` Josh Durgin
2012-11-01 12:18     ` Alex Elder
2012-10-31  1:50 ` [PATCH 6/6] rbd: probe the parent of an image if present Alex Elder
2012-10-31 11:59   ` slow fio random read benchmark, need help Alexandre DERUMIER
2012-10-31 15:57     ` Sage Weil
2012-10-31 16:29       ` Alexandre DERUMIER
2012-10-31 16:50         ` Alexandre DERUMIER
2012-10-31 17:08         ` Marcus Sorensen
2012-10-31 17:27           ` Alexandre DERUMIER
2012-10-31 17:38             ` Marcus Sorensen
2012-10-31 18:56               ` Alexandre DERUMIER
2012-10-31 19:50                 ` Marcus Sorensen
2012-11-01  5:11                   ` Alexandre DERUMIER
2012-11-01  5:41                     ` Stefan Priebe - Profihost AG
2012-10-31 20:22                 ` Josh Durgin
2012-11-01  7:38             ` Dietmar Maurer
2012-11-01  8:08               ` Stefan Priebe - Profihost AG
2012-11-01 10:40               ` Gregory Farnum
2012-11-01 10:54                 ` Stefan Priebe - Profihost AG
2012-11-02  9:38                   ` Alexandre DERUMIER
2012-11-03 10:01                     ` slow fio random read benchmark: last librbd git : 20000iops ! Alexandre DERUMIER
2012-11-03 12:09                       ` Alexandre DERUMIER
2012-11-01 15:46                 ` slow fio random read benchmark, need help Marcus Sorensen
2012-11-01 16:28                   ` Marcus Sorensen
2012-11-01 17:00                     ` Dietmar Maurer
2012-11-03 17:09                       ` Gregory Farnum
2012-11-04 14:54                         ` Alexandre DERUMIER
2012-11-01  2:07   ` [PATCH 6/6] rbd: probe the parent of an image if present Josh Durgin
2012-11-01 12:26     ` Alex Elder
     [not found] <CAMiztYLY364EXVQu6d6+He-FXd_AsDOqxkTO_DzKk24iJjwTcQ@mail.gmail.com>
2012-10-31 17:22 ` slow fio random read benchmark, need help Alexandre DERUMIER
