* [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op
@ 2014-02-21 18:55 Ilya Dryomov
  2014-02-21 18:55 ` [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags Ilya Dryomov
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

Hello,

This series adds support for CEPH_OSD_OP_SETALLOCHINT osd op to libceph
along with adjusting rbd to make use of it.  The rationale and the
basic design were outlined in the "rados io hints" thread on ceph-devel
about a month ago.

Thanks,

                Ilya


Ilya Dryomov (6):
  libceph: encode CEPH_OSD_OP_FLAG_* op flags
  libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  libceph: bump CEPH_OSD_MAX_OP to 3
  rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
  rbd: num_ops parameter for rbd_osd_req_create()
  rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op

 drivers/block/rbd.c             |   90 ++++++++++++++++++++++++++++-----------
 include/linux/ceph/osd_client.h |   12 +++++-
 include/linux/ceph/rados.h      |   10 ++++-
 net/ceph/osd_client.c           |   32 ++++++++++++++
 4 files changed, 117 insertions(+), 27 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
@ 2014-02-21 18:55 ` Ilya Dryomov
  2014-02-24 14:58   ` Alex Elder
  2014-02-21 18:55 ` [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

Encode the ceph_osd_op::flags field so that it gets sent over the wire.
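[Editorial note: before this patch, osd_req_encode_op() copied the op code
and payload length into the wire struct but never the flags, so per-op flags
were silently dropped.  A minimal userspace sketch of the fixed encoding;
the struct layout and helper names below are illustrative stand-ins for the
kernel's ceph_osd_op and cpu_to_le*() macros, not the real definitions.]

```c
#include <stdint.h>

/* Host-order source op, loosely modelled on ceph_osd_req_op. */
struct op_host {
	uint16_t op;		/* CEPH_OSD_OP_* */
	uint32_t flags;		/* CEPH_OSD_OP_FLAG_* */
	uint32_t payload_len;
};

/* Little-endian byte emitters, standing in for cpu_to_le16/32. */
static void put_le16(uint8_t *d, uint16_t v)
{
	d[0] = v & 0xff;
	d[1] = v >> 8;
}

static void put_le32(uint8_t *d, uint32_t v)
{
	d[0] = v & 0xff;
	d[1] = (v >> 8) & 0xff;
	d[2] = (v >> 16) & 0xff;
	d[3] = (v >> 24) & 0xff;
}

/* Encode into a 10-byte wire buffer: le16 op, le32 flags, le32 payload_len.
 * The flags line is the one this patch effectively adds. */
static void encode_op(uint8_t wire[10], const struct op_host *src)
{
	put_le16(wire + 0, src->op);
	put_le32(wire + 2, src->flags);
	put_le32(wire + 6, src->payload_len);
}
```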

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osd_client.h |    1 +
 include/linux/ceph/rados.h      |    2 +-
 net/ceph/osd_client.c           |    2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index fd47e872ebcc..e94f5da251d6 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -76,6 +76,7 @@ struct ceph_osd_data {
 
 struct ceph_osd_req_op {
 	u16 op;           /* CEPH_OSD_OP_* */
+	u32 flags;        /* CEPH_OSD_OP_FLAG_* */
 	u32 payload_len;
 	union {
 		struct ceph_osd_data raw_data_in;
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 96292df4041b..8f9bf4570215 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -382,7 +382,7 @@ enum {
  */
 struct ceph_osd_op {
 	__le16 op;           /* CEPH_OSD_OP_* */
-	__le32 flags;        /* CEPH_OSD_FLAG_* */
+	__le32 flags;        /* CEPH_OSD_OP_FLAG_* */
 	union {
 		struct {
 			__le64 offset, length;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 0676f2b199d6..5d7fd0b8c1c8 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -688,7 +688,9 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
 
 		return 0;
 	}
+
 	dst->op = cpu_to_le16(src->op);
+	dst->flags = cpu_to_le32(src->flags);
 	dst->payload_len = cpu_to_le32(src->payload_len);
 
 	return request_data_len;
-- 
1.7.10.4


* [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
  2014-02-21 18:55 ` [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags Ilya Dryomov
@ 2014-02-21 18:55 ` Ilya Dryomov
  2014-02-23 16:03   ` Sage Weil
  2014-02-24 14:59   ` Alex Elder
  2014-02-21 18:55 ` [PATCH 3/6] libceph: bump CEPH_OSD_MAX_OP to 3 Ilya Dryomov
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

This is primarily for rbd's benefit and is supposed to combat
fragmentation:

"... knowing that rbd images have a 4m size, librbd can pass a hint
that will let the osd do the xfs allocation size ioctl on new files so
that they are allocated in 1m or 4m chunks.  We've seen cases where
users with rbd workloads have very high levels of fragmentation in xfs
and this would mitigate that and probably have a pretty nice
performance benefit."

SETALLOCHINT is considered advisory, so our backwards-compatibility
mechanism here is to set the FAILOK flag on all SETALLOCHINT ops.
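[Editorial note: the FAILOK fallback can be condensed into a few lines.
The flag value and struct below are simplified userspace stand-ins for
the definitions in include/linux/ceph/rados.h and osd_client.h; the flag
value in particular is assumed for illustration only.]

```c
#include <stdint.h>

#define OSD_OP_FLAG_FAILOK 1u	/* assumed value, illustration only */

struct alloc_hint_op {
	uint32_t flags;
	uint64_t expected_size;
	uint64_t expected_write_size;
	uint8_t  expected_size_probability;
};

/* Mirrors the idea behind osd_req_op_hint_init(): record the hint and
 * always set FAILOK, so an older osd that rejects the unknown opcode
 * fails only this advisory op rather than the whole request. */
static void alloc_hint_init(struct alloc_hint_op *op,
			    uint64_t expected_size,
			    uint64_t expected_write_size)
{
	op->expected_size = expected_size;
	op->expected_write_size = expected_write_size;
	op->expected_size_probability = 0;
	op->flags |= OSD_OP_FLAG_FAILOK;	/* advisory: no feature bit */
}
```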

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osd_client.h |    9 +++++++++
 include/linux/ceph/rados.h      |    8 ++++++++
 net/ceph/osd_client.c           |   30 ++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index e94f5da251d6..6bfcb0eca8ab 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -103,6 +103,11 @@ struct ceph_osd_req_op {
 			u32 timeout;
 			__u8 flag;
 		} watch;
+		struct {
+			u64 expected_size;
+			u64 expected_write_size;
+			__u8 expected_size_probability;
+		} hint;
 	};
 };
 
@@ -294,6 +299,10 @@ extern void osd_req_op_cls_init(struct ceph_osd_request *osd_req,
 extern void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
 					unsigned int which, u16 opcode,
 					u64 cookie, u64 version, int flag);
+extern void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
+				 unsigned int which, u16 opcode,
+				 u64 expected_size, u64 expected_write_size,
+				 u8 expected_size_probability);
 
 extern struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client *osdc,
 					       struct ceph_snap_context *snapc,
diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
index 8f9bf4570215..b8e2dd11f186 100644
--- a/include/linux/ceph/rados.h
+++ b/include/linux/ceph/rados.h
@@ -227,6 +227,9 @@ enum {
 	CEPH_OSD_OP_OMAPRMKEYS    = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 24,
 	CEPH_OSD_OP_OMAP_CMP      = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 25,
 
+	/* hints */
+	CEPH_OSD_OP_SETALLOCHINT = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 35,
+
 	/** multi **/
 	CEPH_OSD_OP_CLONERANGE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_MULTI | 1,
 	CEPH_OSD_OP_ASSERT_SRC_VERSION = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_MULTI | 2,
@@ -416,6 +419,11 @@ struct ceph_osd_op {
 			__le64 offset, length;
 			__le64 src_offset;
 		} __attribute__ ((packed)) clonerange;
+		struct {
+			__le64 expected_size;
+			__le64 expected_write_size;
+			__u8 expected_size_probability;
+		} __attribute__ ((packed)) hint;
 	};
 	__le32 payload_len;
 } __attribute__ ((packed));
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5d7fd0b8c1c8..4090f6e8db3a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -436,6 +436,7 @@ static bool osd_req_opcode_valid(u16 opcode)
 	case CEPH_OSD_OP_OMAPCLEAR:
 	case CEPH_OSD_OP_OMAPRMKEYS:
 	case CEPH_OSD_OP_OMAP_CMP:
+	case CEPH_OSD_OP_SETALLOCHINT:
 	case CEPH_OSD_OP_CLONERANGE:
 	case CEPH_OSD_OP_ASSERT_SRC_VERSION:
 	case CEPH_OSD_OP_SRC_CMPXATTR:
@@ -591,6 +592,28 @@ void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
 }
 EXPORT_SYMBOL(osd_req_op_watch_init);
 
+void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
+			  unsigned int which, u16 opcode,
+			  u64 expected_size, u64 expected_write_size,
+			  u8 expected_size_probability)
+{
+	struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
+
+	BUG_ON(opcode != CEPH_OSD_OP_SETALLOCHINT);
+
+	op->hint.expected_size = expected_size;
+	op->hint.expected_write_size = expected_write_size;
+	op->hint.expected_size_probability = expected_size_probability;
+
+	/*
+	 * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
+	 * not worth a feature bit.  Set FAILOK per-op flag to make
+	 * sure older osds don't trip over an unsupported opcode.
+	 */
+	op->flags |= CEPH_OSD_OP_FLAG_FAILOK;
+}
+EXPORT_SYMBOL(osd_req_op_hint_init);
+
 static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
 				struct ceph_osd_data *osd_data)
 {
@@ -681,6 +704,13 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
 		dst->watch.ver = cpu_to_le64(src->watch.ver);
 		dst->watch.flag = src->watch.flag;
 		break;
+	case CEPH_OSD_OP_SETALLOCHINT:
+		dst->hint.expected_size = cpu_to_le64(src->hint.expected_size);
+		dst->hint.expected_write_size =
+		    cpu_to_le64(src->hint.expected_write_size);
+		dst->hint.expected_size_probability =
+		    src->hint.expected_size_probability;
+		break;
 	default:
 		pr_err("unsupported osd opcode %s\n",
 			ceph_osd_op_name(src->op));
-- 
1.7.10.4


* [PATCH 3/6] libceph: bump CEPH_OSD_MAX_OP to 3
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
  2014-02-21 18:55 ` [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags Ilya Dryomov
  2014-02-21 18:55 ` [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
@ 2014-02-21 18:55 ` Ilya Dryomov
  2014-02-24 14:59   ` Alex Elder
  2014-02-21 18:55 ` [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback() Ilya Dryomov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

Our longest osd request now contains 3 ops: copyup+hint+write.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 include/linux/ceph/osd_client.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 6bfcb0eca8ab..bc874b2398d7 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -43,7 +43,7 @@ struct ceph_osd {
 };
 
 
-#define CEPH_OSD_MAX_OP	2
+#define CEPH_OSD_MAX_OP	3
 
 enum ceph_osd_data_type {
 	CEPH_OSD_DATA_TYPE_NONE = 0,
-- 
1.7.10.4


* [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (2 preceding siblings ...)
  2014-02-21 18:55 ` [PATCH 3/6] libceph: bump CEPH_OSD_MAX_OP to 3 Ilya Dryomov
@ 2014-02-21 18:55 ` Ilya Dryomov
  2014-02-24 14:59   ` Alex Elder
  2014-02-21 18:55 ` [PATCH 5/6] rbd: num_ops parameter for rbd_osd_req_create() Ilya Dryomov
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

The CEPH_OSD_MAX_OP value in rbd_osd_req_callback() is hard-coded to 2.
Use the constant instead.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 drivers/block/rbd.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b365e0dfccb6..48a889866824 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1654,7 +1654,7 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 	if (osd_req->r_result < 0)
 		obj_request->result = osd_req->r_result;
 
-	BUG_ON(osd_req->r_num_ops > 2);
+	BUG_ON(osd_req->r_num_ops > CEPH_OSD_MAX_OP);
 
 	/*
 	 * We support a 64-bit length, but ultimately it has to be
-- 
1.7.10.4


* [PATCH 5/6] rbd: num_ops parameter for rbd_osd_req_create()
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (3 preceding siblings ...)
  2014-02-21 18:55 ` [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback() Ilya Dryomov
@ 2014-02-21 18:55 ` Ilya Dryomov
  2014-02-24 14:59   ` Alex Elder
  2014-02-21 18:55 ` [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

In preparation for prefixing rbd writes with an allocation hint,
introduce a num_ops parameter for rbd_osd_req_create().  The rationale
is that not every write request carries a write op that needs to be
prefixed (a watch request, for example), so the num_ops decision has to
live in the callers.
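[Editorial note: the caller-side split described above could be sketched
as below.  The helper is hypothetical; at this point in the series every
caller still passes 1, and plain data writes switch to 2 ops only in the
next patch.]

```c
/* Hypothetical helper showing why num_ops belongs in the callers:
 * only a plain data write gets the hint+write pair; a watch request
 * is technically a write but stays single-op, as do reads. */
static unsigned int rbd_num_ops(int write_request, int is_watch)
{
	if (write_request && !is_watch)
		return 2;	/* SETALLOCHINT + WRITE */
	return 1;		/* READ, STAT, NOTIFY_ACK, WATCH, ... */
}
```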

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 drivers/block/rbd.c |   28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 48a889866824..6cf001ef00bc 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1718,6 +1718,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
 static struct ceph_osd_request *rbd_osd_req_create(
 					struct rbd_device *rbd_dev,
 					bool write_request,
+					unsigned int num_ops,
 					struct rbd_obj_request *obj_request)
 {
 	struct ceph_snap_context *snapc = NULL;
@@ -1733,10 +1734,13 @@ static struct ceph_osd_request *rbd_osd_req_create(
 			snapc = img_request->snapc;
 	}
 
-	/* Allocate and initialize the request, for the single op */
+	rbd_assert(num_ops == 1);
+
+	/* Allocate and initialize the request, for the num_ops ops */
 
 	osdc = &rbd_dev->rbd_client->client->osdc;
-	osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_ATOMIC);
+	osd_req = ceph_osdc_alloc_request(osdc, snapc, num_ops, false,
+					  GFP_ATOMIC);
 	if (!osd_req)
 		return NULL;	/* ENOMEM */
 
@@ -2220,8 +2224,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 			pages += page_count;
 		}
 
-		osd_req = rbd_osd_req_create(rbd_dev, write_request,
-						obj_request);
+		osd_req = rbd_osd_req_create(rbd_dev, write_request, 1,
+					     obj_request);
 		if (!osd_req)
 			goto out_partial;
 		obj_request->osd_req = osd_req;
@@ -2604,8 +2608,8 @@ static int rbd_img_obj_exists_submit(struct rbd_obj_request *obj_request)
 
 	rbd_assert(obj_request->img_request);
 	rbd_dev = obj_request->img_request->rbd_dev;
-	stat_request->osd_req = rbd_osd_req_create(rbd_dev, false,
-						stat_request);
+	stat_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
+						   stat_request);
 	if (!stat_request->osd_req)
 		goto out;
 	stat_request->callback = rbd_img_obj_exists_callback;
@@ -2808,7 +2812,8 @@ static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev, u64 notify_id)
 		return -ENOMEM;
 
 	ret = -ENOMEM;
-	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
+	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
+						  obj_request);
 	if (!obj_request->osd_req)
 		goto out;
 
@@ -2871,7 +2876,8 @@ static int __rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, bool start)
 	if (!obj_request)
 		goto out_cancel;
 
-	obj_request->osd_req = rbd_osd_req_create(rbd_dev, true, obj_request);
+	obj_request->osd_req = rbd_osd_req_create(rbd_dev, true, 1,
+						  obj_request);
 	if (!obj_request->osd_req)
 		goto out_cancel;
 
@@ -2979,7 +2985,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
 	obj_request->pages = pages;
 	obj_request->page_count = page_count;
 
-	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
+	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
+						  obj_request);
 	if (!obj_request->osd_req)
 		goto out;
 
@@ -3212,7 +3219,8 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
 	obj_request->pages = pages;
 	obj_request->page_count = page_count;
 
-	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
+	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
+						  obj_request);
 	if (!obj_request->osd_req)
 		goto out;
 
-- 
1.7.10.4


* [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (4 preceding siblings ...)
  2014-02-21 18:55 ` [PATCH 5/6] rbd: num_ops parameter for rbd_osd_req_create() Ilya Dryomov
@ 2014-02-21 18:55 ` Ilya Dryomov
  2014-02-24 14:59   ` Alex Elder
  2014-02-23 16:14 ` [PATCH 0/6] libceph: " Sage Weil
  2014-02-24 14:58 ` Alex Elder
  7 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-21 18:55 UTC (permalink / raw)
  To: ceph-devel

In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 << order).  Backwards compatibility is taken care
of on the libceph/osd side.
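[Editorial note: as a quick check of the hint values, the rbd object size
is 1 << order bytes, and it is passed as both expected_size and
expected_write_size.  The helper name below is illustrative; the driver
itself uses rbd_obj_bytes(&rbd_dev->header).]

```c
#include <stdint.h>

/* rbd object size: 1 << order bytes.  The default order of 22 gives
 * the 4 MiB objects mentioned in the "rados io hints" discussion. */
static uint64_t obj_bytes(uint8_t order)
{
	return 1ULL << order;
}
```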

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
---
 drivers/block/rbd.c |   66 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6cf001ef00bc..14496f39c770 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1662,7 +1662,12 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
 	 */
 	obj_request->xferred = osd_req->r_reply_op_len[0];
 	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
+
 	opcode = osd_req->r_ops[0].op;
+	if (opcode == CEPH_OSD_OP_SETALLOCHINT) {
+		BUG_ON(osd_req->r_ops[1].op != CEPH_OSD_OP_WRITE);
+		opcode = CEPH_OSD_OP_WRITE;
+	}
 	switch (opcode) {
 	case CEPH_OSD_OP_READ:
 		rbd_osd_read_callback(obj_request);
@@ -1715,6 +1720,12 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
 			snapc, CEPH_NOSNAP, &mtime);
 }
 
+/*
+ * Create an osd request.  A read request has one osd op (read).
+ * A write request has either one (watch) or two (hint+write) osd ops.
+ * (All rbd writes are prefixed with an allocation hint op, but
+ * technically osd watch is a write request, hence this distinction.)
+ */
 static struct ceph_osd_request *rbd_osd_req_create(
 					struct rbd_device *rbd_dev,
 					bool write_request,
@@ -1734,7 +1745,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
 			snapc = img_request->snapc;
 	}
 
-	rbd_assert(num_ops == 1);
+	rbd_assert((!write_request && num_ops == 1) ||
+		   (write_request && num_ops >= 1 && num_ops <= 2));
 
 	/* Allocate and initialize the request, for the num_ops ops */
 
@@ -1760,8 +1772,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
 
 /*
  * Create a copyup osd request based on the information in the
- * object request supplied.  A copyup request has two osd ops,
- * a copyup method call, and a "normal" write request.
+ * object request supplied.  A copyup request has three osd ops,
+ * a copyup method call, a hint op, and a write op.
  */
 static struct ceph_osd_request *
 rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
@@ -1777,12 +1789,12 @@ rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
 	rbd_assert(img_request);
 	rbd_assert(img_request_write_test(img_request));
 
-	/* Allocate and initialize the request, for the two ops */
+	/* Allocate and initialize the request, for the three ops */
 
 	snapc = img_request->snapc;
 	rbd_dev = img_request->rbd_dev;
 	osdc = &rbd_dev->rbd_client->client->osdc;
-	osd_req = ceph_osdc_alloc_request(osdc, snapc, 2, false, GFP_ATOMIC);
+	osd_req = ceph_osdc_alloc_request(osdc, snapc, 3, false, GFP_ATOMIC);
 	if (!osd_req)
 		return NULL;	/* ENOMEM */
 
@@ -2159,12 +2171,10 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 	struct page **pages = NULL;
 	u64 img_offset;
 	u64 resid;
-	u16 opcode;
 
 	dout("%s: img %p type %d data_desc %p\n", __func__, img_request,
 		(int)type, data_desc);
 
-	opcode = write_request ? CEPH_OSD_OP_WRITE : CEPH_OSD_OP_READ;
 	img_offset = img_request->offset;
 	resid = img_request->length;
 	rbd_assert(resid > 0);
@@ -2183,6 +2193,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 		const char *object_name;
 		u64 offset;
 		u64 length;
+		unsigned int which;
+		u16 opcode;
 
 		object_name = rbd_segment_name(rbd_dev, img_offset);
 		if (!object_name)
@@ -2224,20 +2236,34 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
 			pages += page_count;
 		}
 
-		osd_req = rbd_osd_req_create(rbd_dev, write_request, 1,
+		osd_req = rbd_osd_req_create(rbd_dev, write_request,
+					     (write_request ? 2 : 1),
 					     obj_request);
 		if (!osd_req)
 			goto out_partial;
 		obj_request->osd_req = osd_req;
 		obj_request->callback = rbd_img_obj_callback;
 
-		osd_req_op_extent_init(osd_req, 0, opcode, offset, length,
-						0, 0);
+		if (write_request) {
+			osd_req_op_hint_init(osd_req, 0,
+					     CEPH_OSD_OP_SETALLOCHINT,
+					     rbd_obj_bytes(&rbd_dev->header),
+					     rbd_obj_bytes(&rbd_dev->header),
+					     0);
+
+			which = 1;
+			opcode = CEPH_OSD_OP_WRITE;
+		} else {
+			which = 0;
+			opcode = CEPH_OSD_OP_READ;
+		}
+		osd_req_op_extent_init(osd_req, which, opcode, offset, length,
+				       0, 0);
 		if (type == OBJ_REQUEST_BIO)
-			osd_req_op_extent_osd_data_bio(osd_req, 0,
+			osd_req_op_extent_osd_data_bio(osd_req, which,
 					obj_request->bio_list, length);
 		else
-			osd_req_op_extent_osd_data_pages(osd_req, 0,
+			osd_req_op_extent_osd_data_pages(osd_req, which,
 					obj_request->pages, length,
 					offset & ~PAGE_MASK, false, false);
 
@@ -2358,7 +2384,7 @@ rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
 
 	/*
 	 * The original osd request is of no use to use any more.
-	 * We need a new one that can hold the two ops in a copyup
+	 * We need a new one that can hold the three ops in a copyup
 	 * request.  Allocate the new copyup osd request for the
 	 * original request, and release the old one.
 	 */
@@ -2377,17 +2403,23 @@ rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
 	osd_req_op_cls_request_data_pages(osd_req, 0, pages, parent_length, 0,
 						false, false);
 
-	/* Then the original write request op */
+	/* Then the hint op */
+
+	osd_req_op_hint_init(osd_req, 1, CEPH_OSD_OP_SETALLOCHINT,
+			     rbd_obj_bytes(&rbd_dev->header),
+			     rbd_obj_bytes(&rbd_dev->header), 0);
+
+	/* And the original write request op */
 
 	offset = orig_request->offset;
 	length = orig_request->length;
-	osd_req_op_extent_init(osd_req, 1, CEPH_OSD_OP_WRITE,
+	osd_req_op_extent_init(osd_req, 2, CEPH_OSD_OP_WRITE,
 					offset, length, 0, 0);
 	if (orig_request->type == OBJ_REQUEST_BIO)
-		osd_req_op_extent_osd_data_bio(osd_req, 1,
+		osd_req_op_extent_osd_data_bio(osd_req, 2,
 					orig_request->bio_list, length);
 	else
-		osd_req_op_extent_osd_data_pages(osd_req, 1,
+		osd_req_op_extent_osd_data_pages(osd_req, 2,
 					orig_request->pages, length,
 					offset & ~PAGE_MASK, false, false);
 
-- 
1.7.10.4


* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 ` [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
@ 2014-02-23 16:03   ` Sage Weil
  2014-02-24 14:59   ` Alex Elder
  1 sibling, 0 replies; 26+ messages in thread
From: Sage Weil @ 2014-02-23 16:03 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

On Fri, 21 Feb 2014, Ilya Dryomov wrote:
> This is primarily for rbd's benefit and is supposed to combat
> fragmentation:
> 
> "... knowing that rbd images have a 4m size, librbd can pass a hint
> that will let the osd do the xfs allocation size ioctl on new files so
> that they are allocated in 1m or 4m chunks.  We've seen cases where
> users with rbd workloads have very high levels of fragmentation in xfs
> and this would mitigate that and probably have a pretty nice
> performance benefit."
> 
> SETALLOCHINT is considered advisory, so our backwards-compatibility
> mechanism here is to set the FAILOK flag on all SETALLOCHINT ops.
> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osd_client.h |    9 +++++++++
>  include/linux/ceph/rados.h      |    8 ++++++++
>  net/ceph/osd_client.c           |   30 ++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index e94f5da251d6..6bfcb0eca8ab 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -103,6 +103,11 @@ struct ceph_osd_req_op {
>  			u32 timeout;
>  			__u8 flag;
>  		} watch;
> +		struct {
> +			u64 expected_size;
> +			u64 expected_write_size;
> +			__u8 expected_size_probability;
> +		} hint;

s/hint/alloc_hint/ ?

>  	};
>  };
>  
> @@ -294,6 +299,10 @@ extern void osd_req_op_cls_init(struct ceph_osd_request *osd_req,
>  extern void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
>  					unsigned int which, u16 opcode,
>  					u64 cookie, u64 version, int flag);
> +extern void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
> +				 unsigned int which, u16 opcode,
> +				 u64 expected_size, u64 expected_write_size,
> +				 u8 expected_size_probability);
>  
>  extern struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client *osdc,
>  					       struct ceph_snap_context *snapc,
> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
> index 8f9bf4570215..b8e2dd11f186 100644
> --- a/include/linux/ceph/rados.h
> +++ b/include/linux/ceph/rados.h
> @@ -227,6 +227,9 @@ enum {
>  	CEPH_OSD_OP_OMAPRMKEYS    = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 24,
>  	CEPH_OSD_OP_OMAP_CMP      = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 25,
>  
> +	/* hints */
> +	CEPH_OSD_OP_SETALLOCHINT = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 35,
> +
>  	/** multi **/
>  	CEPH_OSD_OP_CLONERANGE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_MULTI | 1,
>  	CEPH_OSD_OP_ASSERT_SRC_VERSION = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_MULTI | 2,
> @@ -416,6 +419,11 @@ struct ceph_osd_op {
>  			__le64 offset, length;
>  			__le64 src_offset;
>  		} __attribute__ ((packed)) clonerange;
> +		struct {
> +			__le64 expected_size;
> +			__le64 expected_write_size;
> +			__u8 expected_size_probability;
> +		} __attribute__ ((packed)) hint;

s/hint/alloc_hint/, I think.  Just made the same comment on the user space 
side.

>  	};
>  	__le32 payload_len;
>  } __attribute__ ((packed));
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 5d7fd0b8c1c8..4090f6e8db3a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -436,6 +436,7 @@ static bool osd_req_opcode_valid(u16 opcode)
>  	case CEPH_OSD_OP_OMAPCLEAR:
>  	case CEPH_OSD_OP_OMAPRMKEYS:
>  	case CEPH_OSD_OP_OMAP_CMP:
> +	case CEPH_OSD_OP_SETALLOCHINT:
>  	case CEPH_OSD_OP_CLONERANGE:
>  	case CEPH_OSD_OP_ASSERT_SRC_VERSION:
>  	case CEPH_OSD_OP_SRC_CMPXATTR:
> @@ -591,6 +592,28 @@ void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
>  }
>  EXPORT_SYMBOL(osd_req_op_watch_init);
>  
> +void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
> +			  unsigned int which, u16 opcode,
> +			  u64 expected_size, u64 expected_write_size,
> +			  u8 expected_size_probability)
> +{
> +	struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
> +
> +	BUG_ON(opcode != CEPH_OSD_OP_SETALLOCHINT);

I would just drop the opcode argument all together.  And 
s/hint/alloc_hint/ in the function name...  I wouldn't expect that any 
other type of hint would have these same arguments.

> +
> +	op->hint.expected_size = expected_size;
> +	op->hint.expected_write_size = expected_write_size;
> +	op->hint.expected_size_probability = expected_size_probability;
> +
> +	/*
> +	 * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
> +	 * not worth a feature bit.  Set FAILOK per-op flag to make
> +	 * sure older osds don't trip over an unsupported opcode.
> +	 */
> +	op->flags |= CEPH_OSD_OP_FLAG_FAILOK;
> +}
> +EXPORT_SYMBOL(osd_req_op_hint_init);
> +
>  static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
>  				struct ceph_osd_data *osd_data)
>  {
> @@ -681,6 +704,13 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
>  		dst->watch.ver = cpu_to_le64(src->watch.ver);
>  		dst->watch.flag = src->watch.flag;
>  		break;
> +	case CEPH_OSD_OP_SETALLOCHINT:
> +		dst->hint.expected_size = cpu_to_le64(src->hint.expected_size);
> +		dst->hint.expected_write_size =
> +		    cpu_to_le64(src->hint.expected_write_size);
> +		dst->hint.expected_size_probability =
> +		    src->hint.expected_size_probability;
> +		break;
>  	default:
>  		pr_err("unsupported osd opcode %s\n",
>  			ceph_osd_op_name(src->op));
> -- 
> 1.7.10.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

* Re: [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (5 preceding siblings ...)
  2014-02-21 18:55 ` [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
@ 2014-02-23 16:14 ` Sage Weil
  2014-02-23 16:15   ` Alex Elder
  2014-02-24 14:58 ` Alex Elder
  7 siblings, 1 reply; 26+ messages in thread
From: Sage Weil @ 2014-02-23 16:14 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: ceph-devel

Reviewed-by: Sage Weil <sage@inktank.com>

It would be great if Josh or Alex could also look (at 6 especially) as 
they are more familiar with the RBD code than I am, but I think it's safe 
to put this in testing for now.

Thanks!
sage


On Fri, 21 Feb 2014, Ilya Dryomov wrote:

> Hello,
> 
> This series adds support for CEPH_OSD_OP_SETALLOCHINT osd op to libceph
> along with adjusting rbd to make use of it.  The rationale and the
> basic design were outlined in the "rados io hints" thread on ceph-devel
> about a month ago.
> 
> Thanks,
> 
>                 Ilya
> 
> 
> Ilya Dryomov (6):
>   libceph: encode CEPH_OSD_OP_FLAG_* op flags
>   libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
>   libceph: bump CEPH_OSD_MAX_OP to 3
>   rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
>   rbd: num_ops parameter for rbd_osd_req_create()
>   rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
> 
>  drivers/block/rbd.c             |   90 ++++++++++++++++++++++++++++-----------
>  include/linux/ceph/osd_client.h |   12 +++++-
>  include/linux/ceph/rados.h      |   10 ++++-
>  net/ceph/osd_client.c           |   32 ++++++++++++++
>  4 files changed, 117 insertions(+), 27 deletions(-)
> 
> -- 
> 1.7.10.4
> 
> 
> 

* Re: [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-23 16:14 ` [PATCH 0/6] libceph: " Sage Weil
@ 2014-02-23 16:15   ` Alex Elder
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Elder @ 2014-02-23 16:15 UTC (permalink / raw)
  To: Sage Weil, Ilya Dryomov; +Cc: ceph-devel

On 02/23/2014 10:14 AM, Sage Weil wrote:
> Reviewed-by: Sage Weil <sage@inktank.com>
> 
> It would be great if Josh or Alex could also look (at 6 especially) as 
> they are more familiar with the RBD code than I am, but I think it's safe 
> to put this in testing for now.

I will.  I have been planning on it but I didn't have time
on Friday.  I think I'll be able to today though.

					-Alex

> Thanks!
> sage
> 
> 
> On Fri, 21 Feb 2014, Ilya Dryomov wrote:
> 
>> Hello,
>>
>> This series adds support for CEPH_OSD_OP_SETALLOCHINT osd op to libceph
>> along with adjusting rbd to make use of it.  The rationale and the
>> basic design were outlined in the "rados io hints" thread on ceph-devel
>> about a month ago.
>>
>> Thanks,
>>
>>                 Ilya
>>
>>
>> Ilya Dryomov (6):
>>   libceph: encode CEPH_OSD_OP_FLAG_* op flags
>>   libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
>>   libceph: bump CEPH_OSD_MAX_OP to 3
>>   rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
>>   rbd: num_ops parameter for rbd_osd_req_create()
>>   rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
>>
>>  drivers/block/rbd.c             |   90 ++++++++++++++++++++++++++++-----------
>>  include/linux/ceph/osd_client.h |   12 +++++-
>>  include/linux/ceph/rados.h      |   10 ++++-
>>  net/ceph/osd_client.c           |   32 ++++++++++++++
>>  4 files changed, 117 insertions(+), 27 deletions(-)
>>
>> -- 
>> 1.7.10.4
>>
>>
>>
> 



* Re: [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
                   ` (6 preceding siblings ...)
  2014-02-23 16:14 ` [PATCH 0/6] libceph: " Sage Weil
@ 2014-02-24 14:58 ` Alex Elder
  2014-02-25 12:50   ` Ilya Dryomov
  7 siblings, 1 reply; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:58 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> Hello,
> 
> This series adds support for CEPH_OSD_OP_SETALLOCHINT osd op to libceph
> along with adjusting rbd to make use of it.  The rationale and the
> basic design was outlined in the "rados io hints" thread on ceph-devel
> about a month ago.

It looks like some of this discussion is still active, so it
might be a little premature to implement this.  That being said,
I reviewed what you have.  Overall it looks good to me.  I have
some small suggestions for you to consider.  I also think that
in this case just extending the definition of a write request
is preferable to encoding this stuff in a distinct new op, but
I may be missing some information.

					-Alex

> 
> Thanks,
> 
>                 Ilya
> 
> 
> Ilya Dryomov (6):
>   libceph: encode CEPH_OSD_OP_FLAG_* op flags
>   libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
>   libceph: bump CEPH_OSD_MAX_OP to 3
>   rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
>   rbd: num_ops parameter for rbd_osd_req_create()
>   rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
> 
>  drivers/block/rbd.c             |   90 ++++++++++++++++++++++++++++-----------
>  include/linux/ceph/osd_client.h |   12 +++++-
>  include/linux/ceph/rados.h      |   10 ++++-
>  net/ceph/osd_client.c           |   32 ++++++++++++++
>  4 files changed, 117 insertions(+), 27 deletions(-)
> 



* Re: [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags
  2014-02-21 18:55 ` [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags Ilya Dryomov
@ 2014-02-24 14:58   ` Alex Elder
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:58 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> Encode ceph_osd_op::flags field so that it gets sent over the wire.

This looks fine.  The flags field was never being encoded
before, so there's no change in the protocol.  And you're
just adding a flags field to the local structure.

Reviewed-by: Alex Elder <elder@linaro.org>


> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osd_client.h |    1 +
>  include/linux/ceph/rados.h      |    2 +-
>  net/ceph/osd_client.c           |    2 ++
>  3 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index fd47e872ebcc..e94f5da251d6 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -76,6 +76,7 @@ struct ceph_osd_data {
>  
>  struct ceph_osd_req_op {
>  	u16 op;           /* CEPH_OSD_OP_* */
> +	u32 flags;        /* CEPH_OSD_OP_FLAG_* */
>  	u32 payload_len;
>  	union {
>  		struct ceph_osd_data raw_data_in;
> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
> index 96292df4041b..8f9bf4570215 100644
> --- a/include/linux/ceph/rados.h
> +++ b/include/linux/ceph/rados.h
> @@ -382,7 +382,7 @@ enum {
>   */
>  struct ceph_osd_op {
>  	__le16 op;           /* CEPH_OSD_OP_* */
> -	__le32 flags;        /* CEPH_OSD_FLAG_* */
> +	__le32 flags;        /* CEPH_OSD_OP_FLAG_* */
>  	union {
>  		struct {
>  			__le64 offset, length;
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 0676f2b199d6..5d7fd0b8c1c8 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -688,7 +688,9 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
>  
>  		return 0;
>  	}
> +
>  	dst->op = cpu_to_le16(src->op);
> +	dst->flags = cpu_to_le32(src->flags);
>  	dst->payload_len = cpu_to_le32(src->payload_len);
>  
>  	return request_data_len;
> 



* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 ` [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
  2014-02-23 16:03   ` Sage Weil
@ 2014-02-24 14:59   ` Alex Elder
  2014-02-25 12:52     ` Ilya Dryomov
  1 sibling, 1 reply; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> This is primarily for rbd's benefit and is supposed to combat
> fragmentation:
> 
> "... knowing that rbd images have a 4m size, librbd can pass a hint
> that will let the osd do the xfs allocation size ioctl on new files so
> that they are allocated in 1m or 4m chunks.  We've seen cases where
> users with rbd workloads have very high levels of fragmentation in xfs
> and this would mitigate that and probably have a pretty nice
> performance benefit."
> 
> SETALLOCHINT is considered advisory, so our backwards compatibility
> mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

I have a few small comments for you to consider but this
looks good to me.

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  include/linux/ceph/osd_client.h |    9 +++++++++
>  include/linux/ceph/rados.h      |    8 ++++++++
>  net/ceph/osd_client.c           |   30 ++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index e94f5da251d6..6bfcb0eca8ab 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -103,6 +103,11 @@ struct ceph_osd_req_op {
>  			u32 timeout;
>  			__u8 flag;
>  		} watch;
> +		struct {
> +			u64 expected_size;

Maybe "expected_object_size"?  I wasn't sure until I read some
of the e-mail discussion what size this was referring to, and
wondered whether it was for reads or something.  This should
be done on the over-the-wire and server side structure
definitions too (if it's done at all).

The other thing is that the expected size is limited by
rbd_image_header->obj_order, which is a single byte.  I
think you should encode this the same way.  Even if the
hint were for more than RBD, this level of granularity
may be sufficient.

> +			u64 expected_write_size;

Probably the same thing here, a byte might be enough
to encode this by using log2(expected_write_size).
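
For illustration, the byte encoding I have in mind could look like the
userspace sketch below.  The struct and helper names are hypothetical
(size_to_order() stands in for the kernel's ilog2()), and it assumes
power-of-two sizes, which holds for rbd's 1 << obj_order objects:

```c
#include <stdint.h>

/*
 * Hypothetical compact encoding for the hint: store log2 of each
 * size in a single byte, the same way rbd_image_header->obj_order
 * encodes the object size.  Assumes power-of-two sizes.
 */
struct compact_alloc_hint {
	uint8_t expected_object_order;	/* log2(expected_object_size) */
	uint8_t expected_write_order;	/* log2(expected_write_size) */
};

/* Userspace stand-in for the kernel's ilog2(); exact for powers of 2. */
static inline uint8_t size_to_order(uint64_t size)
{
	uint8_t order = 0;

	while (size > 1) {
		size >>= 1;
		order++;
	}
	return order;
}

static inline uint64_t order_to_size(uint8_t order)
{
	return (uint64_t)1 << order;
}
```

A 4M rbd object would then travel as order 22 in one byte instead of
a __le64.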

> +			__u8 expected_size_probability;

I'm not sure why these single-byte values use __u8 while the
multi-byte values use (e.g.) u32.  The leading underscores
are supposed to be used for values exposed to user space (or
something like that).  Anyway, not a problem with your change,
but something maybe you could check into.

> +		} hint;
>  	};
>  };
>  
> @@ -294,6 +299,10 @@ extern void osd_req_op_cls_init(struct ceph_osd_request *osd_req,
>  extern void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
>  					unsigned int which, u16 opcode,
>  					u64 cookie, u64 version, int flag);
> +extern void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
> +				 unsigned int which, u16 opcode,
> +				 u64 expected_size, u64 expected_write_size,
> +				 u8 expected_size_probability);
>  
>  extern struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client *osdc,
>  					       struct ceph_snap_context *snapc,
> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
> index 8f9bf4570215..b8e2dd11f186 100644
> --- a/include/linux/ceph/rados.h
> +++ b/include/linux/ceph/rados.h
> @@ -227,6 +227,9 @@ enum {
>  	CEPH_OSD_OP_OMAPRMKEYS    = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 24,
>  	CEPH_OSD_OP_OMAP_CMP      = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 25,
>  
> +	/* hints */
> +	CEPH_OSD_OP_SETALLOCHINT = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 35,
> +
>  	/** multi **/
>  	CEPH_OSD_OP_CLONERANGE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_MULTI | 1,
>  	CEPH_OSD_OP_ASSERT_SRC_VERSION = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_MULTI | 2,
> @@ -416,6 +419,11 @@ struct ceph_osd_op {
>  			__le64 offset, length;
>  			__le64 src_offset;
>  		} __attribute__ ((packed)) clonerange;
> +		struct {
> +			__le64 expected_size;
> +			__le64 expected_write_size;
> +			__u8 expected_size_probability;
> +		} __attribute__ ((packed)) hint;
>  	};
>  	__le32 payload_len;
>  } __attribute__ ((packed));
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index 5d7fd0b8c1c8..4090f6e8db3a 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -436,6 +436,7 @@ static bool osd_req_opcode_valid(u16 opcode)
>  	case CEPH_OSD_OP_OMAPCLEAR:
>  	case CEPH_OSD_OP_OMAPRMKEYS:
>  	case CEPH_OSD_OP_OMAP_CMP:
> +	case CEPH_OSD_OP_SETALLOCHINT:
>  	case CEPH_OSD_OP_CLONERANGE:
>  	case CEPH_OSD_OP_ASSERT_SRC_VERSION:
>  	case CEPH_OSD_OP_SRC_CMPXATTR:
> @@ -591,6 +592,28 @@ void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
>  }
>  EXPORT_SYMBOL(osd_req_op_watch_init);
>  
> +void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
> +			  unsigned int which, u16 opcode,
> +			  u64 expected_size, u64 expected_write_size,
> +			  u8 expected_size_probability)
> +{
> +	struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
> +
> +	BUG_ON(opcode != CEPH_OSD_OP_SETALLOCHINT);
> +
> +	op->hint.expected_size = expected_size;
> +	op->hint.expected_write_size = expected_write_size;
> +	op->hint.expected_size_probability = expected_size_probability;
> +
> +	/*
> +	 * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
> +	 * not worth a feature bit.  Set FAILOK per-op flag to make
> +	 * sure older osds don't trip over an unsupported opcode.
> +	 */
> +	op->flags |= CEPH_OSD_OP_FLAG_FAILOK;

I think this is reasonable.  The only thing I wonder about is
whether there could be any other failure that could occur on
an OSD server that actually supports the op that we would care
about.  It's still advisory though, so functionally it won't
matter.
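
As I read it, the backwards-compat semantics of that flag amount to
something like the following on the osd side.  This is illustrative
pseudo-logic in C, not actual osd server code; only the flag value is
taken from ceph's rados.h:

```c
#include <errno.h>

#define CEPH_OSD_OP_FLAG_FAILOK 2	/* value as in ceph's rados.h */

/*
 * Sketch of the per-op FAILOK semantics described above: an osd that
 * does not understand an opcode would normally fail the whole
 * request, but with FAILOK set on the op it records success for that
 * op and carries on, so an advisory hint is simply ignored by older
 * osds.
 */
static int apply_op_result(int raw_result, unsigned int op_flags)
{
	if (raw_result == -EOPNOTSUPP && (op_flags & CEPH_OSD_OP_FLAG_FAILOK))
		return 0;	/* advisory op: ignore the failure */
	return raw_result;
}
```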

> +}
> +EXPORT_SYMBOL(osd_req_op_hint_init);
> +
>  static void ceph_osdc_msg_data_add(struct ceph_msg *msg,
>  				struct ceph_osd_data *osd_data)
>  {
> @@ -681,6 +704,13 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
>  		dst->watch.ver = cpu_to_le64(src->watch.ver);
>  		dst->watch.flag = src->watch.flag;
>  		break;
> +	case CEPH_OSD_OP_SETALLOCHINT:
> +		dst->hint.expected_size = cpu_to_le64(src->hint.expected_size);
> +		dst->hint.expected_write_size =
> +		    cpu_to_le64(src->hint.expected_write_size);
> +		dst->hint.expected_size_probability =
> +		    src->hint.expected_size_probability;
> +		break;
>  	default:
>  		pr_err("unsupported osd opcode %s\n",
>  			ceph_osd_op_name(src->op));
> 



* Re: [PATCH 3/6] libceph: bump CEPH_OSD_MAX_OP to 3
  2014-02-21 18:55 ` [PATCH 3/6] libceph: bump CEPH_OSD_MAX_OP to 3 Ilya Dryomov
@ 2014-02-24 14:59   ` Alex Elder
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> Our longest osd request now contains 3 ops: copyup+hint+write.

This is pretty trivial and could be squashed into the
patch that actually starts using more than two ops.
It could certainly be squashed together with the next
patch (that fixes the assertion).

That being said, I appreciate your breaking up the
functionality into logically separate changes, it's
important.

So I guess, do what you like...

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  include/linux/ceph/osd_client.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index 6bfcb0eca8ab..bc874b2398d7 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -43,7 +43,7 @@ struct ceph_osd {
>  };
>  
>  
> -#define CEPH_OSD_MAX_OP	2
> +#define CEPH_OSD_MAX_OP	3
>  
>  enum ceph_osd_data_type {
>  	CEPH_OSD_DATA_TYPE_NONE = 0,
> 



* Re: [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
  2014-02-21 18:55 ` [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback() Ilya Dryomov
@ 2014-02-24 14:59   ` Alex Elder
  2014-02-25 12:53     ` Ilya Dryomov
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> CEPH_OSD_MAX_OP value in rbd_osd_req_callback() is hard-coded to 2.
> Fix it.

Please squash this in with the previous patch (at least).
Change the BUG_ON() to rbd_assert() while you're at it
(and invert the logic).

Reviewed-by: Alex Elder <elder@linaro.org>

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  drivers/block/rbd.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index b365e0dfccb6..48a889866824 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1654,7 +1654,7 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
>  	if (osd_req->r_result < 0)
>  		obj_request->result = osd_req->r_result;
>  
> -	BUG_ON(osd_req->r_num_ops > 2);
> +	BUG_ON(osd_req->r_num_ops > CEPH_OSD_MAX_OP);
>  
>  	/*
>  	 * We support a 64-bit length, but ultimately it has to be
> 



* Re: [PATCH 5/6] rbd: num_ops parameter for rbd_osd_req_create()
  2014-02-21 18:55 ` [PATCH 5/6] rbd: num_ops parameter for rbd_osd_req_create() Ilya Dryomov
@ 2014-02-24 14:59   ` Alex Elder
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> In preparation for prefixing rbd writes with an allocation hint
> introduce a num_ops parameter for rbd_osd_req_create().  The rationale
> is that not every write request is a write op that needs to be prefixed
> (e.g. watch op), so the num_ops logic needs to be in the callers.
> 
> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>

Looks good.

Reviewed-by: Alex Elder <elder@linaro.org>

> ---
>  drivers/block/rbd.c |   28 ++++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 48a889866824..6cf001ef00bc 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1718,6 +1718,7 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
>  static struct ceph_osd_request *rbd_osd_req_create(
>  					struct rbd_device *rbd_dev,
>  					bool write_request,
> +					unsigned int num_ops,
>  					struct rbd_obj_request *obj_request)
>  {
>  	struct ceph_snap_context *snapc = NULL;
> @@ -1733,10 +1734,13 @@ static struct ceph_osd_request *rbd_osd_req_create(
>  			snapc = img_request->snapc;
>  	}
>  
> -	/* Allocate and initialize the request, for the single op */
> +	rbd_assert(num_ops == 1);
> +
> +	/* Allocate and initialize the request, for the num_ops ops */
>  
>  	osdc = &rbd_dev->rbd_client->client->osdc;
> -	osd_req = ceph_osdc_alloc_request(osdc, snapc, 1, false, GFP_ATOMIC);
> +	osd_req = ceph_osdc_alloc_request(osdc, snapc, num_ops, false,
> +					  GFP_ATOMIC);
>  	if (!osd_req)
>  		return NULL;	/* ENOMEM */
>  
> @@ -2220,8 +2224,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>  			pages += page_count;
>  		}
>  
> -		osd_req = rbd_osd_req_create(rbd_dev, write_request,
> -						obj_request);
> +		osd_req = rbd_osd_req_create(rbd_dev, write_request, 1,
> +					     obj_request);
>  		if (!osd_req)
>  			goto out_partial;
>  		obj_request->osd_req = osd_req;
> @@ -2604,8 +2608,8 @@ static int rbd_img_obj_exists_submit(struct rbd_obj_request *obj_request)
>  
>  	rbd_assert(obj_request->img_request);
>  	rbd_dev = obj_request->img_request->rbd_dev;
> -	stat_request->osd_req = rbd_osd_req_create(rbd_dev, false,
> -						stat_request);
> +	stat_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
> +						   stat_request);
>  	if (!stat_request->osd_req)
>  		goto out;
>  	stat_request->callback = rbd_img_obj_exists_callback;
> @@ -2808,7 +2812,8 @@ static int rbd_obj_notify_ack_sync(struct rbd_device *rbd_dev, u64 notify_id)
>  		return -ENOMEM;
>  
>  	ret = -ENOMEM;
> -	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
> +	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
> +						  obj_request);
>  	if (!obj_request->osd_req)
>  		goto out;
>  
> @@ -2871,7 +2876,8 @@ static int __rbd_dev_header_watch_sync(struct rbd_device *rbd_dev, bool start)
>  	if (!obj_request)
>  		goto out_cancel;
>  
> -	obj_request->osd_req = rbd_osd_req_create(rbd_dev, true, obj_request);
> +	obj_request->osd_req = rbd_osd_req_create(rbd_dev, true, 1,
> +						  obj_request);
>  	if (!obj_request->osd_req)
>  		goto out_cancel;
>  
> @@ -2979,7 +2985,8 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
>  	obj_request->pages = pages;
>  	obj_request->page_count = page_count;
>  
> -	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
> +	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
> +						  obj_request);
>  	if (!obj_request->osd_req)
>  		goto out;
>  
> @@ -3212,7 +3219,8 @@ static int rbd_obj_read_sync(struct rbd_device *rbd_dev,
>  	obj_request->pages = pages;
>  	obj_request->page_count = page_count;
>  
> -	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, obj_request);
> +	obj_request->osd_req = rbd_osd_req_create(rbd_dev, false, 1,
> +						  obj_request);
>  	if (!obj_request->osd_req)
>  		goto out;
>  
> 



* Re: [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-21 18:55 ` [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
@ 2014-02-24 14:59   ` Alex Elder
  2014-02-25 12:58     ` Ilya Dryomov
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Elder @ 2014-02-24 14:59 UTC (permalink / raw)
  To: Ilya Dryomov, ceph-devel

On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
> In an effort to reduce fragmentation, prefix every rbd write with
> a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
> to the object size (1 << order).  Backwards compatibility is taken care
> of on the libceph/osd side.

If *every* write will include a hint, why even encode this as
a distinct opcode?  Why not just extend the definition of a
write operation to include the write hint data?  The server
side could check expected_object_size, and if 0 (or some other
invalid value) it means the client isn't supplying a hint.

However, on the assumption you want this to be a distinct
OSD op I think you generally did the right thing.  See my
comments below.  For now I'm not indicating "Reviewed-by"
because it sounds like the nature of this change is under
discussion still.  And I really do think that if the hint
is not going to be made more generic (and possibly even if
it is) I'd rather see this hinting done using an extension
of the write operation (like I suggest above).  In this
case it is clearly directly tied to every write operation
and separating it sort of obscures that.
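
For illustration, that alternative might look something like the
userspace sketch below.  The first four fields mirror the kernel's
ceph_osd_op extent layout; everything past them is hypothetical, not
what this series actually encodes:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of the alternative: extend the existing extent (write) op
 * encoding with optional hint fields rather than defining a new
 * SETALLOCHINT op.  expected_object_size == 0 would mean the client
 * supplied no hint.  Hypothetical layout for discussion only.
 */
struct extent_op_hinted {
	uint64_t offset, length;	/* as in ceph_osd_op.extent */
	uint64_t truncate_size;
	uint32_t truncate_seq;
	uint64_t expected_object_size;	/* 0 == no hint supplied */
	uint64_t expected_write_size;
} __attribute__ ((packed));

static inline int extent_op_has_hint(const struct extent_op_hinted *op)
{
	return op->expected_object_size != 0;
}
```

A server could then hint-check every write without a second op on the
wire, at the cost of growing the write encoding for all clients.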

					-Alex

> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
> ---
>  drivers/block/rbd.c |   66 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 49 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 6cf001ef00bc..14496f39c770 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -1662,7 +1662,12 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
>  	 */
>  	obj_request->xferred = osd_req->r_reply_op_len[0];
>  	rbd_assert(obj_request->xferred < (u64)UINT_MAX);
> +
>  	opcode = osd_req->r_ops[0].op;
> +	if (opcode == CEPH_OSD_OP_SETALLOCHINT) {
> +		BUG_ON(osd_req->r_ops[1].op != CEPH_OSD_OP_WRITE);
> +		opcode = CEPH_OSD_OP_WRITE;
> +	}

I would rather see this handled as a distinct case below
rather than fudging the opcode.  Bury that BUG_ON() inside
rbd_osd_write_callback() to avoid encoding those op-specific
details here.  And change the BUG_ON() to rbd_assert().

>  	switch (opcode) {
>  	case CEPH_OSD_OP_READ:
>  		rbd_osd_read_callback(obj_request);
> @@ -1715,6 +1720,12 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
>  			snapc, CEPH_NOSNAP, &mtime);
>  }
>  
> +/*
> + * Create an osd request.  A read request has one osd op (read).
> + * A write request has either one (watch) or two (hint+write) osd ops.
> + * (All rbd writes are prefixed with an allocation hint op, but

Maybe "rbd data writes" or something.  A watch changes state but
doesn't write data.

> + * technically osd watch is a write request, hence this distinction.)
> + */
>  static struct ceph_osd_request *rbd_osd_req_create(
>  					struct rbd_device *rbd_dev,
>  					bool write_request,
> @@ -1734,7 +1745,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
>  			snapc = img_request->snapc;
>  	}
>  
> -	rbd_assert(num_ops == 1);
> +	rbd_assert((!write_request && num_ops == 1) ||
> +		   (write_request && num_ops >= 1 && num_ops <= 2));

Maybe:
	rbd_assert(num_ops == 1 || (write_request && num_ops == 2));

>  
>  	/* Allocate and initialize the request, for the num_ops ops */
>  
> @@ -1760,8 +1772,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
>  
>  /*
>   * Create a copyup osd request based on the information in the
> - * object request supplied.  A copyup request has two osd ops,
> - * a copyup method call, and a "normal" write request.
> + * object request supplied.  A copyup request has three osd ops,
> + * a copyup method call, a hint op, and a write op.

There's a comment in the caller that should be similarly updated:
         * We need a new one that can hold the two ops in a copyup
                                               ^^^
>   */
>  static struct ceph_osd_request *
>  rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
> @@ -1777,12 +1789,12 @@ rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
>  	rbd_assert(img_request);
>  	rbd_assert(img_request_write_test(img_request));
>  
> -	/* Allocate and initialize the request, for the two ops */
> +	/* Allocate and initialize the request, for the three ops */
>  
>  	snapc = img_request->snapc;
>  	rbd_dev = img_request->rbd_dev;
>  	osdc = &rbd_dev->rbd_client->client->osdc;
> -	osd_req = ceph_osdc_alloc_request(osdc, snapc, 2, false, GFP_ATOMIC);
> +	osd_req = ceph_osdc_alloc_request(osdc, snapc, 3, false, GFP_ATOMIC);
>  	if (!osd_req)
>  		return NULL;	/* ENOMEM */
>  
> @@ -2159,12 +2171,10 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>  	struct page **pages = NULL;
>  	u64 img_offset;
>  	u64 resid;
> -	u16 opcode;

Why move opcode inside the inner loop?  It's not going
to change, is it?  (Maybe this was a leftover from an
abandoned attempt to do this a different way.)

>  
>  	dout("%s: img %p type %d data_desc %p\n", __func__, img_request,
>  		(int)type, data_desc);
>  
> -	opcode = write_request ? CEPH_OSD_OP_WRITE : CEPH_OSD_OP_READ;
>  	img_offset = img_request->offset;
>  	resid = img_request->length;
>  	rbd_assert(resid > 0);
> @@ -2183,6 +2193,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>  		const char *object_name;
>  		u64 offset;
>  		u64 length;
> +		unsigned int which;

		unsigned int which = 0;

> +		u16 opcode;
>  
>  		object_name = rbd_segment_name(rbd_dev, img_offset);
>  		if (!object_name)
> @@ -2224,20 +2236,34 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>  			pages += page_count;
>  		}
>  
> -		osd_req = rbd_osd_req_create(rbd_dev, write_request, 1,
> +		osd_req = rbd_osd_req_create(rbd_dev, write_request,
> +					     (write_request ? 2 : 1),
>  					     obj_request);
>  		if (!osd_req)
>  			goto out_partial;
>  		obj_request->osd_req = osd_req;
>  		obj_request->callback = rbd_img_obj_callback;
>  
> -		osd_req_op_extent_init(osd_req, 0, opcode, offset, length,
> -						0, 0);
> +		if (write_request) {



> +			osd_req_op_hint_init(osd_req, 0,
> +					     CEPH_OSD_OP_SETALLOCHINT,
> +					     rbd_obj_bytes(&rbd_dev->header),
> +					     rbd_obj_bytes(&rbd_dev->header),
> +					     0);
> +
> +			which = 1;
			which++;

> +			opcode = CEPH_OSD_OP_WRITE;
> +		} else {

And this block goes away (assign opcode outside the loop).

> +			which = 0;
> +			opcode = CEPH_OSD_OP_READ;
> +		}
> +		osd_req_op_extent_init(osd_req, which, opcode, offset, length,
> +				       0, 0);
>  		if (type == OBJ_REQUEST_BIO)
> -			osd_req_op_extent_osd_data_bio(osd_req, 0,
> +			osd_req_op_extent_osd_data_bio(osd_req, which,
>  					obj_request->bio_list, length);
>  		else
> -			osd_req_op_extent_osd_data_pages(osd_req, 0,
> +			osd_req_op_extent_osd_data_pages(osd_req, which,
>  					obj_request->pages, length,
>  					offset & ~PAGE_MASK, false, false);
>  
> @@ -2358,7 +2384,7 @@ rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
>  
>  	/*
>  	 * The original osd request is of no use to use any more.
> -	 * We need a new one that can hold the two ops in a copyup
> +	 * We need a new one that can hold the three ops in a copyup
>  	 * request.  Allocate the new copyup osd request for the
>  	 * original request, and release the old one.
>  	 */
> @@ -2377,17 +2403,23 @@ rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
>  	osd_req_op_cls_request_data_pages(osd_req, 0, pages, parent_length, 0,
>  						false, false);
>  
> -	/* Then the original write request op */
> +	/* Then the hint op */
> +
> +	osd_req_op_hint_init(osd_req, 1, CEPH_OSD_OP_SETALLOCHINT,
> +			     rbd_obj_bytes(&rbd_dev->header),
> +			     rbd_obj_bytes(&rbd_dev->header), 0);
> +
> +	/* And the original write request op */
>  
>  	offset = orig_request->offset;
>  	length = orig_request->length;
> -	osd_req_op_extent_init(osd_req, 1, CEPH_OSD_OP_WRITE,
> +	osd_req_op_extent_init(osd_req, 2, CEPH_OSD_OP_WRITE,
>  					offset, length, 0, 0);
>  	if (orig_request->type == OBJ_REQUEST_BIO)
> -		osd_req_op_extent_osd_data_bio(osd_req, 1,
> +		osd_req_op_extent_osd_data_bio(osd_req, 2,
>  					orig_request->bio_list, length);
>  	else
> -		osd_req_op_extent_osd_data_pages(osd_req, 1,
> +		osd_req_op_extent_osd_data_pages(osd_req, 2,
>  					orig_request->pages, length,
>  					offset & ~PAGE_MASK, false, false);
>  
> 



* Re: [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-24 14:58 ` Alex Elder
@ 2014-02-25 12:50   ` Ilya Dryomov
  0 siblings, 0 replies; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-25 12:50 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Mon, Feb 24, 2014 at 4:58 PM, Alex Elder <elder@ieee.org> wrote:
> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>> Hello,
>>
>> This series adds support for CEPH_OSD_OP_SETALLOCHINT osd op to libceph
>> along with adjusting rbd to make use of it.  The rationale and the
>> basic design was outlined in the "rados io hints" thread on ceph-devel
>> about a month ago.
>
> It looks like some of this discussion is still active, so it
> might be a little premature to implement this.  That being said,
> I reviewed what you have.  Overall it looks good to me.  I have
> some small suggestions for you to consider.  I also think that
> in this case just extending the definition of a write request
> is preferable to encoding this stuff in a distinct new op, but
> I may be missing some information.

Hi Alex,

Thanks for the review!  I've incorporated most of your suggestions, see
my replies for details.  I'll post v2 series shortly, feel free to send
your Reviewed-by for 6/6 any time ;)

Thanks,

                Ilya


* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-24 14:59   ` Alex Elder
@ 2014-02-25 12:52     ` Ilya Dryomov
  2014-02-25 13:05       ` Alex Elder
  2014-02-25 17:12       ` Sage Weil
  0 siblings, 2 replies; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-25 12:52 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@ieee.org> wrote:
> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>> This is primarily for rbd's benefit and is supposed to combat
>> fragmentation:
>>
>> "... knowing that rbd images have a 4m size, librbd can pass a hint
>> that will let the osd do the xfs allocation size ioctl on new files so
>> that they are allocated in 1m or 4m chunks.  We've seen cases where
>> users with rbd workloads have very high levels of fragmentation in xfs
>> and this would mitigate that and probably have a pretty nice
>> performance benefit."
>>
>> SETALLOCHINT is considered advisory, so our backwards compatibility
>> mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
>>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>
> I have a few small comments for you to consider but this
> looks good to me.
>
> Reviewed-by: Alex Elder <elder@linaro.org>
>
>> ---
>>  include/linux/ceph/osd_client.h |    9 +++++++++
>>  include/linux/ceph/rados.h      |    8 ++++++++
>>  net/ceph/osd_client.c           |   30 ++++++++++++++++++++++++++++++
>>  3 files changed, 47 insertions(+)
>>
>> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
>> index e94f5da251d6..6bfcb0eca8ab 100644
>> --- a/include/linux/ceph/osd_client.h
>> +++ b/include/linux/ceph/osd_client.h
>> @@ -103,6 +103,11 @@ struct ceph_osd_req_op {
>>                       u32 timeout;
>>                       __u8 flag;
>>               } watch;
>> +             struct {
>> +                     u64 expected_size;
>
> Maybe "expected_object_size"?  I wasn't sure until I read some
> of the e-mail discussion what size this was referring to, and
> wondered whether it was for reads or something.  This should
> be done on the over-the-wire and server side structure
> definitions too (if it's done at all).

Done, on both sides.

>
> The other thing is that the expected size is limited by
> rbd_image_header->obj_order, which is a single byte.  I
> think you should encode this the same way.  Even if the
> hint were for more than RBD, this level of granularity
> may be sufficient.
>
>> +                     u64 expected_write_size;
>
> Probably the same thing here, a byte might be enough
> to encode this by using log2(expected_write_size).
>
>> +                     __u8 expected_size_probability;

I think in the interest of genericity expected_object_size should be an
arbitrary, not limited to a power-of-2 value.  Now, given the current
90M object size limit 64 bits might seem a lot, but extent offset and
length are 64 bit values and to be future-proof I followed that here.

expected_size_probability is currently unused.  It was supposed to be
a 0-100 value, indicating whether we expect the object to be smaller
than expected_object_size or not and how strong that expectation is.

I'm not sure if you've seen it, but this (both userspace and kernel
sides) were implemented according to the design laid out by Sage in the
"rados io hints" thread on ceph-devel about a month ago.  There wasn't
any discussion that I'm aware of in the interim, that is until the
recent "wip-hint" thread, which was triggered by my PR.

>
> I'm not sure why these single-byte values use __u8 while the
> multi-byte values use (e.g.) u32.  The leading underscores
> are supposed to be used for values exposed to user space (or
> something like that).  Anyway, not a problem with your change,
> but something maybe you could check into.

I'm not sure either.  I vaguely assumed it's related to the fact that
single-byte ceph_osd_req_op fields are directly assigned to the
corresponding ceph_osd_op fields which are exported to userspace,
whereas the multi-byte values go through the endianness conversion
macros, which change the type to __le*.
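
To make the distinction concrete, here is a userspace sketch (the helper
name is made up for illustration; the kernel side uses cpu_to_le64() and
friends from byteorder.h): a multi-byte value has to be serialized to
little-endian explicitly, whereas a single byte has no byte order and
can be assigned to the wire struct directly:

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for cpu_to_le64(): emit v in little-endian byte
 * order regardless of host endianness.  A __u8 field such as
 * expected_size_probability needs no such conversion, which is
 * presumably why it skips the endianness macros and the __le* types. */
static void put_le64(uint8_t *buf, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		buf[i] = (uint8_t)(v >> (8 * i));
}
```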

>
>> +             } hint;
>>       };
>>  };
>>
>> @@ -294,6 +299,10 @@ extern void osd_req_op_cls_init(struct ceph_osd_request *osd_req,
>>  extern void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
>>                                       unsigned int which, u16 opcode,
>>                                       u64 cookie, u64 version, int flag);
>> +extern void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
>> +                              unsigned int which, u16 opcode,
>> +                              u64 expected_size, u64 expected_write_size,
>> +                              u8 expected_size_probability);
>>
>>  extern struct ceph_osd_request *ceph_osdc_alloc_request(struct ceph_osd_client *osdc,
>>                                              struct ceph_snap_context *snapc,
>> diff --git a/include/linux/ceph/rados.h b/include/linux/ceph/rados.h
>> index 8f9bf4570215..b8e2dd11f186 100644
>> --- a/include/linux/ceph/rados.h
>> +++ b/include/linux/ceph/rados.h
>> @@ -227,6 +227,9 @@ enum {
>>       CEPH_OSD_OP_OMAPRMKEYS    = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 24,
>>       CEPH_OSD_OP_OMAP_CMP      = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_DATA | 25,
>>
>> +     /* hints */
>> +     CEPH_OSD_OP_SETALLOCHINT = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_DATA | 35,
>> +
>>       /** multi **/
>>       CEPH_OSD_OP_CLONERANGE = CEPH_OSD_OP_MODE_WR | CEPH_OSD_OP_TYPE_MULTI | 1,
>>       CEPH_OSD_OP_ASSERT_SRC_VERSION = CEPH_OSD_OP_MODE_RD | CEPH_OSD_OP_TYPE_MULTI | 2,
>> @@ -416,6 +419,11 @@ struct ceph_osd_op {
>>                       __le64 offset, length;
>>                       __le64 src_offset;
>>               } __attribute__ ((packed)) clonerange;
>> +             struct {
>> +                     __le64 expected_size;
>> +                     __le64 expected_write_size;
>> +                     __u8 expected_size_probability;
>> +             } __attribute__ ((packed)) hint;
>>       };
>>       __le32 payload_len;
>>  } __attribute__ ((packed));
>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>> index 5d7fd0b8c1c8..4090f6e8db3a 100644
>> --- a/net/ceph/osd_client.c
>> +++ b/net/ceph/osd_client.c
>> @@ -436,6 +436,7 @@ static bool osd_req_opcode_valid(u16 opcode)
>>       case CEPH_OSD_OP_OMAPCLEAR:
>>       case CEPH_OSD_OP_OMAPRMKEYS:
>>       case CEPH_OSD_OP_OMAP_CMP:
>> +     case CEPH_OSD_OP_SETALLOCHINT:
>>       case CEPH_OSD_OP_CLONERANGE:
>>       case CEPH_OSD_OP_ASSERT_SRC_VERSION:
>>       case CEPH_OSD_OP_SRC_CMPXATTR:
>> @@ -591,6 +592,28 @@ void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
>>  }
>>  EXPORT_SYMBOL(osd_req_op_watch_init);
>>
>> +void osd_req_op_hint_init(struct ceph_osd_request *osd_req,
>> +                       unsigned int which, u16 opcode,
>> +                       u64 expected_size, u64 expected_write_size,
>> +                       u8 expected_size_probability)
>> +{
>> +     struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
>> +
>> +     BUG_ON(opcode != CEPH_OSD_OP_SETALLOCHINT);
>> +
>> +     op->hint.expected_size = expected_size;
>> +     op->hint.expected_write_size = expected_write_size;
>> +     op->hint.expected_size_probability = expected_size_probability;
>> +
>> +     /*
>> +      * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
>> +      * not worth a feature bit.  Set FAILOK per-op flag to make
>> +      * sure older osds don't trip over an unsupported opcode.
>> +      */
>> +     op->flags |= CEPH_OSD_OP_FLAG_FAILOK;
>
> I think this is reasonable.  The only thing I wonder about is
> whether there could be any other failure that could occur on
> an OSD server that actually supports the op that we would care
> about.  It's still advisory though, so functionally it won't
> matter.

Currently there isn't any such failure.  In fact, the only thing that
this FAILOK hides is the EOPNOTSUPP from an OSD that doesn't support
the new op.  Everything else is done on the backend, and there all
errors are explicitly ignored because this is an advisory op and it
would bring down the OSD otherwise.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback()
  2014-02-24 14:59   ` Alex Elder
@ 2014-02-25 12:53     ` Ilya Dryomov
  0 siblings, 0 replies; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-25 12:53 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@ieee.org> wrote:
> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>> CEPH_OSD_MAX_OP value in rbd_osd_req_callback() is hard-coded to 2.
>> Fix it.
>
> Please squash this in with the previous patch (at least).
> Change the BUG_ON() to rbd_assert() while you're at it
> (and invert the logic).

Done.

Thanks,

                Ilya


* Re: [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-24 14:59   ` Alex Elder
@ 2014-02-25 12:58     ` Ilya Dryomov
  2014-02-25 13:19       ` Alex Elder
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-25 12:58 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@ieee.org> wrote:
> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>> In an effort to reduce fragmentation, prefix every rbd write with
>> a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
>> to the object size (1 << order).  Backwards compatibility is taken care
>> of on the libceph/osd side.
>
> If *every* write will include a hint, why even encode this as
> a distinct opcode?  Why not just extend the definition of a
> write operation to include the write hint data?  The server
> side could check expected_object_size, and if 0 (or some other
> invalid value) it means the client isn't supplying a hint.
>
> However, on the assumption you want this to be a distinct
> OSD op I think you generally did the right thing.  See my
> comments below.  For now I'm not indicating "Reviewed-by"
> because it sounds like the nature of this change is under
> discussion still.  And I really do think that if the hint
> is not going to be made more generic (and possibly even if
> it is) I'd rather see this hinting done using an extension
> of the write operation (like I suggest above).  In this
> case it is clearly directly tied to every write operation
> and separating it sort of obscures that.

Yes, the assumption is that we want to do this in a separate op.  The
hint is durable, in that it's enough to do it once, so it doesn't make
much sense to fold it into the write op(s).  The reason every rbd write
is prefixed is that rbd doesn't explicitly create objects and relies on
writes creating them implicitly, so there is no place to stick a single
hint op into.  To get around that we decided to prefix every rbd write
with a hint (just like write and setattr ops, hint op will create an
object implicitly if it doesn't exist).

I'll add the above paragraph to the commit message.

>
>                                         -Alex
>
>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>> ---
>>  drivers/block/rbd.c |   66 ++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 49 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 6cf001ef00bc..14496f39c770 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -1662,7 +1662,12 @@ static void rbd_osd_req_callback(struct ceph_osd_request *osd_req,
>>        */
>>       obj_request->xferred = osd_req->r_reply_op_len[0];
>>       rbd_assert(obj_request->xferred < (u64)UINT_MAX);
>> +
>>       opcode = osd_req->r_ops[0].op;
>> +     if (opcode == CEPH_OSD_OP_SETALLOCHINT) {
>> +             BUG_ON(osd_req->r_ops[1].op != CEPH_OSD_OP_WRITE);
>> +             opcode = CEPH_OSD_OP_WRITE;
>> +     }
>
> I would rather see this handled as a distinct case below
> rather than fudging the opcode.  Bury that BUG_ON() inside
> inside rbd_osd_write_callback() logic to avoid encoding those
> op-specific details here.  And change the BUG_ON() to rbd_assert().

It can't be buried inside rbd_osd_write_callback() because the latter
takes obj_request, while opcode is a property of the osd_request.
I changed it to the following though:

    ...
    case CEPH_OSD_OP_SETALLOCHINT:
        rbd_assert(osd_req->r_ops[1].op == CEPH_OSD_OP_WRITE);
        /* fall through */
    case CEPH_OSD_OP_WRITE:
    ...

>
>>       switch (opcode) {
>>       case CEPH_OSD_OP_READ:
>>               rbd_osd_read_callback(obj_request);
>> @@ -1715,6 +1720,12 @@ static void rbd_osd_req_format_write(struct rbd_obj_request *obj_request)
>>                       snapc, CEPH_NOSNAP, &mtime);
>>  }
>>
>> +/*
>> + * Create an osd request.  A read request has one osd op (read).
>> + * A write request has either one (watch) or two (hint+write) osd ops.
>> + * (All rbd writes are prefixed with an allocation hint op, but
>
> Maybe "rbd data writes" or something.  A watch changes state but
> doesn't write data.

Done.

>
>> + * technically osd watch is a write request, hence this distinction.)
>> + */
>>  static struct ceph_osd_request *rbd_osd_req_create(
>>                                       struct rbd_device *rbd_dev,
>>                                       bool write_request,
>> @@ -1734,7 +1745,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
>>                       snapc = img_request->snapc;
>>       }
>>
>> -     rbd_assert(num_ops == 1);
>> +     rbd_assert((!write_request && num_ops == 1) ||
>> +                (write_request && num_ops >= 1 && num_ops <= 2));
>
> Maybe:
>         rbd_assert(num_ops == 1 || write_request && num_ops == 2);

Done.  This isn't very clear but definitely better than what I had and
not so verbose as

    if (write_request)
        rbd_assert(num_ops == 1 || num_ops == 2);
    else
        rbd_assert(num_ops == 1);

I had to add extra parentheses around && though to silence gcc.

>
>>
>>       /* Allocate and initialize the request, for the num_ops ops */
>>
>> @@ -1760,8 +1772,8 @@ static struct ceph_osd_request *rbd_osd_req_create(
>>
>>  /*
>>   * Create a copyup osd request based on the information in the
>> - * object request supplied.  A copyup request has two osd ops,
>> - * a copyup method call, and a "normal" write request.
>> + * object request supplied.  A copyup request has three osd ops,
>> + * a copyup method call, a hint op, and a write op.
>
> There's a comment in the caller that should be similarly updated:
>          * We need a new one that can hold the two ops in a copyup
>                                                ^^^

It already is, in this patch, below.

>>   */
>>  static struct ceph_osd_request *
>>  rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
>> @@ -1777,12 +1789,12 @@ rbd_osd_req_create_copyup(struct rbd_obj_request *obj_request)
>>       rbd_assert(img_request);
>>       rbd_assert(img_request_write_test(img_request));
>>
>> -     /* Allocate and initialize the request, for the two ops */
>> +     /* Allocate and initialize the request, for the three ops */
>>
>>       snapc = img_request->snapc;
>>       rbd_dev = img_request->rbd_dev;
>>       osdc = &rbd_dev->rbd_client->client->osdc;
>> -     osd_req = ceph_osdc_alloc_request(osdc, snapc, 2, false, GFP_ATOMIC);
>> +     osd_req = ceph_osdc_alloc_request(osdc, snapc, 3, false, GFP_ATOMIC);
>>       if (!osd_req)
>>               return NULL;    /* ENOMEM */
>>
>> @@ -2159,12 +2171,10 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>       struct page **pages = NULL;
>>       u64 img_offset;
>>       u64 resid;
>> -     u16 opcode;
>
> Why move opcode inside the inner loop?  It's not going
> to change, is it?  (Maybe this was a leftover from an
> abandoned attempt to do this a different way.)

Probably, done.

>
>>
>>       dout("%s: img %p type %d data_desc %p\n", __func__, img_request,
>>               (int)type, data_desc);
>>
>> -     opcode = write_request ? CEPH_OSD_OP_WRITE : CEPH_OSD_OP_READ;
>>       img_offset = img_request->offset;
>>       resid = img_request->length;
>>       rbd_assert(resid > 0);
>> @@ -2183,6 +2193,8 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>               const char *object_name;
>>               u64 offset;
>>               u64 length;
>> +             unsigned int which;
>
>                 unsigned int which = 0;
>
>> +             u16 opcode;
>>
>>               object_name = rbd_segment_name(rbd_dev, img_offset);
>>               if (!object_name)
>> @@ -2224,20 +2236,34 @@ static int rbd_img_request_fill(struct rbd_img_request *img_request,
>>                       pages += page_count;
>>               }
>>
>> -             osd_req = rbd_osd_req_create(rbd_dev, write_request, 1,
>> +             osd_req = rbd_osd_req_create(rbd_dev, write_request,
>> +                                          (write_request ? 2 : 1),
>>                                            obj_request);
>>               if (!osd_req)
>>                       goto out_partial;
>>               obj_request->osd_req = osd_req;
>>               obj_request->callback = rbd_img_obj_callback;
>>
>> -             osd_req_op_extent_init(osd_req, 0, opcode, offset, length,
>> -                                             0, 0);
>> +             if (write_request) {
>
>
>
>> +                     osd_req_op_hint_init(osd_req, 0,
>> +                                          CEPH_OSD_OP_SETALLOCHINT,
>> +                                          rbd_obj_bytes(&rbd_dev->header),
>> +                                          rbd_obj_bytes(&rbd_dev->header),
>> +                                          0);
>> +
>> +                     which = 1;
>                         which++;
>
>> +                     opcode = CEPH_OSD_OP_WRITE;
>> +             } else {
>
> And this block goes away (assign opcode outside the loop).

Done.

>
>> +                     which = 0;
>> +                     opcode = CEPH_OSD_OP_READ;
>> +             }
>> +             osd_req_op_extent_init(osd_req, which, opcode, offset, length,
>> +                                    0, 0);
>>               if (type == OBJ_REQUEST_BIO)
>> -                     osd_req_op_extent_osd_data_bio(osd_req, 0,
>> +                     osd_req_op_extent_osd_data_bio(osd_req, which,
>>                                       obj_request->bio_list, length);
>>               else
>> -                     osd_req_op_extent_osd_data_pages(osd_req, 0,
>> +                     osd_req_op_extent_osd_data_pages(osd_req, which,
>>                                       obj_request->pages, length,
>>                                       offset & ~PAGE_MASK, false, false);
>>
>> @@ -2358,7 +2384,7 @@ rbd_img_obj_parent_read_full_callback(struct rbd_img_request *img_request)
>>
>>       /*
>>        * The original osd request is of no use to use any more.
>> -      * We need a new one that can hold the two ops in a copyup
>> +      * We need a new one that can hold the three ops in a copyup

Here ^^ (re: comment).

Thanks,

                Ilya


* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-25 12:52     ` Ilya Dryomov
@ 2014-02-25 13:05       ` Alex Elder
  2014-02-25 13:38         ` Ilya Dryomov
  2014-02-25 17:12       ` Sage Weil
  1 sibling, 1 reply; 26+ messages in thread
From: Alex Elder @ 2014-02-25 13:05 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

On 02/25/2014 06:52 AM, Ilya Dryomov wrote:
> On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@ieee.org> wrote:
>> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>>> This is primarily for rbd's benefit and is supposed to combat
>>> fragmentation:
>>>
>>> "... knowing that rbd images have a 4m size, librbd can pass a hint
>>> that will let the osd do the xfs allocation size ioctl on new files so
>>> that they are allocated in 1m or 4m chunks.  We've seen cases where
>>> users with rbd workloads have very high levels of fragmentation in xfs
>>> and this would mitigate that and probably have a pretty nice
>>> performance benefit."
>>>
>>> SETALLOCHINT is considered advisory, so our backwards compatibility
>>> mechanism here is to set FAILOK flag for all SETALLOCHINT ops.
>>>
>>> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
>>
>> I have a few small comments for you to consider but this
>> looks good to me.
>>
>> Reviewed-by: Alex Elder <elder@linaro.org>

. . .

>> The other thing is that the expected size is limited by
>> rbd_image_header->obj_order, which is a single byte.  I
>> think you should encode this the same way.  Even if the
>> hint were for more than RBD, this level of granularity
>> may be sufficient.
>>
>>> +                     u64 expected_write_size;
>>
>> Probably the same thing here, a byte might be enough
>> to encode this by using log2(expected_write_size).
>>
>>> +                     __u8 expected_size_probability;
> 
> I think in the interest of genericity expected_object_size should be an
> arbitrary, not limited to a power-of-2 value.  Now, given the current
> 90M object size limit 64 bits might seem a lot, but extent offset and
> length are 64 bit values and to be future-proof I followed that here.

I have no objection to the 64-bit size but I still think
a byte representing log2(size) is sufficient.  Power-of-2
granularity is most likely fine (and might even be what
such a hint gets converted to anyway) for file systems
or other backing store.  But again, you can do that with
a 64 bit value as well.

> expected_size_probability is currently unused.  It was supposed to be
> a 0-100 value, indicating whether we expect the object to be smaller
> than expected_object_size or not and how strong that expectation is.
> 
> I'm not sure if you've seen it, but this (both userspace and kernel
> sides) were implemented according to the design laid out by Sage in the
> "rados io hints" thread on ceph-devel about a month ago.  There wasn't
> any discussion that I'm aware of in the interim, that is until the
> recent "wip-hint" thread, which was triggered by my PR.

I'm aware of it---I saw the discussion go by and skimmed
through it but did not look at it very closely.  I can't
keep up with all the traffic but I do try to pay attention
to code for review.

>> I'm not sure why these single-byte values use __u8 while the
>> multi-byte values use (e.g.) u32.  The leading underscores
>> are supposed to be used for values exposed to user space (or
>> something like that).  Anyway, not a problem with your change,
>> but something maybe you could check into.
> 
> I'm not sure either.  I vaguely assumed it's related to the fact that
> single-byte ceph_osd_req_op fields are directly assigned to the
> corresponding ceph_osd_op fields which are exported to userspace,
> whereas the multi-byte values go through the endianness conversion
> macros, which change the type to __le*.

Not a big deal regardless.

>>> +             } hint;
>>>       };
>>>  };

. . .

>>> +     /*
>>> +      * CEPH_OSD_OP_SETALLOCHINT op is advisory and therefore deemed
>>> +      * not worth a feature bit.  Set FAILOK per-op flag to make
>>> +      * sure older osds don't trip over an unsupported opcode.
>>> +      */
>>> +     op->flags |= CEPH_OSD_OP_FLAG_FAILOK;
>>
>> I think this is reasonable.  The only thing I wonder about is
>> whether there could be any other failure that could occur on
>> an OSD server that actually supports the op that we would care
>> about.  It's still advisory though, so functionally it won't
>> matter.
> 
> Currently there isn't any such failure.  In fact, the only thing that
> this FAILOK hides is the EOPNOTSUPP from an OSD that doesn't support
> the new op.  Everything else is done on the backend, and there all
> errors are explicitly ignored because this is an advisory op and it
> would bring down the OSD otherwise.

Sounds good.  Thanks for the explanations.

					-Alex



* Re: [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-25 12:58     ` Ilya Dryomov
@ 2014-02-25 13:19       ` Alex Elder
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Elder @ 2014-02-25 13:19 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

On 02/25/2014 06:58 AM, Ilya Dryomov wrote:
> On Mon, Feb 24, 2014 at 4:59 PM, Alex Elder <elder@ieee.org> wrote:
>> On 02/21/2014 12:55 PM, Ilya Dryomov wrote:
>>> In an effort to reduce fragmentation, prefix every rbd write with
>>> a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
>>> to the object size (1 << order).  Backwards compatibility is taken care
>>> of on the libceph/osd side.
>>
>> If *every* write will include a hint, why even encode this as
>> a distinct opcode?  Why not just extend the definition of a
>> write operation to include the write hint data?  The server
>> side could check expected_object_size, and if 0 (or some other
>> invalid value) it means the client isn't supplying a hint.
>>
>> However, on the assumption you want this to be a distinct
>> OSD op I think you generally did the right thing.  See my
>> comments below.  For now I'm not indicating "Reviewed-by"
>> because it sounds like the nature of this change is under
>> discussion still.  And I really do think that if the hint
>> is not going to be made more generic (and possibly even if
>> it is) I'd rather see this hinting done using an extension
>> of the write operation (like I suggest above).  In this
>> case it is clearly directly tied to every write operation
>> and separating it sort of obscures that.
> 
> Yes, the assumption is that we want to do this in a separate op.  The
> hint is durable, in that it's enough to do it once, so it doesn't make
> much sense to fold it into the write op(s).  The reason every rbd write
> is prefixed is that rbd doesn't explicitly create objects and relies on
> writes creating them implicitly, so there is no place to stick a single
> hint op into.  To get around that we decided to prefix every rbd write
> with a hint (just like write and setattr ops, hint op will create an
> object implicitly if it doesn't exist).

I was thinking primarily in the RBD context and not about the OSD
write operation more generally, I guess.  I suspected it was durable
and knew why it still needs to be attached to every rbd write.

On a separate note, it seems to me we've discussed how one
could maintain a bitmap of created (known to be written) RBD
objects for an image, which could be used for layered images
to avoid the separate parent read request.  If such a thing
ever got implemented it could be used to skip the hint as well.

> I'll add the above paragraph to the commit message.

Everything else you incorporated or explained, so this looks
good to me.

Reviewed-by: Alex Elder <elder@linaro.org>



* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-25 13:05       ` Alex Elder
@ 2014-02-25 13:38         ` Ilya Dryomov
  2014-02-25 17:12           ` Sage Weil
  0 siblings, 1 reply; 26+ messages in thread
From: Ilya Dryomov @ 2014-02-25 13:38 UTC (permalink / raw)
  To: Alex Elder; +Cc: Ceph Development

On Tue, Feb 25, 2014 at 3:05 PM, Alex Elder <elder@ieee.org> wrote:
>>> The other thing is that the expected size is limited by
>>> rbd_image_header->obj_order, which is a single byte.  I
>>> think you should encode this the same way.  Even if the
>>> hint were for more than RBD, this level of granularity
>>> may be sufficient.
>>>
>>>> +                     u64 expected_write_size;
>>>
>>> Probably the same thing here, a byte might be enough
>>> to encode this by using log2(expected_write_size).
>>>
>>>> +                     __u8 expected_size_probability;
>>
>> I think in the interest of genericity expected_object_size should be an
>> arbitrary, not limited to a power-of-2 value.  Now, given the current
>> 90M object size limit 64 bits might seem a lot, but extent offset and
>> length are 64 bit values and to be future-proof I followed that here.
>
> I have no objection to the 64-bit size but I still think
> a byte representing log2(size) is sufficient.  Power-of-2
> granularity is most likely fine (and might even be what
> such a hint gets converted to anyway) for file systems
> or other backing store.  But again, you can do that with
> a 64 bit value as well.

Filesystems of course round it, but probably not to a power-of-2.
I guess it's adjusted to a multiple of the block size and then capped
at some value the allocator can cope with.  xfs, for example, would
happily take, say, 5M.  A power-of-2 is sufficient for rbd, but
_probably_ not in general.
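
For what it's worth, here is a sketch of the byte-sized log2 encoding
you are suggesting (hypothetical helpers, nothing from the tree): rbd's
power-of-2 object sizes round-trip exactly, but an arbitrary hint like
5M would silently degrade:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-byte hint encoding: store floor(log2(size)).
 * Exact for power-of-2 sizes, lossy for anything else. */
static uint8_t hint_encode_log2(uint64_t size)
{
	uint8_t order = 0;

	while (order < 63 && (UINT64_C(1) << (order + 1)) <= size)
		order++;
	return order;
}

static uint64_t hint_decode_log2(uint8_t order)
{
	return UINT64_C(1) << order;
}
```

Decoding 5M's encoding gives back 4M, which is exactly the kind of
rounding being discussed above.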

Thanks,

                Ilya


* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-25 12:52     ` Ilya Dryomov
  2014-02-25 13:05       ` Alex Elder
@ 2014-02-25 17:12       ` Sage Weil
  1 sibling, 0 replies; 26+ messages in thread
From: Sage Weil @ 2014-02-25 17:12 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alex Elder, Ceph Development

On Tue, 25 Feb 2014, Ilya Dryomov wrote:
> >> +                     u64 expected_write_size;
> >
> > Probably the same thing here, a byte might be enough
> > to encode this by using log2(expected_write_size).
> >
> >> +                     __u8 expected_size_probability;
> 
> I think in the interest of genericity expected_object_size should be an
> arbitrary, not limited to a power-of-2 value.  Now, given the current
> 90M object size limit 64 bits might seem a lot, but extent offset and
> length are 64 bit values and to be future-proof I followed that here.
> 
> expected_size_probability is currently unused.  It was supposed to be
> a 0-100 value, indicating whether we expect the object to be smaller
> than expected_object_size or not and how strong that expectation is.
> 
> I'm not sure if you've seen it, but this (both userspace and kernel
> sides) were implemented according to the design laid out by Sage in the
> "rados io hints" thread on ceph-devel about a month ago.  There wasn't
> any discussion that I'm aware of in the interim, that is until the
> recent "wip-hint" thread, which was triggered by my PR.
> 
> >
> > I'm not sure why these single-byte values use __u8 while the
> > multi-byte values use (e.g.) u32.  The leading underscores
> > are supposed to be used for values exposed to user space (or
> > something like that).  Anyway, not a problem with your change,
> > but something maybe you could check into.
> 
> I'm not sure either.  I vaguely assumed it's related to the fact that
> single-byte ceph_osd_req_op fields are directly assigned to the
> corresponding ceph_osd_op fields which are exported to userspace,
> whereas the multi-byte values go through the endianness conversion
> macros, which change the type to __le*.

Oh, good catch.  Yeah, these need to be __le64 so that on the kernel side 
sparse knows what is going on and so on the userspace side the magic __le* 
classes do the endian conversion.

sage


* Re: [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op
  2014-02-25 13:38         ` Ilya Dryomov
@ 2014-02-25 17:12           ` Sage Weil
  0 siblings, 0 replies; 26+ messages in thread
From: Sage Weil @ 2014-02-25 17:12 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Alex Elder, Ceph Development

On Tue, 25 Feb 2014, Ilya Dryomov wrote:
> On Tue, Feb 25, 2014 at 3:05 PM, Alex Elder <elder@ieee.org> wrote:
> >>> The other thing is that the expected size is limited by
> >>> rbd_image_header->obj_order, which is a single byte.  I
> >>> think you should encode this the same way.  Even if the
> >>> hint were for more than RBD, this level of granularity
> >>> may be sufficient.
> >>>
> >>>> +                     u64 expected_write_size;
> >>>
> >>> Probably the same thing here, a byte might be enough
> >>> to encode this by using log2(expected_write_size).
> >>>
> >>>> +                     __u8 expected_size_probability;
> >>
> >> I think in the interest of genericity expected_object_size should be an
> >> arbitrary, not limited to a power-of-2 value.  Now, given the current
> >> 90M object size limit 64 bits might seem a lot, but extent offset and
> >> length are 64 bit values and to be future-proof I followed that here.
> >
> > I have no objection to the 64-bit size but I still think
> > a byte representing log2(size) is sufficient.  Power-of-2
> > granularity is most likely fine (and might even be what
> > such a hint gets converted to anyway) for file systems
> > or other backing store.  But again, you can do that with
> > a 64 bit value as well.
> 
> Filesystems of course round it, but probably not to a power-of-2.
> I guess it's adjusted to a multiple of the block size and then capped
> at some value the allocator can cope with.  xfs, for example, would
> happily take, say, 5M.  A power-of-2 is sufficient for rbd, but
> _probably_ not in general.

I agree; let's stick with u64 here.

Thanks!
sage


Thread overview: 26+ messages
2014-02-21 18:55 [PATCH 0/6] libceph: CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
2014-02-21 18:55 ` [PATCH 1/6] libceph: encode CEPH_OSD_OP_FLAG_* op flags Ilya Dryomov
2014-02-24 14:58   ` Alex Elder
2014-02-21 18:55 ` [PATCH 2/6] libceph: add support for CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
2014-02-23 16:03   ` Sage Weil
2014-02-24 14:59   ` Alex Elder
2014-02-25 12:52     ` Ilya Dryomov
2014-02-25 13:05       ` Alex Elder
2014-02-25 13:38         ` Ilya Dryomov
2014-02-25 17:12           ` Sage Weil
2014-02-25 17:12       ` Sage Weil
2014-02-21 18:55 ` [PATCH 3/6] libceph: bump CEPH_OSD_MAX_OP to 3 Ilya Dryomov
2014-02-24 14:59   ` Alex Elder
2014-02-21 18:55 ` [PATCH 4/6] rbd: do not hard-code CEPH_OSD_MAX_OP in rbd_osd_req_callback() Ilya Dryomov
2014-02-24 14:59   ` Alex Elder
2014-02-25 12:53     ` Ilya Dryomov
2014-02-21 18:55 ` [PATCH 5/6] rbd: num_ops parameter for rbd_osd_req_create() Ilya Dryomov
2014-02-24 14:59   ` Alex Elder
2014-02-21 18:55 ` [PATCH 6/6] rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op Ilya Dryomov
2014-02-24 14:59   ` Alex Elder
2014-02-25 12:58     ` Ilya Dryomov
2014-02-25 13:19       ` Alex Elder
2014-02-23 16:14 ` [PATCH 0/6] libceph: " Sage Weil
2014-02-23 16:15   ` Alex Elder
2014-02-24 14:58 ` Alex Elder
2014-02-25 12:50   ` Ilya Dryomov
