linux-block.vger.kernel.org archive mirror
* [PATCH 00/15] rbd: switch to blk-mq
@ 2020-01-31 10:37 Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 01/15] rbd: lock object request list Hannes Reinecke
                   ` (14 more replies)
  0 siblings, 15 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Hi all,

this patchset implements multiqueue support for rbd, which gives
a nice performance benefit (I measured up to 25% on my grid).
To achieve this several steps are required:
1) Drop the 'state_mutex' in rbd_img_advance().
   To do so the state machines had to be reordered to ensure
   no race windows are left when invoking asynchronous methods.
2) Embed the image request into the block request to save a memory
   allocation in the hot path.
3) Enable one queue per CPU to enhance parallelism.

I also took the opportunity to clean up the state machines by
adding a 'done' state and ensuring that the state is always set
correctly upon exit. This allows for better debugging, as the
final state must now always be set when an object is destroyed.

As usual, comments and reviews are welcome.

Hannes Reinecke (15):
  rbd: lock object request list
  rbd: use READ_ONCE() when checking the mapping size
  rbd: reorder rbd_img_advance()
  rbd: reorder switch statement in rbd_advance_read()
  rbd: reorder switch statement in rbd_advance_write()
  rbd: add 'done' state for rbd_obj_advance_copyup()
  rbd: use callback for image request completion
  rbd: add debugging statements for the state machine
  rbd: count pending object requests in-line
  rbd: kill 'work_result'
  rbd: drop state_mutex in __rbd_img_handle_request()
  rbd: kill img_request kref
  rbd: schedule image_request after preparation
  rbd: embed image request as blk_mq request payload
  rbd: switch to blk-mq

 drivers/block/rbd.c | 418 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 240 insertions(+), 178 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 01/15] rbd: lock object request list
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03 16:38   ` Ilya Dryomov
  2020-01-31 10:37 ` [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size Hannes Reinecke
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

The object request list can be accessed from various contexts
so we need to lock it to avoid concurrent modifications and
random crashes.
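
For illustration, a minimal sketch of the intended usage pattern
(walk_objects() and process_one() are made-up stand-ins, not real
driver functions):

static int walk_objects(struct rbd_img_request *img_req)
{
        struct rbd_obj_request *obj_req, *next_obj_req;
        int ret = 0;

        /* any traversal that may add or delete entries holds the mutex */
        mutex_lock(&img_req->object_mutex);
        for_each_obj_request_safe(img_req, obj_req, next_obj_req) {
                ret = process_one(obj_req);
                if (ret < 0)
                        break;  /* drop the mutex on the error path, too */
        }
        mutex_unlock(&img_req->object_mutex);
        return ret;
}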

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 5710b2a8609c..db80b964d8ea 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -344,6 +344,7 @@ struct rbd_img_request {
 
 	struct list_head	lock_item;
 	struct list_head	object_extents;	/* obj_req.ex structs */
+	struct mutex		object_mutex;
 
 	struct mutex		state_mutex;
 	struct pending_result	pending;
@@ -1664,6 +1665,7 @@ static struct rbd_img_request *rbd_img_request_create(
 	INIT_LIST_HEAD(&img_request->lock_item);
 	INIT_LIST_HEAD(&img_request->object_extents);
 	mutex_init(&img_request->state_mutex);
+	mutex_init(&img_request->object_mutex);
 	kref_init(&img_request->kref);
 
 	return img_request;
@@ -1680,8 +1682,10 @@ static void rbd_img_request_destroy(struct kref *kref)
 	dout("%s: img %p\n", __func__, img_request);
 
 	WARN_ON(!list_empty(&img_request->lock_item));
+	mutex_lock(&img_request->object_mutex);
 	for_each_obj_request_safe(img_request, obj_request, next_obj_request)
 		rbd_img_obj_request_del(img_request, obj_request);
+	mutex_unlock(&img_request->object_mutex);
 
 	if (img_request_layered_test(img_request)) {
 		img_request_layered_clear(img_request);
@@ -2486,6 +2490,7 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
 	struct rbd_obj_request *obj_req, *next_obj_req;
 	int ret;
 
+	mutex_lock(&img_req->object_mutex);
 	for_each_obj_request_safe(img_req, obj_req, next_obj_req) {
 		switch (img_req->op_type) {
 		case OBJ_OP_READ:
@@ -2503,14 +2508,16 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
 		default:
 			BUG();
 		}
-		if (ret < 0)
+		if (ret < 0) {
+			mutex_unlock(&img_req->object_mutex);
 			return ret;
+		}
 		if (ret > 0) {
 			rbd_img_obj_request_del(img_req, obj_req);
 			continue;
 		}
 	}
-
+	mutex_unlock(&img_req->object_mutex);
 	img_req->state = RBD_IMG_START;
 	return 0;
 }
@@ -2569,6 +2576,7 @@ static int rbd_img_fill_request_nocopy(struct rbd_img_request *img_req,
 	 * position in the provided bio (list) or bio_vec array.
 	 */
 	fctx->iter = *fctx->pos;
+	mutex_lock(&img_req->object_mutex);
 	for (i = 0; i < num_img_extents; i++) {
 		ret = ceph_file_to_extents(&img_req->rbd_dev->layout,
 					   img_extents[i].fe_off,
@@ -2576,10 +2584,12 @@ static int rbd_img_fill_request_nocopy(struct rbd_img_request *img_req,
 					   &img_req->object_extents,
 					   alloc_object_extent, img_req,
 					   fctx->set_pos_fn, &fctx->iter);
-		if (ret)
+		if (ret) {
+			mutex_unlock(&img_req->object_mutex);
 			return ret;
+		}
 	}
-
+	mutex_unlock(&img_req->object_mutex);
 	return __rbd_img_fill_request(img_req);
 }
 
@@ -2620,6 +2630,7 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
 	 * or bio_vec array because when mapped, those bio_vecs can straddle
 	 * stripe unit boundaries.
 	 */
+	mutex_lock(&img_req->object_mutex);
 	fctx->iter = *fctx->pos;
 	for (i = 0; i < num_img_extents; i++) {
 		ret = ceph_file_to_extents(&rbd_dev->layout,
@@ -2629,15 +2640,17 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
 					   alloc_object_extent, img_req,
 					   fctx->count_fn, &fctx->iter);
 		if (ret)
-			return ret;
+			goto out_unlock;
 	}
 
 	for_each_obj_request(img_req, obj_req) {
 		obj_req->bvec_pos.bvecs = kmalloc_array(obj_req->bvec_count,
 					      sizeof(*obj_req->bvec_pos.bvecs),
 					      GFP_NOIO);
-		if (!obj_req->bvec_pos.bvecs)
-			return -ENOMEM;
+		if (!obj_req->bvec_pos.bvecs) {
+			ret = -ENOMEM;
+			goto out_unlock;
+		}
 	}
 
 	/*
@@ -2652,10 +2665,14 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
 					   &img_req->object_extents,
 					   fctx->copy_fn, &fctx->iter);
 		if (ret)
-			return ret;
+			goto out_unlock;
 	}
+	mutex_unlock(&img_req->object_mutex);
 
 	return __rbd_img_fill_request(img_req);
+out_unlock:
+	mutex_unlock(&img_req->object_mutex);
+	return ret;
 }
 
 static int rbd_img_fill_nodata(struct rbd_img_request *img_req,
@@ -3552,18 +3569,21 @@ static void rbd_img_object_requests(struct rbd_img_request *img_req)
 
 	rbd_assert(!img_req->pending.result && !img_req->pending.num_pending);
 
+	mutex_lock(&img_req->object_mutex);
 	for_each_obj_request(img_req, obj_req) {
 		int result = 0;
 
 		if (__rbd_obj_handle_request(obj_req, &result)) {
 			if (result) {
 				img_req->pending.result = result;
+				mutex_unlock(&img_req->object_mutex);
 				return;
 			}
 		} else {
 			img_req->pending.num_pending++;
 		}
 	}
+	mutex_unlock(&img_req->object_mutex);
 }
 
 static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 01/15] rbd: lock object request list Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03 16:50   ` Ilya Dryomov
  2020-01-31 10:37 ` [PATCH 03/15] rbd: reorder rbd_img_advance() Hannes Reinecke
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

The mapping size changes only very infrequently, so we don't
need to take header_rwsem just to read it; using READ_ONCE()
is sufficient here and avoids taking the semaphore in the
hot path.
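
The resulting pattern, as a simplified sketch (reader and writer sides
only, not the complete functions):

        /* hot path: a single load, no header_rwsem needed */
        mapping_size = READ_ONCE(rbd_dev->mapping.size);

        /* refresh path: updates stay serialized by header_rwsem */
        down_write(&rbd_dev->header_rwsem);
        WRITE_ONCE(rbd_dev->mapping.size, rbd_dev->header.image_size);
        up_write(&rbd_dev->header_rwsem);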

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index db80b964d8ea..792180548e89 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4788,13 +4788,13 @@ static void rbd_queue_workfn(struct work_struct *work)
 
 	blk_mq_start_request(rq);
 
-	down_read(&rbd_dev->header_rwsem);
-	mapping_size = rbd_dev->mapping.size;
+	mapping_size = READ_ONCE(rbd_dev->mapping.size);
 	if (op_type != OBJ_OP_READ) {
+		down_read(&rbd_dev->header_rwsem);
 		snapc = rbd_dev->header.snapc;
 		ceph_get_snap_context(snapc);
+		up_read(&rbd_dev->header_rwsem);
 	}
-	up_read(&rbd_dev->header_rwsem);
 
 	if (offset + length > mapping_size) {
 		rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)", offset,
@@ -4981,9 +4981,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	u64 mapping_size;
 	int ret;
 
-	down_write(&rbd_dev->header_rwsem);
-	mapping_size = rbd_dev->mapping.size;
+	mapping_size = READ_ONCE(rbd_dev->mapping.size);
 
+	down_write(&rbd_dev->header_rwsem);
 	ret = rbd_dev_header_info(rbd_dev);
 	if (ret)
 		goto out;
@@ -4999,7 +4999,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	}
 
 	rbd_assert(!rbd_is_snap(rbd_dev));
-	rbd_dev->mapping.size = rbd_dev->header.image_size;
+	WRITE_ONCE(rbd_dev->mapping.size, rbd_dev->header.image_size);
 
 out:
 	up_write(&rbd_dev->header_rwsem);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 03/15] rbd: reorder rbd_img_advance()
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 01/15] rbd: lock object request list Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 04/15] rbd: reorder switch statement in rbd_advance_read() Hannes Reinecke
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Reorder switch statement to avoid the use of a label/goto and add
an RBD_IMG_DONE state to signal that the state machine has completed.
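
The same convention is used for the object state machines in the
following patches. As a generic sketch (illustrative only; none of
these names are real driver symbols): the next state is set before the
asynchronous step is issued, error paths park the machine in the final
state, and a synchronous completion falls through to the next case:

enum sm_state { SM_DONE, SM_START, SM_WAIT };   /* DONE first, like RBD_IMG_DONE */
struct sm { enum sm_state state; };

static bool advance(struct sm *sm, int *result)
{
        int ret;

        switch (sm->state) {
        case SM_START:
                sm->state = SM_WAIT;            /* set next state up front */
                ret = start_async_step(sm);     /* <0 error, 0 async, >0 done */
                if (ret < 0) {
                        *result = ret;
                        sm->state = SM_DONE;    /* error: park in final state */
                        return true;
                }
                if (ret == 0)
                        return false;           /* wait for async completion */
                /* fall through */
        case SM_WAIT:
                sm->state = SM_DONE;            /* *result carries the outcome */
                /* fall through */
        case SM_DONE:
                return true;                    /* terminal state */
        default:
                BUG();
        }
}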

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 792180548e89..c80942e08164 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -321,9 +321,9 @@ enum img_req_flags {
 };
 
 enum rbd_img_state {
-	RBD_IMG_START = 1,
+	RBD_IMG_DONE,
+	RBD_IMG_START,
 	RBD_IMG_EXCLUSIVE_LOCK,
-	__RBD_IMG_OBJECT_REQUESTS,
 	RBD_IMG_OBJECT_REQUESTS,
 };
 
@@ -3591,40 +3591,44 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
 	struct rbd_device *rbd_dev = img_req->rbd_dev;
 	int ret;
 
-again:
+	dout("%s: img %p state %d\n", __func__, img_req, img_req->state);
 	switch (img_req->state) {
 	case RBD_IMG_START:
 		rbd_assert(!*result);
 
+		img_req->state = RBD_IMG_EXCLUSIVE_LOCK;
 		ret = rbd_img_exclusive_lock(img_req);
 		if (ret < 0) {
 			*result = ret;
+			img_req->state = RBD_IMG_DONE;
 			return true;
 		}
-		img_req->state = RBD_IMG_EXCLUSIVE_LOCK;
-		if (ret > 0)
-			goto again;
-		return false;
+		if (ret == 0)
+			return false;
+		/* fall through */
 	case RBD_IMG_EXCLUSIVE_LOCK:
-		if (*result)
+		if (*result) {
+			img_req->state = RBD_IMG_DONE;
 			return true;
+		}
 
 		rbd_assert(!need_exclusive_lock(img_req) ||
 			   __rbd_is_lock_owner(rbd_dev));
 
+		img_req->state = RBD_IMG_OBJECT_REQUESTS;
 		rbd_img_object_requests(img_req);
 		if (!img_req->pending.num_pending) {
 			*result = img_req->pending.result;
-			img_req->state = RBD_IMG_OBJECT_REQUESTS;
-			goto again;
+			img_req->state = RBD_IMG_DONE;
+			return true;
 		}
-		img_req->state = __RBD_IMG_OBJECT_REQUESTS;
 		return false;
-	case __RBD_IMG_OBJECT_REQUESTS:
+	case RBD_IMG_OBJECT_REQUESTS:
 		if (!pending_result_dec(&img_req->pending, result))
 			return false;
+		img_req->state = RBD_IMG_DONE;
 		/* fall through */
-	case RBD_IMG_OBJECT_REQUESTS:
+	case RBD_IMG_DONE:
 		return true;
 	default:
 		BUG();
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 04/15] rbd: reorder switch statement in rbd_advance_read()
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (2 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 03/15] rbd: reorder rbd_img_advance() Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 05/15] rbd: reorder switch statement in rbd_advance_write() Hannes Reinecke
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Reorder switch statement to avoid the use of a label/goto
statement and add a 'done' state to indicate that the state
machine has completed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c80942e08164..4d7857667e9c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -235,7 +235,8 @@ enum obj_operation_type {
 #define RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT	(1U << 4)
 
 enum rbd_obj_read_state {
-	RBD_OBJ_READ_START = 1,
+	RBD_OBJ_READ_DONE,
+	RBD_OBJ_READ_START,
 	RBD_OBJ_READ_OBJECT,
 	RBD_OBJ_READ_PARENT,
 };
@@ -2924,36 +2925,38 @@ static bool rbd_obj_advance_read(struct rbd_obj_request *obj_req, int *result)
 	struct rbd_device *rbd_dev = obj_req->img_request->rbd_dev;
 	int ret;
 
-again:
+	dout("%s: obj %p state %d\n", __func__, obj_req, obj_req->read_state);
 	switch (obj_req->read_state) {
 	case RBD_OBJ_READ_START:
 		rbd_assert(!*result);
 
-		if (!rbd_obj_may_exist(obj_req)) {
-			*result = -ENOENT;
+		if (rbd_obj_may_exist(obj_req)) {
+			ret = rbd_obj_read_object(obj_req);
+			if (ret) {
+				*result = ret;
+				obj_req->read_state = RBD_OBJ_READ_DONE;
+				return true;
+			}
 			obj_req->read_state = RBD_OBJ_READ_OBJECT;
-			goto again;
-		}
-
-		ret = rbd_obj_read_object(obj_req);
-		if (ret) {
-			*result = ret;
-			return true;
+			return false;
 		}
+		*result = -ENOENT;
 		obj_req->read_state = RBD_OBJ_READ_OBJECT;
-		return false;
+		/* fall through */
 	case RBD_OBJ_READ_OBJECT:
 		if (*result == -ENOENT && rbd_dev->parent_overlap) {
 			/* reverse map this object extent onto the parent */
 			ret = rbd_obj_calc_img_extents(obj_req, false);
 			if (ret) {
 				*result = ret;
+				obj_req->read_state = RBD_OBJ_READ_DONE;
 				return true;
 			}
 			if (obj_req->num_img_extents) {
 				ret = rbd_obj_read_from_parent(obj_req);
 				if (ret) {
 					*result = ret;
+					obj_req->read_state = RBD_OBJ_READ_DONE;
 					return true;
 				}
 				obj_req->read_state = RBD_OBJ_READ_PARENT;
@@ -2977,6 +2980,7 @@ static bool rbd_obj_advance_read(struct rbd_obj_request *obj_req, int *result)
 				rbd_assert(*result == obj_req->ex.oe_len);
 			*result = 0;
 		}
+		obj_req->read_state = RBD_OBJ_READ_DONE;
 		return true;
 	case RBD_OBJ_READ_PARENT:
 		/*
@@ -2990,6 +2994,9 @@ static bool rbd_obj_advance_read(struct rbd_obj_request *obj_req, int *result)
 				rbd_obj_zero_range(obj_req, obj_overlap,
 					    obj_req->ex.oe_len - obj_overlap);
 		}
+		obj_req->read_state = RBD_OBJ_READ_DONE;
+		/* fall through */
+	case RBD_OBJ_READ_DONE:
 		return true;
 	default:
 		BUG();
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 05/15] rbd: reorder switch statement in rbd_advance_write()
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (3 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 04/15] rbd: reorder switch statement in rbd_advance_read() Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 06/15] rbd: add 'done' state for rbd_obj_advance_copyup() Hannes Reinecke
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Reorder switch statement to avoid the use of a label/goto
statement and add a 'done' state to indicate that the state
machine has completed.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4d7857667e9c..766d67e4d5e5 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -267,7 +267,8 @@ enum rbd_obj_read_state {
  * even if there is a parent).
  */
 enum rbd_obj_write_state {
-	RBD_OBJ_WRITE_START = 1,
+	RBD_OBJ_WRITE_DONE,
+	RBD_OBJ_WRITE_START,
 	RBD_OBJ_WRITE_PRE_OBJECT_MAP,
 	RBD_OBJ_WRITE_OBJECT,
 	__RBD_OBJ_WRITE_COPYUP,
@@ -3382,31 +3383,37 @@ static bool rbd_obj_advance_write(struct rbd_obj_request *obj_req, int *result)
 	int ret;
 
 again:
+	dout("%s: obj %p state %d\n", __func__, obj_req, obj_req->write_state);
 	switch (obj_req->write_state) {
 	case RBD_OBJ_WRITE_START:
 		rbd_assert(!*result);
 
-		if (rbd_obj_write_is_noop(obj_req))
+		if (rbd_obj_write_is_noop(obj_req)) {
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
+		}
 
 		ret = rbd_obj_write_pre_object_map(obj_req);
 		if (ret < 0) {
 			*result = ret;
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
 		}
 		obj_req->write_state = RBD_OBJ_WRITE_PRE_OBJECT_MAP;
-		if (ret > 0)
-			goto again;
-		return false;
+		if (ret == 0)
+			return false;
+		/* fall through */
 	case RBD_OBJ_WRITE_PRE_OBJECT_MAP:
 		if (*result) {
 			rbd_warn(rbd_dev, "pre object map update failed: %d",
 				 *result);
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
 		}
 		ret = rbd_obj_write_object(obj_req);
 		if (ret) {
 			*result = ret;
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
 		}
 		obj_req->write_state = RBD_OBJ_WRITE_OBJECT;
@@ -3426,33 +3433,40 @@ static bool rbd_obj_advance_write(struct rbd_obj_request *obj_req, int *result)
 			if (obj_req->flags & RBD_OBJ_FLAG_DELETION)
 				*result = 0;
 		}
-		if (*result)
+		if (*result) {
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
-
+		}
 		obj_req->write_state = RBD_OBJ_WRITE_COPYUP;
 		goto again;
 	case __RBD_OBJ_WRITE_COPYUP:
 		if (!rbd_obj_advance_copyup(obj_req, result))
 			return false;
+		obj_req->write_state = RBD_OBJ_WRITE_COPYUP;
 		/* fall through */
 	case RBD_OBJ_WRITE_COPYUP:
 		if (*result) {
 			rbd_warn(rbd_dev, "copyup failed: %d", *result);
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
 		}
+		obj_req->write_state = RBD_OBJ_WRITE_POST_OBJECT_MAP;
 		ret = rbd_obj_write_post_object_map(obj_req);
 		if (ret < 0) {
 			*result = ret;
+			obj_req->write_state = RBD_OBJ_WRITE_DONE;
 			return true;
 		}
-		obj_req->write_state = RBD_OBJ_WRITE_POST_OBJECT_MAP;
-		if (ret > 0)
-			goto again;
-		return false;
+		if (ret == 0)
+			return false;
+		/* fall through */
 	case RBD_OBJ_WRITE_POST_OBJECT_MAP:
 		if (*result)
 			rbd_warn(rbd_dev, "post object map update failed: %d",
 				 *result);
+		obj_req->write_state = RBD_OBJ_WRITE_DONE;
+		/* fall through */
+	case RBD_OBJ_WRITE_DONE:
 		return true;
 	default:
 		BUG();
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 06/15] rbd: add 'done' state for rbd_obj_advance_copyup()
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (4 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 05/15] rbd: reorder switch statement in rbd_advance_write() Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 07/15] rbd: use callback for image request completion Hannes Reinecke
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Rename 'RBD_OBJ_COPYUP_WRITE_OBJECT' to 'RBD_OBJ_COPYUP_DONE' to
signal that the state machine has completed.
With that we can rename '__RBD_OBJ_COPYUP_WRITE_OBJECT' to
'RBD_OBJ_COPYUP_WRITE_OBJECT'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 766d67e4d5e5..c31507a5fdd2 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -277,11 +277,11 @@ enum rbd_obj_write_state {
 };
 
 enum rbd_obj_copyup_state {
-	RBD_OBJ_COPYUP_START = 1,
+	RBD_OBJ_COPYUP_DONE,
+	RBD_OBJ_COPYUP_START,
 	RBD_OBJ_COPYUP_READ_PARENT,
 	__RBD_OBJ_COPYUP_OBJECT_MAPS,
 	RBD_OBJ_COPYUP_OBJECT_MAPS,
-	__RBD_OBJ_COPYUP_WRITE_OBJECT,
 	RBD_OBJ_COPYUP_WRITE_OBJECT,
 };
 
@@ -3294,6 +3294,8 @@ static bool rbd_obj_advance_copyup(struct rbd_obj_request *obj_req, int *result)
 	int ret;
 
 again:
+	dout("%s: obj %p copyup %d pending %d\n", __func__,
+	     obj_req, obj_req->copyup_state, obj_req->pending.num_pending);
 	switch (obj_req->copyup_state) {
 	case RBD_OBJ_COPYUP_START:
 		rbd_assert(!*result);
@@ -3301,17 +3303,19 @@ static bool rbd_obj_advance_copyup(struct rbd_obj_request *obj_req, int *result)
 		ret = rbd_obj_copyup_read_parent(obj_req);
 		if (ret) {
 			*result = ret;
+			obj_req->copyup_state = RBD_OBJ_COPYUP_DONE;
 			return true;
 		}
 		if (obj_req->num_img_extents)
 			obj_req->copyup_state = RBD_OBJ_COPYUP_READ_PARENT;
 		else
-			obj_req->copyup_state = RBD_OBJ_COPYUP_WRITE_OBJECT;
+			obj_req->copyup_state = RBD_OBJ_COPYUP_DONE;
 		return false;
 	case RBD_OBJ_COPYUP_READ_PARENT:
-		if (*result)
+		if (*result) {
+			obj_req->copyup_state = RBD_OBJ_COPYUP_DONE;
 			return true;
-
+		}
 		if (is_zero_bvecs(obj_req->copyup_bvecs,
 				  rbd_obj_img_extents_bytes(obj_req))) {
 			dout("%s %p detected zeros\n", __func__, obj_req);
@@ -3329,27 +3333,30 @@ static bool rbd_obj_advance_copyup(struct rbd_obj_request *obj_req, int *result)
 	case __RBD_OBJ_COPYUP_OBJECT_MAPS:
 		if (!pending_result_dec(&obj_req->pending, result))
 			return false;
+		obj_req->copyup_state = RBD_OBJ_COPYUP_OBJECT_MAPS;
 		/* fall through */
 	case RBD_OBJ_COPYUP_OBJECT_MAPS:
 		if (*result) {
 			rbd_warn(rbd_dev, "snap object map update failed: %d",
 				 *result);
+			obj_req->copyup_state = RBD_OBJ_COPYUP_DONE;
 			return true;
 		}
 
 		rbd_obj_copyup_write_object(obj_req);
 		if (!obj_req->pending.num_pending) {
 			*result = obj_req->pending.result;
-			obj_req->copyup_state = RBD_OBJ_COPYUP_WRITE_OBJECT;
+			obj_req->copyup_state = RBD_OBJ_COPYUP_DONE;
 			goto again;
 		}
-		obj_req->copyup_state = __RBD_OBJ_COPYUP_WRITE_OBJECT;
+		obj_req->copyup_state = RBD_OBJ_COPYUP_WRITE_OBJECT;
 		return false;
-	case __RBD_OBJ_COPYUP_WRITE_OBJECT:
+	case RBD_OBJ_COPYUP_WRITE_OBJECT:
 		if (!pending_result_dec(&obj_req->pending, result))
 			return false;
+		obj_req->copyup_state = RBD_OBJ_COPYUP_DONE;
 		/* fall through */
-	case RBD_OBJ_COPYUP_WRITE_OBJECT:
+	case RBD_OBJ_COPYUP_DONE:
 		return true;
 	default:
 		BUG();
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 07/15] rbd: use callback for image request completion
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (5 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 06/15] rbd: add 'done' state for rbd_obj_advance_copyup() Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03 17:13   ` Ilya Dryomov
  2020-01-31 10:37 ` [PATCH 08/15] rbd: add debugging statements for the state machine Hannes Reinecke
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Use callbacks to simplify the code and to separate out the different
code paths for parent and child requests.
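
The callback decouples state-machine completion from the two
hard-coded completion paths. As a purely hypothetical example (not
part of this series), a synchronous waiter could now be added without
touching rbd_img_handle_request():

struct rbd_sync_wait {
        struct completion comp;
        int result;
};

static void rbd_img_end_sync_request(struct rbd_img_request *img_req,
                                     int result)
{
        struct rbd_sync_wait *wait = img_req->callback_data;

        wait->result = result;
        complete(&wait->comp);
}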

Suggested-by: David Disseldorp <ddiss@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 61 +++++++++++++++++++++++++++++------------------------
 1 file changed, 33 insertions(+), 28 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c31507a5fdd2..8cfd9407cbb8 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -317,6 +317,9 @@ struct rbd_obj_request {
 	struct kref		kref;
 };
 
+typedef void (*rbd_img_request_cb_t)(struct rbd_img_request *img_request,
+				     int result);
+
 enum img_req_flags {
 	IMG_REQ_CHILD,		/* initiator: block = 0, child image = 1 */
 	IMG_REQ_LAYERED,	/* ENOENT handling: normal = 0, layered = 1 */
@@ -339,11 +342,8 @@ struct rbd_img_request {
 		u64			snap_id;	/* for reads */
 		struct ceph_snap_context *snapc;	/* for writes */
 	};
-	union {
-		struct request		*rq;		/* block request */
-		struct rbd_obj_request	*obj_request;	/* obj req initiator */
-	};
-
+	void			*callback_data;
+	rbd_img_request_cb_t	callback;
 	struct list_head	lock_item;
 	struct list_head	object_extents;	/* obj_req.ex structs */
 	struct mutex		object_mutex;
@@ -506,6 +506,8 @@ static ssize_t add_single_major_store(struct bus_type *bus, const char *buf,
 static ssize_t remove_single_major_store(struct bus_type *bus, const char *buf,
 					 size_t count);
 static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth);
+static void rbd_img_end_child_request(struct rbd_img_request *img_req,
+				      int result);
 
 static int rbd_dev_id_to_minor(int dev_id)
 {
@@ -2882,7 +2884,8 @@ static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req)
 		return -ENOMEM;
 
 	__set_bit(IMG_REQ_CHILD, &child_img_req->flags);
-	child_img_req->obj_request = obj_req;
+	child_img_req->callback = rbd_img_end_child_request;
+	child_img_req->callback_data = obj_req;
 
 	dout("%s child_img_req %p for obj_req %p\n", __func__, child_img_req,
 	     obj_req);
@@ -3506,14 +3509,12 @@ static bool __rbd_obj_handle_request(struct rbd_obj_request *obj_req,
 	return done;
 }
 
-/*
- * This is open-coded in rbd_img_handle_request() to avoid parent chain
- * recursion.
- */
 static void rbd_obj_handle_request(struct rbd_obj_request *obj_req, int result)
 {
-	if (__rbd_obj_handle_request(obj_req, &result))
+	if (__rbd_obj_handle_request(obj_req, &result)) {
+		/* Recurse into parent */
 		rbd_img_handle_request(obj_req->img_request, result);
+	}
 }
 
 static bool need_exclusive_lock(struct rbd_img_request *img_req)
@@ -3695,26 +3696,29 @@ static bool __rbd_img_handle_request(struct rbd_img_request *img_req,
 	return done;
 }
 
-static void rbd_img_handle_request(struct rbd_img_request *img_req, int result)
+static void rbd_img_end_child_request(struct rbd_img_request *img_req,
+				      int result)
 {
-again:
-	if (!__rbd_img_handle_request(img_req, &result))
-		return;
+	struct rbd_obj_request *obj_req = img_req->callback_data;
 
-	if (test_bit(IMG_REQ_CHILD, &img_req->flags)) {
-		struct rbd_obj_request *obj_req = img_req->obj_request;
+	rbd_img_request_put(img_req);
+	rbd_obj_handle_request(obj_req, result);
+}
 
-		rbd_img_request_put(img_req);
-		if (__rbd_obj_handle_request(obj_req, &result)) {
-			img_req = obj_req->img_request;
-			goto again;
-		}
-	} else {
-		struct request *rq = img_req->rq;
+static void rbd_img_end_request(struct rbd_img_request *img_req, int result)
+{
+	struct request *rq = img_req->callback_data;
 
-		rbd_img_request_put(img_req);
-		blk_mq_end_request(rq, errno_to_blk_status(result));
-	}
+	rbd_img_request_put(img_req);
+	blk_mq_end_request(rq, errno_to_blk_status(result));
+}
+
+void rbd_img_handle_request(struct rbd_img_request *img_req, int result)
+{
+	if (!__rbd_img_handle_request(img_req, &result))
+		return;
+
+	img_req->callback(img_req, result);
 }
 
 static const struct rbd_client_id rbd_empty_cid;
@@ -4840,7 +4844,8 @@ static void rbd_queue_workfn(struct work_struct *work)
 		result = -ENOMEM;
 		goto err_rq;
 	}
-	img_request->rq = rq;
+	img_request->callback = rbd_img_end_request;
+	img_request->callback_data = rq;
 	snapc = NULL; /* img_request consumes a ref */
 
 	dout("%s rbd_dev %p img_req %p %s %llu~%llu\n", __func__, rbd_dev,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 08/15] rbd: add debugging statements for the state machine
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (6 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 07/15] rbd: use callback for image request completion Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 09/15] rbd: count pending object requests in-line Hannes Reinecke
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Add additional debugging statements to analyse the state machine.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8cfd9407cbb8..b708f5ecda07 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -291,6 +291,7 @@ struct rbd_obj_request {
 	union {
 		enum rbd_obj_read_state	 read_state;	/* for reads */
 		enum rbd_obj_write_state write_state;	/* for writes */
+		unsigned char		 obj_state;	/* generic access */
 	};
 
 	struct rbd_img_request	*img_request;
@@ -1352,8 +1353,12 @@ static inline void rbd_img_obj_request_add(struct rbd_img_request *img_request,
 static inline void rbd_img_obj_request_del(struct rbd_img_request *img_request,
 					struct rbd_obj_request *obj_request)
 {
-	dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
-	list_del(&obj_request->ex.oe_item);
+	dout("%s: img %p obj %p state %d copyup %d pending %d\n", __func__,
+	     img_request, obj_request, obj_request->obj_state,
+	     obj_request->copyup_state, obj_request->pending.num_pending);
+	WARN_ON(obj_request->obj_state > 1);
+	WARN_ON(obj_request->pending.num_pending);
+	list_del_init(&obj_request->ex.oe_item);
 	rbd_assert(obj_request->img_request == img_request);
 	rbd_obj_request_put(obj_request);
 }
@@ -1497,6 +1502,8 @@ __rbd_obj_add_osd_request(struct rbd_obj_request *obj_req,
 	req->r_callback = rbd_osd_req_callback;
 	req->r_priv = obj_req;
 
+	dout("%s: osd_req %p for obj_req %p\n", __func__, req, obj_req);
+
 	/*
 	 * Data objects may be stored in a separate pool, but always in
 	 * the same namespace in that pool as the header in its pool.
@@ -1686,6 +1693,7 @@ static void rbd_img_request_destroy(struct kref *kref)
 	dout("%s: img %p\n", __func__, img_request);
 
 	WARN_ON(!list_empty(&img_request->lock_item));
+	WARN_ON(img_request->state != RBD_IMG_DONE);
 	mutex_lock(&img_request->object_mutex);
 	for_each_obj_request_safe(img_request, obj_request, next_obj_request)
 		rbd_img_obj_request_del(img_request, obj_request);
@@ -3513,6 +3521,8 @@ static void rbd_obj_handle_request(struct rbd_obj_request *obj_req, int result)
 {
 	if (__rbd_obj_handle_request(obj_req, &result)) {
 		/* Recurse into parent */
+		dout("%s: obj %p parent %p result %d\n", __func__,
+		     obj_req, obj_req->img_request, result);
 		rbd_img_handle_request(obj_req->img_request, result);
 	}
 }
@@ -3603,6 +3613,9 @@ static void rbd_img_object_requests(struct rbd_img_request *img_req)
 		int result = 0;
 
 		if (__rbd_obj_handle_request(obj_req, &result)) {
+			dout("%s: obj %p parent %p img %p result %d\n",
+			     __func__, obj_req, obj_req->img_request,
+			     img_req, result);
 			if (result) {
 				img_req->pending.result = result;
 				mutex_unlock(&img_req->object_mutex);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 09/15] rbd: count pending object requests in-line
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (7 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 08/15] rbd: add debugging statements for the state machine Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03 17:47   ` Ilya Dryomov
  2020-01-31 10:37 ` [PATCH 10/15] rbd: kill 'work_result' Hannes Reinecke
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Instead of keeping a counter for outstanding object requests,
check the state of each object request and count only those
which are not yet in their final state.
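
This relies on the 'done' reordering from the earlier patches: the
final state is value 0 and the start state is value 1 for both reads
and writes, so the pending check reduces to (sketch, mirroring the
rbd_img_object_requests_pending() helper below):

        /* obj_state aliases read_state/write_state via the union */
        if (obj_req->obj_state > 1)     /* neither DONE (0) nor START (1) */
                num_pending++;          /* request is still in flight */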

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 41 ++++++++++++++++++++++++++++++-----------
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b708f5ecda07..a6c95b6e9c0c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -350,7 +350,7 @@ struct rbd_img_request {
 	struct mutex		object_mutex;
 
 	struct mutex		state_mutex;
-	struct pending_result	pending;
+	int			pending_result;
 	struct work_struct	work;
 	int			work_result;
 	struct kref		kref;
@@ -3602,11 +3602,12 @@ static int rbd_img_exclusive_lock(struct rbd_img_request *img_req)
 	return 0;
 }
 
-static void rbd_img_object_requests(struct rbd_img_request *img_req)
+static int rbd_img_object_requests(struct rbd_img_request *img_req)
 {
 	struct rbd_obj_request *obj_req;
+	int num_pending = 0;
 
-	rbd_assert(!img_req->pending.result && !img_req->pending.num_pending);
+	rbd_assert(!img_req->pending_result);
 
 	mutex_lock(&img_req->object_mutex);
 	for_each_obj_request(img_req, obj_req) {
@@ -3617,15 +3618,33 @@ static void rbd_img_object_requests(struct rbd_img_request *img_req)
 			     __func__, obj_req, obj_req->img_request,
 			     img_req, result);
 			if (result) {
-				img_req->pending.result = result;
-				mutex_unlock(&img_req->object_mutex);
-				return;
+				img_req->pending_result = result;
+				break;
 			}
 		} else {
-			img_req->pending.num_pending++;
+			num_pending++;
 		}
 	}
 	mutex_unlock(&img_req->object_mutex);
+	return num_pending;
+}
+
+static int rbd_img_object_requests_pending(struct rbd_img_request *img_req)
+{
+	struct rbd_obj_request *obj_req;
+	int num_pending = 0;
+
+	mutex_lock(&img_req->object_mutex);
+	for_each_obj_request(img_req, obj_req) {
+		if (obj_req->obj_state > 1)
+			num_pending++;
+		else if (WARN_ON(obj_req->obj_state == 1))
+			num_pending++;
+		else if (WARN_ON(obj_req->pending.num_pending))
+			num_pending++;
+	}
+	mutex_unlock(&img_req->object_mutex);
+	return num_pending;
 }
 
 static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
@@ -3658,16 +3677,16 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
 			   __rbd_is_lock_owner(rbd_dev));
 
 		img_req->state = RBD_IMG_OBJECT_REQUESTS;
-		rbd_img_object_requests(img_req);
-		if (!img_req->pending.num_pending) {
-			*result = img_req->pending.result;
+		if (!rbd_img_object_requests(img_req)) {
+			*result = img_req->pending_result;
 			img_req->state = RBD_IMG_DONE;
 			return true;
 		}
 		return false;
 	case RBD_IMG_OBJECT_REQUESTS:
-		if (!pending_result_dec(&img_req->pending, result))
+		if (rbd_img_object_requests_pending(img_req))
 			return false;
+		*result = img_req->pending_result;
 		img_req->state = RBD_IMG_DONE;
 		/* fall through */
 	case RBD_IMG_DONE:
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 10/15] rbd: kill 'work_result'
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (8 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 09/15] rbd: count pending object requests in-line Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 11/15] rbd: drop state_mutex in __rbd_img_handle_request() Hannes Reinecke
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Use 'pending_result' instead of 'work_result' and kill the latter.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index a6c95b6e9c0c..671e941d6edf 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -352,7 +352,6 @@ struct rbd_img_request {
 	struct mutex		state_mutex;
 	int			pending_result;
 	struct work_struct	work;
-	int			work_result;
 	struct kref		kref;
 };
 
@@ -2834,13 +2833,13 @@ static void rbd_img_handle_request_work(struct work_struct *work)
 	struct rbd_img_request *img_req =
 	    container_of(work, struct rbd_img_request, work);
 
-	rbd_img_handle_request(img_req, img_req->work_result);
+	rbd_img_handle_request(img_req, img_req->pending_result);
 }
 
 static void rbd_img_schedule(struct rbd_img_request *img_req, int result)
 {
 	INIT_WORK(&img_req->work, rbd_img_handle_request_work);
-	img_req->work_result = result;
+	img_req->pending_result = result;
 	queue_work(rbd_wq, &img_req->work);
 }
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 11/15] rbd: drop state_mutex in __rbd_img_handle_request()
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (9 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 10/15] rbd: kill 'work_result' Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03 18:01   ` Ilya Dryomov
  2020-01-31 10:37 ` [PATCH 12/15] rbd: kill img_request kref Hannes Reinecke
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

The use of READ_ONCE/WRITE_ONCE for the image request state allows
us to drop the state_mutex in __rbd_img_handle_request().

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 671e941d6edf..db04401c4d8b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -349,7 +349,6 @@ struct rbd_img_request {
 	struct list_head	object_extents;	/* obj_req.ex structs */
 	struct mutex		object_mutex;
 
-	struct mutex		state_mutex;
 	int			pending_result;
 	struct work_struct	work;
 	struct kref		kref;
@@ -1674,7 +1673,6 @@ static struct rbd_img_request *rbd_img_request_create(
 
 	INIT_LIST_HEAD(&img_request->lock_item);
 	INIT_LIST_HEAD(&img_request->object_extents);
-	mutex_init(&img_request->state_mutex);
 	mutex_init(&img_request->object_mutex);
 	kref_init(&img_request->kref);
 
@@ -2529,7 +2527,7 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
 		}
 	}
 	mutex_unlock(&img_req->object_mutex);
-	img_req->state = RBD_IMG_START;
+	WRITE_ONCE(img_req->state, RBD_IMG_START);
 	return 0;
 }
 
@@ -3652,15 +3650,15 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
 	int ret;
 
 	dout("%s: img %p state %d\n", __func__, img_req, img_req->state);
-	switch (img_req->state) {
+	switch (READ_ONCE(img_req->state)) {
 	case RBD_IMG_START:
 		rbd_assert(!*result);
 
-		img_req->state = RBD_IMG_EXCLUSIVE_LOCK;
+		WRITE_ONCE(img_req->state, RBD_IMG_EXCLUSIVE_LOCK);
 		ret = rbd_img_exclusive_lock(img_req);
 		if (ret < 0) {
 			*result = ret;
-			img_req->state = RBD_IMG_DONE;
+			WRITE_ONCE(img_req->state, RBD_IMG_DONE);
 			return true;
 		}
 		if (ret == 0)
@@ -3668,17 +3666,17 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
 		/* fall through */
 	case RBD_IMG_EXCLUSIVE_LOCK:
 		if (*result) {
-			img_req->state = RBD_IMG_DONE;
+			WRITE_ONCE(img_req->state, RBD_IMG_DONE);
 			return true;
 		}
 
 		rbd_assert(!need_exclusive_lock(img_req) ||
 			   __rbd_is_lock_owner(rbd_dev));
 
-		img_req->state = RBD_IMG_OBJECT_REQUESTS;
+		WRITE_ONCE(img_req->state, RBD_IMG_OBJECT_REQUESTS);
 		if (!rbd_img_object_requests(img_req)) {
 			*result = img_req->pending_result;
-			img_req->state = RBD_IMG_DONE;
+			WRITE_ONCE(img_req->state, RBD_IMG_DONE);
 			return true;
 		}
 		return false;
@@ -3686,7 +3684,7 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
 		if (rbd_img_object_requests_pending(img_req))
 			return false;
 		*result = img_req->pending_result;
-		img_req->state = RBD_IMG_DONE;
+		WRITE_ONCE(img_req->state, RBD_IMG_DONE);
 		/* fall through */
 	case RBD_IMG_DONE:
 		return true;
@@ -3706,16 +3704,12 @@ static bool __rbd_img_handle_request(struct rbd_img_request *img_req,
 
 	if (need_exclusive_lock(img_req)) {
 		down_read(&rbd_dev->lock_rwsem);
-		mutex_lock(&img_req->state_mutex);
 		done = rbd_img_advance(img_req, result);
 		if (done)
 			rbd_lock_del_request(img_req);
-		mutex_unlock(&img_req->state_mutex);
 		up_read(&rbd_dev->lock_rwsem);
 	} else {
-		mutex_lock(&img_req->state_mutex);
 		done = rbd_img_advance(img_req, result);
-		mutex_unlock(&img_req->state_mutex);
 	}
 
 	if (done && *result) {
@@ -3985,10 +3979,8 @@ static void wake_lock_waiters(struct rbd_device *rbd_dev, int result)
 	}
 
 	list_for_each_entry(img_req, &rbd_dev->acquiring_list, lock_item) {
-		mutex_lock(&img_req->state_mutex);
-		rbd_assert(img_req->state == RBD_IMG_EXCLUSIVE_LOCK);
+		rbd_assert(READ_ONCE(img_req->state) == RBD_IMG_EXCLUSIVE_LOCK);
 		rbd_img_schedule(img_req, result);
-		mutex_unlock(&img_req->state_mutex);
 	}
 
 	list_splice_tail_init(&rbd_dev->acquiring_list, &rbd_dev->running_list);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 12/15] rbd: kill img_request kref
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (10 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 11/15] rbd: drop state_mutex in __rbd_img_handle_request() Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 13/15] rbd: schedule image_request after preparation Hannes Reinecke
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

The reference counter is never increased, so we might as well call
rbd_img_request_destroy() directly and drop the kref.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index db04401c4d8b..2566d6bd8230 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -351,7 +351,6 @@ struct rbd_img_request {
 
 	int			pending_result;
 	struct work_struct	work;
-	struct kref		kref;
 };
 
 #define for_each_obj_request(ireq, oreq) \
@@ -1329,15 +1328,6 @@ static void rbd_obj_request_put(struct rbd_obj_request *obj_request)
 	kref_put(&obj_request->kref, rbd_obj_request_destroy);
 }
 
-static void rbd_img_request_destroy(struct kref *kref);
-static void rbd_img_request_put(struct rbd_img_request *img_request)
-{
-	rbd_assert(img_request != NULL);
-	dout("%s: img %p (was %d)\n", __func__, img_request,
-		kref_read(&img_request->kref));
-	kref_put(&img_request->kref, rbd_img_request_destroy);
-}
-
 static inline void rbd_img_obj_request_add(struct rbd_img_request *img_request,
 					struct rbd_obj_request *obj_request)
 {
@@ -1674,19 +1664,15 @@ static struct rbd_img_request *rbd_img_request_create(
 	INIT_LIST_HEAD(&img_request->lock_item);
 	INIT_LIST_HEAD(&img_request->object_extents);
 	mutex_init(&img_request->object_mutex);
-	kref_init(&img_request->kref);
 
 	return img_request;
 }
 
-static void rbd_img_request_destroy(struct kref *kref)
+static void rbd_img_request_destroy(struct rbd_img_request *img_request)
 {
-	struct rbd_img_request *img_request;
 	struct rbd_obj_request *obj_request;
 	struct rbd_obj_request *next_obj_request;
 
-	img_request = container_of(kref, struct rbd_img_request, kref);
-
 	dout("%s: img %p\n", __func__, img_request);
 
 	WARN_ON(!list_empty(&img_request->lock_item));
@@ -2920,7 +2906,7 @@ static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req)
 					      obj_req->copyup_bvecs);
 	}
 	if (ret) {
-		rbd_img_request_put(child_img_req);
+		rbd_img_request_destroy(child_img_req);
 		return ret;
 	}
 
@@ -3726,7 +3712,7 @@ static void rbd_img_end_child_request(struct rbd_img_request *img_req,
 {
 	struct rbd_obj_request *obj_req = img_req->callback_data;
 
-	rbd_img_request_put(img_req);
+	rbd_img_request_destroy(img_req);
 	rbd_obj_handle_request(obj_req, result);
 }
 
@@ -3734,7 +3720,7 @@ static void rbd_img_end_request(struct rbd_img_request *img_req, int result)
 {
 	struct request *rq = img_req->callback_data;
 
-	rbd_img_request_put(img_req);
+	rbd_img_request_destroy(img_req);
 	blk_mq_end_request(rq, errno_to_blk_status(result));
 }
 
@@ -4886,7 +4872,7 @@ static void rbd_queue_workfn(struct work_struct *work)
 	return;
 
 err_img_request:
-	rbd_img_request_put(img_request);
+	rbd_img_request_destroy(img_request);
 err_rq:
 	if (result)
 		rbd_warn(rbd_dev, "%s %llx at %llx result %d",
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 13/15] rbd: schedule image_request after preparation
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (11 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 12/15] rbd: kill img_request kref Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03 18:40   ` Ilya Dryomov
  2020-01-31 10:37 ` [PATCH 14/15] rbd: embed image request as blk_mq request payload Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 15/15] rbd: switch to blk-mq Hannes Reinecke
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Instead of pushing I/O directly to the workqueue we should prepare
it first and push it onto the workqueue only as the last step.
This allows us to signal back-pressure to the block layer in case
the queue fills up.
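
The resulting submission flow, as a simplified sketch (error handling
abridged; validate_request() and prepare_img_request() are made-up
stand-ins for the checks and setup done in the real function):

static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                 const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        struct rbd_img_request *img_request;

        if (!validate_request(rq))
                return BLK_STS_IOERR;           /* error back to blk-mq */

        blk_mq_start_request(rq);
        img_request = prepare_img_request(rq);
        if (!img_request) {
                blk_mq_end_request(rq, BLK_STS_IOERR);
                return BLK_STS_OK;              /* request already completed */
        }

        rbd_img_schedule(img_request, 0);       /* actual I/O runs from rbd_wq */
        return BLK_STS_OK;
}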

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 52 +++++++++++++++-------------------------------------
 1 file changed, 15 insertions(+), 37 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 2566d6bd8230..9829f225c57d 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4775,9 +4775,10 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
 	return ret;
 }
 
-static void rbd_queue_workfn(struct work_struct *work)
+static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
 {
-	struct request *rq = blk_mq_rq_from_pdu(work);
+	struct request *rq = bd->rq;
 	struct rbd_device *rbd_dev = rq->q->queuedata;
 	struct rbd_img_request *img_request;
 	struct ceph_snap_context *snapc = NULL;
@@ -4802,24 +4803,14 @@ static void rbd_queue_workfn(struct work_struct *work)
 		break;
 	default:
 		dout("%s: non-fs request type %d\n", __func__, req_op(rq));
-		result = -EIO;
-		goto err;
-	}
-
-	/* Ignore/skip any zero-length requests */
-
-	if (!length) {
-		dout("%s: zero-length request\n", __func__);
-		result = 0;
-		goto err_rq;
+		return BLK_STS_IOERR;
 	}
 
 	if (op_type != OBJ_OP_READ) {
 		if (rbd_is_ro(rbd_dev)) {
 			rbd_warn(rbd_dev, "%s on read-only mapping",
 				 obj_op_name(op_type));
-			result = -EIO;
-			goto err;
+			return BLK_STS_IOERR;
 		}
 		rbd_assert(!rbd_is_snap(rbd_dev));
 	}
@@ -4827,11 +4818,17 @@ static void rbd_queue_workfn(struct work_struct *work)
 	if (offset && length > U64_MAX - offset + 1) {
 		rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
 			 length);
-		result = -EINVAL;
-		goto err_rq;	/* Shouldn't happen */
+		return BLK_STS_NOSPC;	/* Shouldn't happen */
 	}
 
 	blk_mq_start_request(rq);
+	/* Ignore/skip any zero-length requests */
+	if (!length) {
+		dout("%s: zero-length request\n", __func__);
+		result = 0;
+		goto err;
+	}
+
 
 	mapping_size = READ_ONCE(rbd_dev->mapping.size);
 	if (op_type != OBJ_OP_READ) {
@@ -4868,8 +4865,8 @@ static void rbd_queue_workfn(struct work_struct *work)
 	if (result)
 		goto err_img_request;
 
-	rbd_img_handle_request(img_request, 0);
-	return;
+	rbd_img_schedule(img_request, 0);
+	return BLK_STS_OK;
 
 err_img_request:
 	rbd_img_request_destroy(img_request);
@@ -4880,15 +4877,6 @@ static void rbd_queue_workfn(struct work_struct *work)
 	ceph_put_snap_context(snapc);
 err:
 	blk_mq_end_request(rq, errno_to_blk_status(result));
-}
-
-static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
-		const struct blk_mq_queue_data *bd)
-{
-	struct request *rq = bd->rq;
-	struct work_struct *work = blk_mq_rq_to_pdu(rq);
-
-	queue_work(rbd_wq, work);
 	return BLK_STS_OK;
 }
 
@@ -5055,18 +5043,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	return ret;
 }
 
-static int rbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
-		unsigned int hctx_idx, unsigned int numa_node)
-{
-	struct work_struct *work = blk_mq_rq_to_pdu(rq);
-
-	INIT_WORK(work, rbd_queue_workfn);
-	return 0;
-}
-
 static const struct blk_mq_ops rbd_mq_ops = {
 	.queue_rq	= rbd_queue_rq,
-	.init_request	= rbd_init_request,
 };
 
 static int rbd_init_disk(struct rbd_device *rbd_dev)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 14/15] rbd: embed image request as blk_mq request payload
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (12 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 13/15] rbd: schedule image_request after preparation Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-01-31 10:37 ` [PATCH 15/15] rbd: switch to blk-mq Hannes Reinecke
  14 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Instead of allocating an image request for every block request
we can embed it as the request payload and save the allocation.
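
The mechanism, reduced to a generic sketch (my_cmd and friends are
made-up names, not rbd code): blk-mq allocates tag_set.cmd_size bytes
of per-request payload behind each struct request, reachable through
blk_mq_rq_to_pdu()/blk_mq_rq_from_pdu():

struct my_cmd {                         /* tag_set.cmd_size = sizeof(struct my_cmd) */
        struct work_struct work;        /* INIT_WORK()ed once in ->init_request() */
        int status;
};

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
        struct my_cmd *cmd = blk_mq_rq_to_pdu(bd->rq); /* embedded payload */

        blk_mq_start_request(bd->rq);
        cmd->status = 0;
        queue_work(system_wq, &cmd->work);      /* real work happens elsewhere */
        return BLK_STS_OK;
}

static void my_complete(struct my_cmd *cmd)
{
        struct request *rq = blk_mq_rq_from_pdu(cmd);   /* back to the request */

        blk_mq_end_request(rq, errno_to_blk_status(cmd->status));
}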

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 56 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9829f225c57d..cc3e5116fe58 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -324,6 +324,7 @@ typedef void (*rbd_img_request_cb_t)(struct rbd_img_request *img_request,
 enum img_req_flags {
 	IMG_REQ_CHILD,		/* initiator: block = 0, child image = 1 */
 	IMG_REQ_LAYERED,	/* ENOENT handling: normal = 0, layered = 1 */
+	IMG_REQ_EMBEDDED,	/* free handling: normal = 0, embedded = 1 */
 };
 
 enum rbd_img_state {
@@ -1640,17 +1641,10 @@ static bool rbd_dev_parent_get(struct rbd_device *rbd_dev)
  * that comprises the image request, and the Linux request pointer
  * (if there is one).
  */
-static struct rbd_img_request *rbd_img_request_create(
-					struct rbd_device *rbd_dev,
-					enum obj_operation_type op_type,
-					struct ceph_snap_context *snapc)
+static void rbd_img_request_setup(struct rbd_img_request *img_request,
+		struct rbd_device *rbd_dev, enum obj_operation_type op_type,
+		struct ceph_snap_context *snapc)
 {
-	struct rbd_img_request *img_request;
-
-	img_request = kmem_cache_zalloc(rbd_img_request_cache, GFP_NOIO);
-	if (!img_request)
-		return NULL;
-
 	img_request->rbd_dev = rbd_dev;
 	img_request->op_type = op_type;
 	if (!rbd_img_is_write(img_request))
@@ -1661,9 +1655,25 @@ static struct rbd_img_request *rbd_img_request_create(
 	if (rbd_dev_parent_get(rbd_dev))
 		img_request_layered_set(img_request);
 
+	img_request->pending_result = 0;
+	img_request->state = RBD_IMG_DONE;
 	INIT_LIST_HEAD(&img_request->lock_item);
 	INIT_LIST_HEAD(&img_request->object_extents);
 	mutex_init(&img_request->object_mutex);
+}
+
+struct rbd_img_request *rbd_img_request_create(
+					struct rbd_device *rbd_dev,
+					enum obj_operation_type op_type,
+					struct ceph_snap_context *snapc)
+{
+	struct rbd_img_request *img_request;
+
+	img_request = kmem_cache_zalloc(rbd_img_request_cache, GFP_NOIO);
+	if (!img_request)
+		return NULL;
+
+	rbd_img_request_setup(img_request, rbd_dev, op_type, snapc);
 
 	return img_request;
 }
@@ -1690,7 +1700,8 @@ static void rbd_img_request_destroy(struct rbd_img_request *img_request)
 	if (rbd_img_is_write(img_request))
 		ceph_put_snap_context(img_request->snapc);
 
-	kmem_cache_free(rbd_img_request_cache, img_request);
+	if (!test_bit(IMG_REQ_EMBEDDED, &img_request->flags))
+		kmem_cache_free(rbd_img_request_cache, img_request);
 }
 
 #define BITS_PER_OBJ	2
@@ -4780,7 +4791,7 @@ static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 {
 	struct request *rq = bd->rq;
 	struct rbd_device *rbd_dev = rq->q->queuedata;
-	struct rbd_img_request *img_request;
+	struct rbd_img_request *img_request = blk_mq_rq_to_pdu(rq);
 	struct ceph_snap_context *snapc = NULL;
 	u64 offset = (u64)blk_rq_pos(rq) << SECTOR_SHIFT;
 	u64 length = blk_rq_bytes(rq);
@@ -4845,11 +4856,7 @@ static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto err_rq;
 	}
 
-	img_request = rbd_img_request_create(rbd_dev, op_type, snapc);
-	if (!img_request) {
-		result = -ENOMEM;
-		goto err_rq;
-	}
+	rbd_img_request_setup(img_request, rbd_dev, op_type, snapc);
 	img_request->callback = rbd_img_end_request;
 	img_request->callback_data = rq;
 	snapc = NULL; /* img_request consumes a ref */
@@ -4865,7 +4872,7 @@ static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (result)
 		goto err_img_request;
 
-	rbd_img_schedule(img_request, 0);
+	queue_work(rbd_wq, &img_request->work);
 	return BLK_STS_OK;
 
 err_img_request:
@@ -5043,8 +5050,19 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
 	return ret;
 }
 
+static int rbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
+		unsigned int hctx_idx, unsigned int numa_node)
+{
+	struct rbd_img_request *img_req = blk_mq_rq_to_pdu(rq);
+
+	INIT_WORK(&img_req->work, rbd_img_handle_request_work);
+	set_bit(IMG_REQ_EMBEDDED, &img_req->flags);
+	return 0;
+}
+
 static const struct blk_mq_ops rbd_mq_ops = {
 	.queue_rq	= rbd_queue_rq,
+	.init_request	= rbd_init_request,
 };
 
 static int rbd_init_disk(struct rbd_device *rbd_dev)
@@ -5077,7 +5095,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
 	rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	rbd_dev->tag_set.nr_hw_queues = 1;
-	rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
+	rbd_dev->tag_set.cmd_size = sizeof(struct rbd_img_request);
 
 	err = blk_mq_alloc_tag_set(&rbd_dev->tag_set);
 	if (err)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 15/15] rbd: switch to blk-mq
  2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
                   ` (13 preceding siblings ...)
  2020-01-31 10:37 ` [PATCH 14/15] rbd: embed image request as blk_mq request payload Hannes Reinecke
@ 2020-01-31 10:37 ` Hannes Reinecke
  2020-02-03  8:36   ` Christoph Hellwig
  14 siblings, 1 reply; 25+ messages in thread
From: Hannes Reinecke @ 2020-01-31 10:37 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, ceph-devel,
	linux-block, Hannes Reinecke

Allocate one queue per CPU and get a performance boost from
higher parallelism.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/rbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index cc3e5116fe58..dc3b44177fea 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -5094,7 +5094,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
 	rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
 	rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
 	rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
-	rbd_dev->tag_set.nr_hw_queues = 1;
+	rbd_dev->tag_set.nr_hw_queues = num_present_cpus();
 	rbd_dev->tag_set.cmd_size = sizeof(struct rbd_img_request);
 
 	err = blk_mq_alloc_tag_set(&rbd_dev->tag_set);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 15/15] rbd: switch to blk-mq
  2020-01-31 10:37 ` [PATCH 15/15] rbd: switch to blk-mq Hannes Reinecke
@ 2020-02-03  8:36   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-02-03  8:36 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ilya Dryomov, Sage Weil, Daniel Disseldorp, Jens Axboe,
	ceph-devel, linux-block

On Fri, Jan 31, 2020 at 11:37:39AM +0100, Hannes Reinecke wrote:
> Allocate one queue per CPU and get a performance boost from
> higher parallelism.

Well, the driver is already using blk-mq, so your subject is incorrect.
I think you want to say something like "rbd: support multiple queues".

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/15] rbd: lock object request list
  2020-01-31 10:37 ` [PATCH 01/15] rbd: lock object request list Hannes Reinecke
@ 2020-02-03 16:38   ` Ilya Dryomov
  0 siblings, 0 replies; 25+ messages in thread
From: Ilya Dryomov @ 2020-02-03 16:38 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> The object request list can be accessed from various contexts
> so we need to lock it to avoid concurrent modifications and
> random crashes.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/rbd.c | 36 ++++++++++++++++++++++++++++--------
>  1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 5710b2a8609c..db80b964d8ea 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -344,6 +344,7 @@ struct rbd_img_request {
>
>         struct list_head        lock_item;
>         struct list_head        object_extents; /* obj_req.ex structs */
> +       struct mutex            object_mutex;
>
>         struct mutex            state_mutex;
>         struct pending_result   pending;
> @@ -1664,6 +1665,7 @@ static struct rbd_img_request *rbd_img_request_create(
>         INIT_LIST_HEAD(&img_request->lock_item);
>         INIT_LIST_HEAD(&img_request->object_extents);
>         mutex_init(&img_request->state_mutex);
> +       mutex_init(&img_request->object_mutex);
>         kref_init(&img_request->kref);
>
>         return img_request;
> @@ -1680,8 +1682,10 @@ static void rbd_img_request_destroy(struct kref *kref)
>         dout("%s: img %p\n", __func__, img_request);
>
>         WARN_ON(!list_empty(&img_request->lock_item));
> +       mutex_lock(&img_request->object_mutex);
>         for_each_obj_request_safe(img_request, obj_request, next_obj_request)
>                 rbd_img_obj_request_del(img_request, obj_request);
> +       mutex_unlock(&img_request->object_mutex);
>
>         if (img_request_layered_test(img_request)) {
>                 img_request_layered_clear(img_request);
> @@ -2486,6 +2490,7 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
>         struct rbd_obj_request *obj_req, *next_obj_req;
>         int ret;
>
> +       mutex_lock(&img_req->object_mutex);
>         for_each_obj_request_safe(img_req, obj_req, next_obj_req) {
>                 switch (img_req->op_type) {
>                 case OBJ_OP_READ:
> @@ -2503,14 +2508,16 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
>                 default:
>                         BUG();
>                 }
> -               if (ret < 0)
> +               if (ret < 0) {
> +                       mutex_unlock(&img_req->object_mutex);
>                         return ret;
> +               }
>                 if (ret > 0) {
>                         rbd_img_obj_request_del(img_req, obj_req);
>                         continue;
>                 }
>         }
> -
> +       mutex_unlock(&img_req->object_mutex);
>         img_req->state = RBD_IMG_START;
>         return 0;
>  }
> @@ -2569,6 +2576,7 @@ static int rbd_img_fill_request_nocopy(struct rbd_img_request *img_req,
>          * position in the provided bio (list) or bio_vec array.
>          */
>         fctx->iter = *fctx->pos;
> +       mutex_lock(&img_req->object_mutex);
>         for (i = 0; i < num_img_extents; i++) {
>                 ret = ceph_file_to_extents(&img_req->rbd_dev->layout,
>                                            img_extents[i].fe_off,
> @@ -2576,10 +2584,12 @@ static int rbd_img_fill_request_nocopy(struct rbd_img_request *img_req,
>                                            &img_req->object_extents,
>                                            alloc_object_extent, img_req,
>                                            fctx->set_pos_fn, &fctx->iter);
> -               if (ret)
> +               if (ret) {
> +                       mutex_unlock(&img_req->object_mutex);
>                         return ret;
> +               }
>         }
> -
> +       mutex_unlock(&img_req->object_mutex);
>         return __rbd_img_fill_request(img_req);
>  }
>
> @@ -2620,6 +2630,7 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
>          * or bio_vec array because when mapped, those bio_vecs can straddle
>          * stripe unit boundaries.
>          */
> +       mutex_lock(&img_req->object_mutex);
>         fctx->iter = *fctx->pos;
>         for (i = 0; i < num_img_extents; i++) {
>                 ret = ceph_file_to_extents(&rbd_dev->layout,
> @@ -2629,15 +2640,17 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
>                                            alloc_object_extent, img_req,
>                                            fctx->count_fn, &fctx->iter);
>                 if (ret)
> -                       return ret;
> +                       goto out_unlock;
>         }
>
>         for_each_obj_request(img_req, obj_req) {
>                 obj_req->bvec_pos.bvecs = kmalloc_array(obj_req->bvec_count,
>                                               sizeof(*obj_req->bvec_pos.bvecs),
>                                               GFP_NOIO);
> -               if (!obj_req->bvec_pos.bvecs)
> -                       return -ENOMEM;
> +               if (!obj_req->bvec_pos.bvecs) {
> +                       ret = -ENOMEM;
> +                       goto out_unlock;
> +               }
>         }
>
>         /*
> @@ -2652,10 +2665,14 @@ static int rbd_img_fill_request(struct rbd_img_request *img_req,
>                                            &img_req->object_extents,
>                                            fctx->copy_fn, &fctx->iter);
>                 if (ret)
> -                       return ret;
> +                       goto out_unlock;
>         }
> +       mutex_unlock(&img_req->object_mutex);
>
>         return __rbd_img_fill_request(img_req);
> +out_unlock:
> +       mutex_unlock(&img_req->object_mutex);
> +       return ret;
>  }
>
>  static int rbd_img_fill_nodata(struct rbd_img_request *img_req,
> @@ -3552,18 +3569,21 @@ static void rbd_img_object_requests(struct rbd_img_request *img_req)
>
>         rbd_assert(!img_req->pending.result && !img_req->pending.num_pending);
>
> +       mutex_lock(&img_req->object_mutex);
>         for_each_obj_request(img_req, obj_req) {
>                 int result = 0;
>
>                 if (__rbd_obj_handle_request(obj_req, &result)) {
>                         if (result) {
>                                 img_req->pending.result = result;
> +                               mutex_unlock(&img_req->object_mutex);
>                                 return;
>                         }
>                 } else {
>                         img_req->pending.num_pending++;
>                 }
>         }
> +       mutex_unlock(&img_req->object_mutex);
>  }
>
>  static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)

Hi Hannes,

I asked for this in the previous posting: can you explain what
concurrent modifications are possible?  If you observed crashes,
please share stack traces.  This patch sounds like urgent material,
but I can't move forward on it until I understand the issue.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size
  2020-01-31 10:37 ` [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size Hannes Reinecke
@ 2020-02-03 16:50   ` Ilya Dryomov
  2020-02-04  7:05     ` Hannes Reinecke
  0 siblings, 1 reply; 25+ messages in thread
From: Ilya Dryomov @ 2020-02-03 16:50 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> The mapping size is changed only very infrequently, so we don't
> need to take the header rwsem for checking; using READ_ONCE()
> is sufficient here. And it avoids having to take a lock in the
> hot path.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/rbd.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index db80b964d8ea..792180548e89 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4788,13 +4788,13 @@ static void rbd_queue_workfn(struct work_struct *work)
>
>         blk_mq_start_request(rq);
>
> -       down_read(&rbd_dev->header_rwsem);
> -       mapping_size = rbd_dev->mapping.size;
> +       mapping_size = READ_ONCE(rbd_dev->mapping.size);
>         if (op_type != OBJ_OP_READ) {
> +               down_read(&rbd_dev->header_rwsem);
>                 snapc = rbd_dev->header.snapc;
>                 ceph_get_snap_context(snapc);
> +               up_read(&rbd_dev->header_rwsem);
>         }
> -       up_read(&rbd_dev->header_rwsem);
>
>         if (offset + length > mapping_size) {
>                 rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)", offset,
> @@ -4981,9 +4981,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>         u64 mapping_size;
>         int ret;
>
> -       down_write(&rbd_dev->header_rwsem);
> -       mapping_size = rbd_dev->mapping.size;
> +       mapping_size = READ_ONCE(rbd_dev->mapping.size);
>
> +       down_write(&rbd_dev->header_rwsem);
>         ret = rbd_dev_header_info(rbd_dev);
>         if (ret)
>                 goto out;
> @@ -4999,7 +4999,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>         }
>
>         rbd_assert(!rbd_is_snap(rbd_dev));
> -       rbd_dev->mapping.size = rbd_dev->header.image_size;
> +       WRITE_ONCE(rbd_dev->mapping.size, rbd_dev->header.image_size);
>
>  out:
>         up_write(&rbd_dev->header_rwsem);

Does this result in a measurable performance improvement?

I'd rather not go down the READ/WRITE_ONCE path and would rather
continue using proper locking, especially given that it's only for
reads.  FWIW the plan is to replace header_rwsem with a spin lock,
after refactoring the header read-in code to use a private buffer
instead of reading into rbd_dev directly.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/15] rbd: use callback for image request completion
  2020-01-31 10:37 ` [PATCH 07/15] rbd: use callback for image request completion Hannes Reinecke
@ 2020-02-03 17:13   ` Ilya Dryomov
  0 siblings, 0 replies; 25+ messages in thread
From: Ilya Dryomov @ 2020-02-03 17:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> Use callbacks to simplify the code and to separate out the different
> code paths for parent and child requests.
>
> Suggested-by: David Disseldorp <ddiss@suse.com>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/rbd.c | 61 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 33 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index c31507a5fdd2..8cfd9407cbb8 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -317,6 +317,9 @@ struct rbd_obj_request {
>         struct kref             kref;
>  };
>
> +typedef void (*rbd_img_request_cb_t)(struct rbd_img_request *img_request,
> +                                    int result);
> +
>  enum img_req_flags {
>         IMG_REQ_CHILD,          /* initiator: block = 0, child image = 1 */
>         IMG_REQ_LAYERED,        /* ENOENT handling: normal = 0, layered = 1 */
> @@ -339,11 +342,8 @@ struct rbd_img_request {
>                 u64                     snap_id;        /* for reads */
>                 struct ceph_snap_context *snapc;        /* for writes */
>         };
> -       union {
> -               struct request          *rq;            /* block request */
> -               struct rbd_obj_request  *obj_request;   /* obj req initiator */
> -       };
> -
> +       void                    *callback_data;
> +       rbd_img_request_cb_t    callback;
>         struct list_head        lock_item;
>         struct list_head        object_extents; /* obj_req.ex structs */
>         struct mutex            object_mutex;
> @@ -506,6 +506,8 @@ static ssize_t add_single_major_store(struct bus_type *bus, const char *buf,
>  static ssize_t remove_single_major_store(struct bus_type *bus, const char *buf,
>                                          size_t count);
>  static int rbd_dev_image_probe(struct rbd_device *rbd_dev, int depth);
> +static void rbd_img_end_child_request(struct rbd_img_request *img_req,
> +                                     int result);
>
>  static int rbd_dev_id_to_minor(int dev_id)
>  {
> @@ -2882,7 +2884,8 @@ static int rbd_obj_read_from_parent(struct rbd_obj_request *obj_req)
>                 return -ENOMEM;
>
>         __set_bit(IMG_REQ_CHILD, &child_img_req->flags);
> -       child_img_req->obj_request = obj_req;
> +       child_img_req->callback = rbd_img_end_child_request;
> +       child_img_req->callback_data = obj_req;
>
>         dout("%s child_img_req %p for obj_req %p\n", __func__, child_img_req,
>              obj_req);
> @@ -3506,14 +3509,12 @@ static bool __rbd_obj_handle_request(struct rbd_obj_request *obj_req,
>         return done;
>  }
>
> -/*
> - * This is open-coded in rbd_img_handle_request() to avoid parent chain
> - * recursion.
> - */
>  static void rbd_obj_handle_request(struct rbd_obj_request *obj_req, int result)
>  {
> -       if (__rbd_obj_handle_request(obj_req, &result))
> +       if (__rbd_obj_handle_request(obj_req, &result)) {
> +               /* Recurse into parent */
>                 rbd_img_handle_request(obj_req->img_request, result);
> +       }
>  }
>
>  static bool need_exclusive_lock(struct rbd_img_request *img_req)
> @@ -3695,26 +3696,29 @@ static bool __rbd_img_handle_request(struct rbd_img_request *img_req,
>         return done;
>  }
>
> -static void rbd_img_handle_request(struct rbd_img_request *img_req, int result)
> +static void rbd_img_end_child_request(struct rbd_img_request *img_req,
> +                                     int result)
>  {
> -again:
> -       if (!__rbd_img_handle_request(img_req, &result))
> -               return;
> +       struct rbd_obj_request *obj_req = img_req->callback_data;
>
> -       if (test_bit(IMG_REQ_CHILD, &img_req->flags)) {
> -               struct rbd_obj_request *obj_req = img_req->obj_request;
> +       rbd_img_request_put(img_req);
> +       rbd_obj_handle_request(obj_req, result);
> +}
>
> -               rbd_img_request_put(img_req);
> -               if (__rbd_obj_handle_request(obj_req, &result)) {
> -                       img_req = obj_req->img_request;
> -                       goto again;
> -               }
> -       } else {
> -               struct request *rq = img_req->rq;
> +static void rbd_img_end_request(struct rbd_img_request *img_req, int result)
> +{
> +       struct request *rq = img_req->callback_data;
>
> -               rbd_img_request_put(img_req);
> -               blk_mq_end_request(rq, errno_to_blk_status(result));
> -       }
> +       rbd_img_request_put(img_req);
> +       blk_mq_end_request(rq, errno_to_blk_status(result));
> +}
> +
> +void rbd_img_handle_request(struct rbd_img_request *img_req, int result)
> +{
> +       if (!__rbd_img_handle_request(img_req, &result))
> +               return;
> +
> +       img_req->callback(img_req, result);
>  }
>
>  static const struct rbd_client_id rbd_empty_cid;
> @@ -4840,7 +4844,8 @@ static void rbd_queue_workfn(struct work_struct *work)
>                 result = -ENOMEM;
>                 goto err_rq;
>         }
> -       img_request->rq = rq;
> +       img_request->callback = rbd_img_end_request;
> +       img_request->callback_data = rq;
>         snapc = NULL; /* img_request consumes a ref */
>
>         dout("%s rbd_dev %p img_req %p %s %llu~%llu\n", __func__, rbd_dev,

We walked away from callbacks to avoid the recursion described here:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6d69bb536bac0d403d83db1ca841444981b280cd

The title says "on rbd map", but it was on the I/O path as well.
The functions have changed, but the gist is the same: some object
request completes, which completes its image request, which completes
the object request in the child image, which completes the image
request in the child image, etc.
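
Roughly, the nesting with the callback approach would look like this
(just a sketch to illustrate the depth problem, with the function
names taken from this patch):

  rbd_img_handle_request(child_img_req, result)
    -> rbd_img_end_child_request(child_img_req, result)
       -> rbd_obj_handle_request(obj_req, result)
          -> rbd_img_handle_request(obj_req->img_request, result)
             -> ... one more level for every parent in the clone chain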

The plan is to get rid of RBD_MAX_PARENT_CHAIN_LEN after refactoring
header read-in code.  If rbd_img_handle_request() is confusing, let's
add a comment.  There already is one (removed in this patch), but it
is far away, so I can see how that logic may seem over-complicated.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 09/15] rbd: count pending object requests in-line
  2020-01-31 10:37 ` [PATCH 09/15] rbd: count pending object requests in-line Hannes Reinecke
@ 2020-02-03 17:47   ` Ilya Dryomov
  2020-02-04  6:59     ` Hannes Reinecke
  0 siblings, 1 reply; 25+ messages in thread
From: Ilya Dryomov @ 2020-02-03 17:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> Instead of having a counter for outstanding object requests, check
> the state and count only those which are not in the final state.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/rbd.c | 41 ++++++++++++++++++++++++++++++-----------
>  1 file changed, 30 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index b708f5ecda07..a6c95b6e9c0c 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -350,7 +350,7 @@ struct rbd_img_request {
>         struct mutex            object_mutex;
>
>         struct mutex            state_mutex;
> -       struct pending_result   pending;
> +       int                     pending_result;
>         struct work_struct      work;
>         int                     work_result;
>         struct kref             kref;
> @@ -3602,11 +3602,12 @@ static int rbd_img_exclusive_lock(struct rbd_img_request *img_req)
>         return 0;
>  }
>
> -static void rbd_img_object_requests(struct rbd_img_request *img_req)
> +static int rbd_img_object_requests(struct rbd_img_request *img_req)
>  {
>         struct rbd_obj_request *obj_req;
> +       int num_pending = 0;
>
> -       rbd_assert(!img_req->pending.result && !img_req->pending.num_pending);
> +       rbd_assert(!img_req->pending_result);
>
>         mutex_lock(&img_req->object_mutex);
>         for_each_obj_request(img_req, obj_req) {
> @@ -3617,15 +3618,33 @@ static void rbd_img_object_requests(struct rbd_img_request *img_req)
>                              __func__, obj_req, obj_req->img_request,
>                              img_req, result);
>                         if (result) {
> -                               img_req->pending.result = result;
> -                               mutex_unlock(&img_req->object_mutex);
> -                               return;
> +                               img_req->pending_result = result;
> +                               break;
>                         }
>                 } else {
> -                       img_req->pending.num_pending++;
> +                       num_pending++;
>                 }
>         }
>         mutex_unlock(&img_req->object_mutex);
> +       return num_pending;
> +}
> +
> +static int rbd_img_object_requests_pending(struct rbd_img_request *img_req)
> +{
> +       struct rbd_obj_request *obj_req;
> +       int num_pending = 0;
> +
> +       mutex_lock(&img_req->object_mutex);
> +       for_each_obj_request(img_req, obj_req) {
> +               if (obj_req->obj_state > 1)
> +                       num_pending++;
> +               else if (WARN_ON(obj_req->obj_state == 1))
> +                       num_pending++;
> +               else if (WARN_ON(obj_req->pending.num_pending))
> +                       num_pending++;
> +       }
> +       mutex_unlock(&img_req->object_mutex);
> +       return num_pending;
>  }
>
>  static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
> @@ -3658,16 +3677,16 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
>                            __rbd_is_lock_owner(rbd_dev));
>
>                 img_req->state = RBD_IMG_OBJECT_REQUESTS;
> -               rbd_img_object_requests(img_req);
> -               if (!img_req->pending.num_pending) {
> -                       *result = img_req->pending.result;
> +               if (!rbd_img_object_requests(img_req)) {
> +                       *result = img_req->pending_result;
>                         img_req->state = RBD_IMG_DONE;
>                         return true;
>                 }
>                 return false;
>         case RBD_IMG_OBJECT_REQUESTS:
> -               if (!pending_result_dec(&img_req->pending, result))
> +               if (rbd_img_object_requests_pending(img_req))
>                         return false;
> +               *result = img_req->pending_result;
>                 img_req->state = RBD_IMG_DONE;
>                 /* fall through */
>         case RBD_IMG_DONE:

This is just to be able to drop img_req->state_mutex in patch 11,
right?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 11/15] rbd: drop state_mutex in __rbd_img_handle_request()
  2020-01-31 10:37 ` [PATCH 11/15] rbd: drop state_mutex in __rbd_img_handle_request() Hannes Reinecke
@ 2020-02-03 18:01   ` Ilya Dryomov
  0 siblings, 0 replies; 25+ messages in thread
From: Ilya Dryomov @ 2020-02-03 18:01 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> The use of READ_ONCE/WRITE_ONCE for the image request state allows
> us to drop the state_mutex in __rbd_img_handle_request().
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/rbd.c | 26 +++++++++-----------------
>  1 file changed, 9 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 671e941d6edf..db04401c4d8b 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -349,7 +349,6 @@ struct rbd_img_request {
>         struct list_head        object_extents; /* obj_req.ex structs */
>         struct mutex            object_mutex;
>
> -       struct mutex            state_mutex;
>         int                     pending_result;
>         struct work_struct      work;
>         struct kref             kref;
> @@ -1674,7 +1673,6 @@ static struct rbd_img_request *rbd_img_request_create(
>
>         INIT_LIST_HEAD(&img_request->lock_item);
>         INIT_LIST_HEAD(&img_request->object_extents);
> -       mutex_init(&img_request->state_mutex);
>         mutex_init(&img_request->object_mutex);
>         kref_init(&img_request->kref);
>
> @@ -2529,7 +2527,7 @@ static int __rbd_img_fill_request(struct rbd_img_request *img_req)
>                 }
>         }
>         mutex_unlock(&img_req->object_mutex);
> -       img_req->state = RBD_IMG_START;
> +       WRITE_ONCE(img_req->state, RBD_IMG_START);
>         return 0;
>  }
>
> @@ -3652,15 +3650,15 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
>         int ret;
>
>         dout("%s: img %p state %d\n", __func__, img_req, img_req->state);
> -       switch (img_req->state) {
> +       switch (READ_ONCE(img_req->state)) {
>         case RBD_IMG_START:
>                 rbd_assert(!*result);
>
> -               img_req->state = RBD_IMG_EXCLUSIVE_LOCK;
> +               WRITE_ONCE(img_req->state, RBD_IMG_EXCLUSIVE_LOCK);
>                 ret = rbd_img_exclusive_lock(img_req);
>                 if (ret < 0) {
>                         *result = ret;
> -                       img_req->state = RBD_IMG_DONE;
> +                       WRITE_ONCE(img_req->state, RBD_IMG_DONE);
>                         return true;
>                 }
>                 if (ret == 0)
> @@ -3668,17 +3666,17 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
>                 /* fall through */
>         case RBD_IMG_EXCLUSIVE_LOCK:
>                 if (*result) {
> -                       img_req->state = RBD_IMG_DONE;
> +                       WRITE_ONCE(img_req->state, RBD_IMG_DONE);
>                         return true;
>                 }
>
>                 rbd_assert(!need_exclusive_lock(img_req) ||
>                            __rbd_is_lock_owner(rbd_dev));
>
> -               img_req->state = RBD_IMG_OBJECT_REQUESTS;
> +               WRITE_ONCE(img_req->state, RBD_IMG_OBJECT_REQUESTS);
>                 if (!rbd_img_object_requests(img_req)) {
>                         *result = img_req->pending_result;
> -                       img_req->state = RBD_IMG_DONE;
> +                       WRITE_ONCE(img_req->state, RBD_IMG_DONE);
>                         return true;
>                 }
>                 return false;
> @@ -3686,7 +3684,7 @@ static bool rbd_img_advance(struct rbd_img_request *img_req, int *result)
>                 if (rbd_img_object_requests_pending(img_req))
>                         return false;
>                 *result = img_req->pending_result;
> -               img_req->state = RBD_IMG_DONE;
> +               WRITE_ONCE(img_req->state, RBD_IMG_DONE);
>                 /* fall through */
>         case RBD_IMG_DONE:
>                 return true;
> @@ -3706,16 +3704,12 @@ static bool __rbd_img_handle_request(struct rbd_img_request *img_req,
>
>         if (need_exclusive_lock(img_req)) {
>                 down_read(&rbd_dev->lock_rwsem);
> -               mutex_lock(&img_req->state_mutex);
>                 done = rbd_img_advance(img_req, result);
>                 if (done)
>                         rbd_lock_del_request(img_req);
> -               mutex_unlock(&img_req->state_mutex);
>                 up_read(&rbd_dev->lock_rwsem);
>         } else {
> -               mutex_lock(&img_req->state_mutex);
>                 done = rbd_img_advance(img_req, result);
> -               mutex_unlock(&img_req->state_mutex);
>         }
>
>         if (done && *result) {
> @@ -3985,10 +3979,8 @@ static void wake_lock_waiters(struct rbd_device *rbd_dev, int result)
>         }
>
>         list_for_each_entry(img_req, &rbd_dev->acquiring_list, lock_item) {
> -               mutex_lock(&img_req->state_mutex);
> -               rbd_assert(img_req->state == RBD_IMG_EXCLUSIVE_LOCK);
> +               rbd_assert(READ_ONCE(img_req->state) == RBD_IMG_EXCLUSIVE_LOCK);
>                 rbd_img_schedule(img_req, result);
> -               mutex_unlock(&img_req->state_mutex);
>         }
>
>         list_splice_tail_init(&rbd_dev->acquiring_list, &rbd_dev->running_list);

->state_mutex doesn't just protect ->state or ->pending_result;
it is meant to be a code lock.  In the future, we will be adding
support for timeouts and forceful unmapping of rbd devices, which
means cancelling requests at arbitrary points.  These state machines
need to be reentrant, not just from the inside (i.e. object requests)
but also from the outside.  Getting that right when ->state is managed
through READ/WRITE_ONCE and must be carefully set before dispatching
anything that might change it is going to be very challenging.

In the cover letter, this patch is listed as one of the required
steps for up to 25% speedup.  Is that really the case?  It doesn't
make the top 30 contended locks in my tests...

Do you have the numbers without this and any of the preceding patches
or possibly just with patch 15?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 13/15] rbd: schedule image_request after preparation
  2020-01-31 10:37 ` [PATCH 13/15] rbd: schedule image_request after preparation Hannes Reinecke
@ 2020-02-03 18:40   ` Ilya Dryomov
  0 siblings, 0 replies; 25+ messages in thread
From: Ilya Dryomov @ 2020-02-03 18:40 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>
> Instead of pushing I/O directly to the workqueue, we should prepare
> it first and push it onto the workqueue as the last step. This
> allows us to signal some back-pressure to the block layer in case
> the queue fills up.

I assume what you mean is signalling BLK_STS_RESOURCE (i.e. ENOMEM),
not the queue-full condition, as that is handled intrinsically?
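
If so, a hypothetical sketch of that back-pressure in ->queue_rq()
(not something this patch does) would be reporting allocation failure
as a resource shortage instead of an I/O error:

	img_request = rbd_img_request_create(rbd_dev, op_type, snapc);
	if (!img_request)
		return BLK_STS_RESOURCE;  /* blk-mq requeues and retries */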

>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/block/rbd.c | 52 +++++++++++++++-------------------------------------
>  1 file changed, 15 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 2566d6bd8230..9829f225c57d 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4775,9 +4775,10 @@ static int rbd_obj_method_sync(struct rbd_device *rbd_dev,
>         return ret;
>  }
>
> -static void rbd_queue_workfn(struct work_struct *work)
> +static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> +               const struct blk_mq_queue_data *bd)
>  {
> -       struct request *rq = blk_mq_rq_from_pdu(work);
> +       struct request *rq = bd->rq;
>         struct rbd_device *rbd_dev = rq->q->queuedata;
>         struct rbd_img_request *img_request;
>         struct ceph_snap_context *snapc = NULL;
> @@ -4802,24 +4803,14 @@ static void rbd_queue_workfn(struct work_struct *work)
>                 break;
>         default:
>                 dout("%s: non-fs request type %d\n", __func__, req_op(rq));
> -               result = -EIO;
> -               goto err;
> -       }
> -
> -       /* Ignore/skip any zero-length requests */
> -
> -       if (!length) {
> -               dout("%s: zero-length request\n", __func__);
> -               result = 0;
> -               goto err_rq;
> +               return BLK_STS_IOERR;
>         }
>
>         if (op_type != OBJ_OP_READ) {
>                 if (rbd_is_ro(rbd_dev)) {
>                         rbd_warn(rbd_dev, "%s on read-only mapping",
>                                  obj_op_name(op_type));
> -                       result = -EIO;
> -                       goto err;
> +                       return BLK_STS_IOERR;
>                 }
>                 rbd_assert(!rbd_is_snap(rbd_dev));
>         }
> @@ -4827,11 +4818,17 @@ static void rbd_queue_workfn(struct work_struct *work)
>         if (offset && length > U64_MAX - offset + 1) {
>                 rbd_warn(rbd_dev, "bad request range (%llu~%llu)", offset,
>                          length);
> -               result = -EINVAL;
> -               goto err_rq;    /* Shouldn't happen */
> +               return BLK_STS_NOSPC;   /* Shouldn't happen */
>         }
>
>         blk_mq_start_request(rq);
> +       /* Ignore/skip any zero-length requests */
> +       if (!length) {
> +               dout("%s: zero-length request\n", __func__);
> +               result = 0;
> +               goto err;
> +       }
> +
>
>         mapping_size = READ_ONCE(rbd_dev->mapping.size);
>         if (op_type != OBJ_OP_READ) {
> @@ -4868,8 +4865,8 @@ static void rbd_queue_workfn(struct work_struct *work)
>         if (result)
>                 goto err_img_request;
>
> -       rbd_img_handle_request(img_request, 0);
> -       return;
> +       rbd_img_schedule(img_request, 0);
> +       return BLK_STS_OK;
>
>  err_img_request:
>         rbd_img_request_destroy(img_request);
> @@ -4880,15 +4877,6 @@ static void rbd_queue_workfn(struct work_struct *work)
>         ceph_put_snap_context(snapc);
>  err:
>         blk_mq_end_request(rq, errno_to_blk_status(result));
> -}
> -
> -static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
> -               const struct blk_mq_queue_data *bd)
> -{
> -       struct request *rq = bd->rq;
> -       struct work_struct *work = blk_mq_rq_to_pdu(rq);
> -
> -       queue_work(rbd_wq, work);
>         return BLK_STS_OK;
>  }
>
> @@ -5055,18 +5043,8 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>         return ret;
>  }
>
> -static int rbd_init_request(struct blk_mq_tag_set *set, struct request *rq,
> -               unsigned int hctx_idx, unsigned int numa_node)
> -{
> -       struct work_struct *work = blk_mq_rq_to_pdu(rq);
> -
> -       INIT_WORK(work, rbd_queue_workfn);
> -       return 0;
> -}
> -
>  static const struct blk_mq_ops rbd_mq_ops = {
>         .queue_rq       = rbd_queue_rq,
> -       .init_request   = rbd_init_request,
>  };
>
>  static int rbd_init_disk(struct rbd_device *rbd_dev)

Is .queue_rq allowed to block?  AFAIK it's not, or at least not unless
BLK_MQ_F_BLOCKING is specified, and I remember hearing about performance
issues with BLK_MQ_F_BLOCKING -- it is basically an offload to the
kblockd workqueue, with a single work item per hw queue.
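
For reference, a rough sketch of what allowing that would take (not
something I'm advocating) is advertising the blocking behaviour in the
tag set before blk_mq_alloc_tag_set():

	rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;

at the cost of every dispatch going through that per-hw-queue work item.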

We don't have any device-specific resources; the only thing we need is
memory, which we can't preallocate upfront because of too many
variable-sized pieces, both in rbd and in libceph.  Small GFP_NOIO
allocations don't really fail, so I wonder how important returning
something other than BLK_STS_OK is?

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 09/15] rbd: count pending object requests in-line
  2020-02-03 17:47   ` Ilya Dryomov
@ 2020-02-04  6:59     ` Hannes Reinecke
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-02-04  6:59 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On 2/3/20 6:47 PM, Ilya Dryomov wrote:
> On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>>
>> Instead of having a counter for outstanding object requests
>> check the state and count only those which are not in the final
>> state.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>  drivers/block/rbd.c | 41 ++++++++++++++++++++++++++++++-----------
>>  1 file changed, 30 insertions(+), 11 deletions(-)
>>
[ .. ]
> 
> This is just to be able to drop img_req->state_mutex in patch 11,
> right?
> 
correct.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size
  2020-02-03 16:50   ` Ilya Dryomov
@ 2020-02-04  7:05     ` Hannes Reinecke
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Reinecke @ 2020-02-04  7:05 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Sage Weil, Daniel Disseldorp, Jens Axboe, Ceph Development, linux-block

On 2/3/20 5:50 PM, Ilya Dryomov wrote:
> On Fri, Jan 31, 2020 at 11:38 AM Hannes Reinecke <hare@suse.de> wrote:
>>
>> The mapping size is changed only very infrequently, so we don't
>> need to take the header rwsem for checking; using READ_ONCE()
>> is sufficient here. And it avoids having to take a lock in the
>> hot path.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>  drivers/block/rbd.c | 12 ++++++------
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index db80b964d8ea..792180548e89 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -4788,13 +4788,13 @@ static void rbd_queue_workfn(struct work_struct *work)
>>
>>         blk_mq_start_request(rq);
>>
>> -       down_read(&rbd_dev->header_rwsem);
>> -       mapping_size = rbd_dev->mapping.size;
>> +       mapping_size = READ_ONCE(rbd_dev->mapping.size);
>>         if (op_type != OBJ_OP_READ) {
>> +               down_read(&rbd_dev->header_rwsem);
>>                 snapc = rbd_dev->header.snapc;
>>                 ceph_get_snap_context(snapc);
>> +               up_read(&rbd_dev->header_rwsem);
>>         }
>> -       up_read(&rbd_dev->header_rwsem);
>>
>>         if (offset + length > mapping_size) {
>>                 rbd_warn(rbd_dev, "beyond EOD (%llu~%llu > %llu)", offset,
>> @@ -4981,9 +4981,9 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>         u64 mapping_size;
>>         int ret;
>>
>> -       down_write(&rbd_dev->header_rwsem);
>> -       mapping_size = rbd_dev->mapping.size;
>> +       mapping_size = READ_ONCE(rbd_dev->mapping.size);
>>
>> +       down_write(&rbd_dev->header_rwsem);
>>         ret = rbd_dev_header_info(rbd_dev);
>>         if (ret)
>>                 goto out;
>> @@ -4999,7 +4999,7 @@ static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>         }
>>
>>         rbd_assert(!rbd_is_snap(rbd_dev));
>> -       rbd_dev->mapping.size = rbd_dev->header.image_size;
>> +       WRITE_ONCE(rbd_dev->mapping.size, rbd_dev->header.image_size);
>>
>>  out:
>>         up_write(&rbd_dev->header_rwsem);
> 
> Does this result in a measurable performance improvement?
> 
> I'd rather not go down the READ/WRITE_ONCE path and continue using
> proper locking, especially given that it's only for reads.  FWIW the
> plan is to replace header_rwsem with a spin lock, after refactoring
> header read-in code to use a private buffer instead of reading into
> rbd_dev directly.
> 
Well ... I'm not sure I like the spin_lock idea.
The thing is, the mapping size is evaluated exactly _once_ when
assembling the request, so any change to the mapping size just after
we've read it would go unnoticed.

Hence it should be possible to combine both approaches: use READ_ONCE()
to read the mapping size, but use a spin lock for updating it, as you
suggested. That way we eliminate a lock in the hot path but still get
safe updates.
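
Something like this (only a sketch; it assumes the header spin lock you
mentioned, so the lock name is made up):

	/* hot path, rbd_queue_workfn(): lockless read */
	mapping_size = READ_ONCE(rbd_dev->mapping.size);

	/* refresh path, rbd_dev_refresh(): updaters still serialize */
	spin_lock(&rbd_dev->header_lock);
	WRITE_ONCE(rbd_dev->mapping.size, rbd_dev->header.image_size);
	spin_unlock(&rbd_dev->header_lock);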

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-02-04  7:05 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-31 10:37 [PATCH 00/15] rbd: switch to blk-mq Hannes Reinecke
2020-01-31 10:37 ` [PATCH 01/15] rbd: lock object request list Hannes Reinecke
2020-02-03 16:38   ` Ilya Dryomov
2020-01-31 10:37 ` [PATCH 02/15] rbd: use READ_ONCE() when checking the mapping size Hannes Reinecke
2020-02-03 16:50   ` Ilya Dryomov
2020-02-04  7:05     ` Hannes Reinecke
2020-01-31 10:37 ` [PATCH 03/15] rbd: reorder rbd_img_advance() Hannes Reinecke
2020-01-31 10:37 ` [PATCH 04/15] rbd: reorder switch statement in rbd_advance_read() Hannes Reinecke
2020-01-31 10:37 ` [PATCH 05/15] rbd: reorder switch statement in rbd_advance_write() Hannes Reinecke
2020-01-31 10:37 ` [PATCH 06/15] rbd: add 'done' state for rbd_obj_advance_copyup() Hannes Reinecke
2020-01-31 10:37 ` [PATCH 07/15] rbd: use callback for image request completion Hannes Reinecke
2020-02-03 17:13   ` Ilya Dryomov
2020-01-31 10:37 ` [PATCH 08/15] rbd: add debugging statements for the state machine Hannes Reinecke
2020-01-31 10:37 ` [PATCH 09/15] rbd: count pending object requests in-line Hannes Reinecke
2020-02-03 17:47   ` Ilya Dryomov
2020-02-04  6:59     ` Hannes Reinecke
2020-01-31 10:37 ` [PATCH 10/15] rbd: kill 'work_result' Hannes Reinecke
2020-01-31 10:37 ` [PATCH 11/15] rbd: drop state_mutex in __rbd_img_handle_request() Hannes Reinecke
2020-02-03 18:01   ` Ilya Dryomov
2020-01-31 10:37 ` [PATCH 12/15] rbd: kill img_request kref Hannes Reinecke
2020-01-31 10:37 ` [PATCH 13/15] rbd: schedule image_request after preparation Hannes Reinecke
2020-02-03 18:40   ` Ilya Dryomov
2020-01-31 10:37 ` [PATCH 14/15] rbd: embed image request as blk_mq request payload Hannes Reinecke
2020-01-31 10:37 ` [PATCH 15/15] rbd: switch to blk-mq Hannes Reinecke
2020-02-03  8:36   ` Christoph Hellwig
