* [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Ming Lei @ 2019-11-15 10:42 UTC
  To: Jens Axboe
  Cc: Sagi Grimberg, James Smart, linux-nvme, Ming Lei, linux-block,
	Keith Busch, Christoph Hellwig

Hi,

Use blk_mq_alloc_request() for allocating NVMe loop, fc, rdma and tcp's
connect request, and select the transport queue at runtime for the
connect request.

Then kill blk_mq_alloc_request_hctx(). 


Ming Lei (3):
  block: reuse one scheduler/flush field for private request's data
  nvme: don't use blk_mq_alloc_request_hctx() for allocating connect
    request
  blk-mq: kill blk_mq_alloc_request_hctx()

 block/blk-mq.c             | 46 --------------------------------------
 drivers/nvme/host/core.c   |  9 +++-----
 drivers/nvme/host/fc.c     | 10 +++++++++
 drivers/nvme/host/rdma.c   | 40 ++++++++++++++++++++++++++++++---
 drivers/nvme/host/tcp.c    | 41 ++++++++++++++++++++++++++++++---
 drivers/nvme/target/loop.c | 42 +++++++++++++++++++++++++++++++---
 include/linux/blk-mq.h     |  3 ---
 include/linux/blkdev.h     |  6 ++++-
 8 files changed, 132 insertions(+), 65 deletions(-)

Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>

-- 
2.20.1



* [PATCH RFC 1/3] block: reuse one scheduler/flush field for private request's data
From: Ming Lei @ 2019-11-15 10:42 UTC
  To: Jens Axboe
  Cc: Sagi Grimberg, James Smart, linux-nvme, Ming Lei, linux-block,
	Keith Busch, Christoph Hellwig

Reuse the union shared by the elevator data and the flush fields for
storing a private request's data. A private request never enters the IO
scheduler and is never a flush request, so its data can safely share
this space.

The new field is used to pass driver-private information for private
requests only.

Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/blkdev.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6a4f7abbdcf7..ce708dc10bdc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -177,7 +177,8 @@ struct request {
 	 * Three pointers are available for the IO schedulers, if they need
 	 * more they have to dynamically allocate it.  Flush requests are
 	 * never put on the IO scheduler. So let the flush fields share
-	 * space with the elevator data.
+	 * space with the elevator data. Private requests are never put on
+	 * the IO scheduler and are never flush requests, either.
 	 */
 	union {
 		struct {
@@ -190,6 +191,9 @@ struct request {
 			struct list_head	list;
 			rq_end_io_fn		*saved_end_io;
 		} flush;
+
+		/* used for private request only */
+		unsigned long private_rq_data;
 	};
 
 	struct gendisk *rq_disk;
-- 
2.20.1



* [PATCH RFC 2/3] nvme: don't use blk_mq_alloc_request_hctx() for allocating connect request
From: Ming Lei @ 2019-11-15 10:42 UTC
  To: Jens Axboe
  Cc: Sagi Grimberg, James Smart, linux-nvme, Ming Lei, linux-block,
	Keith Busch, Christoph Hellwig

Use an NVMe-specific approach to choose the transport queue for a
connect request:

- the IO connect request uses the unique reserved tag 0
- store the queue id in req->private_rq_data in nvme_alloc_request()
- inside .queue_rq(), select the transport queue via the stored qid

The request's completion handler also needs to retrieve the block request
via .command_id and the transport queue instance. So store a cookie for
the connect request in rq->private_rq_data before submitting, and make
sure that the cookie is available in the completion handler; then we can
look up the request via the cookie and the transport queue.

With this approach, we don't need the weird API of
blk_mq_alloc_request_hctx() any more.

Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c   |  9 +++-----
 drivers/nvme/host/fc.c     | 10 +++++++++
 drivers/nvme/host/rdma.c   | 40 +++++++++++++++++++++++++++++++++---
 drivers/nvme/host/tcp.c    | 41 ++++++++++++++++++++++++++++++++++---
 drivers/nvme/target/loop.c | 42 +++++++++++++++++++++++++++++++++++---
 5 files changed, 127 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa7ba09dca77..e96e3997389b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -481,15 +481,12 @@ struct request *nvme_alloc_request(struct request_queue *q,
 	unsigned op = nvme_is_write(cmd) ? REQ_OP_DRV_OUT : REQ_OP_DRV_IN;
 	struct request *req;
 
-	if (qid == NVME_QID_ANY) {
-		req = blk_mq_alloc_request(q, op, flags);
-	} else {
-		req = blk_mq_alloc_request_hctx(q, op, flags,
-				qid ? qid - 1 : 0);
-	}
+	req = blk_mq_alloc_request(q, op, flags);
 	if (IS_ERR(req))
 		return req;
 
+	if (qid != NVME_QID_ANY)
+		req->private_rq_data = (unsigned long)qid;
 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
 	nvme_clear_nvme_request(req);
 	nvme_req(req)->cmd = cmd;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 265f89e11d8b..6c836e83e59c 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2333,6 +2333,16 @@ nvme_fc_queue_rq(struct blk_mq_hw_ctx *hctx,
 	u32 data_len;
 	blk_status_t ret;
 
+	/* the connect request must use its specified queue */
+	if (rq->tag == 0 && ctrl->ctrl.admin_q != hctx->queue) {
+		unsigned qid = rq->private_rq_data;
+
+		WARN_ON_ONCE(!blk_rq_is_private(rq));
+
+		queue = &ctrl->queues[qid];
+		op->queue = queue;
+	}
+
 	if (ctrl->rport->remoteport.port_state != FC_OBJSTATE_ONLINE ||
 	    !nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
 		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index cb4c3000a57e..6056679e8c77 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1436,13 +1436,35 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
 	WARN_ON_ONCE(ret);
 }
 
+static struct request *nvme_rdma_lookup_rq(struct nvme_rdma_queue *queue,
+		struct nvme_rdma_qe *qe, struct nvme_completion *cqe)
+{
+	int i;
+
+	if (cqe->command_id || !nvme_rdma_queue_idx(queue))
+		return blk_mq_tag_to_rq(nvme_rdma_tagset(queue),
+				cqe->command_id);
+
+	/* lookup request for connect IO */
+	for (i = 1; i < queue->ctrl->ctrl.queue_count; i++) {
+		struct request *rq = blk_mq_tag_to_rq(nvme_rdma_tagset(
+					&queue->ctrl->queues[i]), 0);
+
+		if (rq && rq->private_rq_data == (unsigned long)qe)
+			return rq;
+	}
+
+	return NULL;
+}
+
 static void nvme_rdma_process_nvme_rsp(struct nvme_rdma_queue *queue,
-		struct nvme_completion *cqe, struct ib_wc *wc)
+		struct nvme_rdma_qe *qe, struct ib_wc *wc)
 {
 	struct request *rq;
 	struct nvme_rdma_request *req;
+	struct nvme_completion *cqe = qe->data;
 
-	rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
+	rq = nvme_rdma_lookup_rq(queue, qe, cqe);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
 			"tag 0x%x on QP %#x not found\n",
@@ -1506,7 +1528,7 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
 				&cqe->result);
 	else
-		nvme_rdma_process_nvme_rsp(queue, cqe, wc);
+		nvme_rdma_process_nvme_rsp(queue, qe, wc);
 	ib_dma_sync_single_for_device(ibdev, qe->dma, len, DMA_FROM_DEVICE);
 
 	nvme_rdma_post_recv(queue, qe);
@@ -1743,6 +1765,18 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	WARN_ON_ONCE(rq->tag < 0);
 
+	/* the connect request must use its specified queue */
+	if (rq->tag == 0 && queue->ctrl->ctrl.admin_q != hctx->queue) {
+		unsigned qid = rq->private_rq_data;
+
+		WARN_ON_ONCE(!blk_rq_is_private(rq));
+
+		queue = &queue->ctrl->queues[qid];
+		req->queue = queue;
+
+		rq->private_rq_data = (unsigned long)sqe;
+	}
+
 	if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
 		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 770dbcbc999e..5087e2d168f1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -425,12 +425,35 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 	queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->err_work);
 }
 
+static struct request *nvme_tcp_lookup_rq(struct nvme_tcp_queue *queue,
+		struct nvme_tcp_rsp_pdu *pdu, struct nvme_completion *cqe)
+{
+	int i;
+
+	if (cqe->command_id || !nvme_tcp_queue_id(queue))
+		return blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
+				cqe->command_id);
+
+	/* lookup request for connect IO */
+	for (i = 1; i < queue->ctrl->ctrl.queue_count; i++) {
+		struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(
+					&queue->ctrl->queues[i]), 0);
+
+		if (rq && rq->private_rq_data == (unsigned long)pdu)
+			return rq;
+	}
+
+	return NULL;
+}
+
+
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
-		struct nvme_completion *cqe)
+		struct nvme_tcp_rsp_pdu *pdu)
 {
+	struct nvme_completion *cqe = &pdu->cqe;
 	struct request *rq;
 
-	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
+	rq = nvme_tcp_lookup_rq(queue, pdu, cqe);
 	if (!rq) {
 		dev_err(queue->ctrl->ctrl.device,
 			"queue %d tag 0x%x not found\n",
@@ -496,7 +519,7 @@ static int nvme_tcp_handle_comp(struct nvme_tcp_queue *queue,
 		nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
 				&cqe->result);
 	else
-		ret = nvme_tcp_process_nvme_cqe(queue, cqe);
+		ret = nvme_tcp_process_nvme_cqe(queue, pdu);
 
 	return ret;
 }
@@ -2155,6 +2178,18 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
 	blk_status_t ret;
 
+	/* the connect request must use its specified queue */
+	if (rq->tag == 0 && queue->ctrl->ctrl.admin_q != hctx->queue) {
+		unsigned qid = rq->private_rq_data;
+
+		WARN_ON_ONCE(!blk_rq_is_private(rq));
+
+		queue = &queue->ctrl->queues[qid];
+		req->queue = queue;
+
+		rq->private_rq_data = (unsigned long)queue->pdu;
+	}
+
 	if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
 		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);
 
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 11f5aea97d1b..d97c3155d86a 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -90,6 +90,28 @@ static struct blk_mq_tags *nvme_loop_tagset(struct nvme_loop_queue *queue)
 	return queue->ctrl->tag_set.tags[queue_idx - 1];
 }
 
+static struct request *nvme_loop_lookup_rq(struct nvme_loop_queue *queue,
+		struct nvmet_req *req)
+{
+	struct nvme_completion *cqe = req->cqe;
+	int i;
+
+	if (cqe->command_id || !nvme_loop_queue_idx(queue))
+		return blk_mq_tag_to_rq(nvme_loop_tagset(queue),
+				cqe->command_id);
+
+	/* lookup request for connect IO */
+	for (i = 1; i < queue->ctrl->ctrl.queue_count; i++) {
+		struct request *rq = blk_mq_tag_to_rq(nvme_loop_tagset(
+					&queue->ctrl->queues[i]), 0);
+
+		if (rq && rq->private_rq_data == (unsigned long)req)
+			return rq;
+	}
+
+	return NULL;
+}
+
 static void nvme_loop_queue_response(struct nvmet_req *req)
 {
 	struct nvme_loop_queue *queue =
@@ -107,9 +129,7 @@ static void nvme_loop_queue_response(struct nvmet_req *req)
 		nvme_complete_async_event(&queue->ctrl->ctrl, cqe->status,
 				&cqe->result);
 	} else {
-		struct request *rq;
-
-		rq = blk_mq_tag_to_rq(nvme_loop_tagset(queue), cqe->command_id);
+		struct request *rq = nvme_loop_lookup_rq(queue, req);
 		if (!rq) {
 			dev_err(queue->ctrl->ctrl.device,
 				"tag 0x%x on queue %d not found\n",
@@ -139,6 +159,22 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	bool queue_ready = test_bit(NVME_LOOP_Q_LIVE, &queue->flags);
 	blk_status_t ret;
 
+	/* the connect request must use its specified queue */
+	if (req->tag == 0) {
+		struct nvme_loop_ctrl *ctrl = hctx->queue->tag_set->driver_data;
+
+		if (ctrl->ctrl.admin_q != hctx->queue) {
+			unsigned qid = req->private_rq_data;
+
+			WARN_ON_ONCE(!blk_rq_is_private(req));
+
+			queue = &ctrl->queues[qid];
+			iod->queue = queue;
+
+			req->private_rq_data = (unsigned long)&iod->req;
+		}
+	}
+
 	if (!nvmf_check_ready(&queue->ctrl->ctrl, req, queue_ready))
 		return nvmf_fail_nonready_command(&queue->ctrl->ctrl, req);
 
-- 
2.20.1



* [PATCH RFC 3/3] blk-mq: kill blk_mq_alloc_request_hctx()
From: Ming Lei @ 2019-11-15 10:42 UTC
  To: Jens Axboe
  Cc: Sagi Grimberg, James Smart, linux-nvme, Ming Lei, linux-block,
	Keith Busch, Christoph Hellwig

blk-mq uses a static mapping between sw queues and hw queues to retrieve
the hw queue, then allocates the request from the retrieved hw queue's
request pool.

blk_mq_alloc_request_hctx() requires the caller to specify one hctx index
and asks blk-mq to allocate the request from this hctx's request pool.
This is quite fragile given that the hctx can become inactive at any time
because of CPU hotplug. Kernel oopses on NVMe FC/LOOP/RDMA/TCP have been
reported several times because of CPU hotplug.

The only users are the NVMe loop, FC, RDMA and TCP drivers, which use it
for submitting the connect command. All of these drivers have been
converted to the generic blk_mq_alloc_request() API for allocating the
connect command's request.

So kill it now.

Cc: James Smart <james.smart@broadcom.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 46 ------------------------------------------
 include/linux/blk-mq.h |  3 ---
 2 files changed, 49 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5c9adcaa27ac..a360fe70ec98 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -427,52 +427,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 }
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
-struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
-	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
-{
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
-	struct request *rq;
-	unsigned int cpu;
-	int ret;
-
-	/*
-	 * If the tag allocator sleeps we could get an allocation for a
-	 * different hardware context.  No need to complicate the low level
-	 * allocator for this for the rare use case of a command tied to
-	 * a specific queue.
-	 */
-	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
-		return ERR_PTR(-EINVAL);
-
-	if (hctx_idx >= q->nr_hw_queues)
-		return ERR_PTR(-EIO);
-
-	ret = blk_queue_enter(q, flags);
-	if (ret)
-		return ERR_PTR(ret);
-
-	/*
-	 * Check if the hardware context is actually mapped to anything.
-	 * If not tell the caller that it should skip this queue.
-	 */
-	alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
-	if (!blk_mq_hw_queue_mapped(alloc_data.hctx)) {
-		blk_queue_exit(q);
-		return ERR_PTR(-EXDEV);
-	}
-	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
-	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
-
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
-	blk_queue_exit(q);
-
-	if (!rq)
-		return ERR_PTR(-EWOULDBLOCK);
-
-	return rq;
-}
-EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
-
 static void __blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dc03e059fdff..a0c65de93e8c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -441,9 +441,6 @@ enum {
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags);
-struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
-		unsigned int op, blk_mq_req_flags_t flags,
-		unsigned int hctx_idx);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
 
 enum {
-- 
2.20.1



* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Sagi Grimberg @ 2019-11-15 22:38 UTC
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Keith Busch, Christoph Hellwig, linux-nvme, James Smart


> Hi,

Hey Ming,

> Use blk_mq_alloc_request() for allocating NVMe loop, fc, rdma and tcp's
> connect request, and select the transport queue at runtime for the
> connect request.
> 
> Then kill blk_mq_alloc_request_hctx().

Is it really so wrong to have an API to allocate a tag that belongs to
a specific queue? Why must the tag allocation always correlate to the
running cpu? It's true that NVMe is the only consumer of this at the
moment, but does this mean that the interface should be removed because
it has one (or rather four) consumer(s)?

I would instead suggest simply removing the constraint that
blk_mq_alloc_request_hctx() will fail if the first cpu in the mask
is not on the cpu_online_mask. The caller of this would know and
be able to handle it.

To me it feels like this approach is fundamentally wrong. IMO, having
the driver select a different queue than the one the tag naturally
belongs to feels like a backwards design.


* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Ming Lei @ 2019-11-16  7:17 UTC
  To: Sagi Grimberg
  Cc: Jens Axboe, James Smart, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig

Hi Sagi,

On Fri, Nov 15, 2019 at 02:38:44PM -0800, Sagi Grimberg wrote:
> 
> > Hi,
> 
> Hey Ming,
> 
> > Use blk_mq_alloc_request() for allocating NVMe loop, fc, rdma and tcp's
> > connect request, and select the transport queue at runtime for the
> > connect request.
> > 
> > Then kill blk_mq_alloc_request_hctx().
> 
> Is it really so wrong to have an API to allocate a tag that belongs to
> a specific queue? Why must the tag allocation always correlate to the
> running cpu? It's true that NVMe is the only consumer of this at the
> moment, but does this mean that the interface should be removed because
> it has one (or rather four) consumer(s)?

blk-mq now uses a static queue mapping between CPUs and hw queues, and
CPU hotplug may happen at any time, so the specified hw queue may become
inactive at any time.

The queue mapping from CPU to hw queue is a core model of blk-mq: whether
a hw queue is active depends on the request's submission CPU, and we can
always retrieve an active hw queue as long as at least one CPU is online.

An IO request is always mapped to the proper hw queue via the submission
CPU and the queue type.
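
For reference, the mapping done for a normal request in
blk_mq_get_request() looks roughly like this (simplified sketch, error
handling and details elided):

	/* submission CPU -> per-cpu sw ctx -> statically mapped hw queue */
	data->ctx = blk_mq_get_ctx(q);
	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);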

So blk_mq_alloc_request_hctx() is really an outlier in the above blk-mq
model.

Actually the four consumers represent a single type of usage, submitting
the connect command, and it seems nobody has explained this special usage
before. The drivers can handle it well enough without this interface,
just as this patchset shows, can't they?

> 
> I would instead suggest simply removing the constraint that
> blk_mq_alloc_request_hctx() will fail if the first cpu in the mask
> is not on the cpu_online_mask. The caller of this would know and
> be able to handle it.

Of course, this usage is very special, which is different from normal
IO or passthrough requests.

The caller of this still needs to rely on blk-mq for dispatching this
request:

1) blk-mq needs to run the hw queue in round-robin style, and a different
CPU is selected from the active CPU mask for running the hw queue.

2) Most blk-mq drivers have switched to managed IRQs, which may be
shut down even though there are still in-flight requests not completed
on the hw queue. We need to fix this issue. One solution is to free
the request and remap the bios onto a proper active hw queue according to
the new submission CPU.

3) a warning is triggered when dispatching a request on an inactive hw queue

If we are going to support this special usage, lots of blk-mq code needs
to be changed to cover such a special case.

> 
> To me it feels like this approach is fundamentally wrong. IMO, having
> the driver select a different queue than the one the tag naturally
> belongs to feels like a backwards design.

The root cause is that the connect command needs to be submitted via
one specified queue, which is fundamentally not compatible with blk-mq's
queue mapping model.

Given the connect command is so special for nvme, I'd suggest letting the
driver handle it completely, since blk-mq is supposed to serve generic IO.


Thanks,
Ming



* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Bart Van Assche @ 2019-11-17  1:24 UTC
  To: Ming Lei, Sagi Grimberg
  Cc: Jens Axboe, James Smart, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig

On 2019-11-15 23:17, Ming Lei wrote:
> blk-mq now uses a static queue mapping between CPUs and hw queues, and
> CPU hotplug may happen at any time, so the specified hw queue may become
> inactive at any time.

Hi Ming,

I can trigger a race between blk_mq_alloc_request_hctx() and
CPU-hotplugging by running blktests. The patch below fixes that race
on my setup. Does this patch also fix the race(s) that you ran into?

Thanks,

Bart.


Subject: [PATCH] blk-mq: Fix a race between blk_mq_alloc_request_hctx() and
 CPU hot-plugging

---
 block/blk-mq.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 20a71dcdc339..16057aa2307f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -442,13 +442,15 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
 		return ERR_PTR(-EINVAL);

-	if (hctx_idx >= q->nr_hw_queues)
-		return ERR_PTR(-EIO);
-
 	ret = blk_queue_enter(q, flags);
 	if (ret)
 		return ERR_PTR(ret);

+	if (hctx_idx >= q->nr_hw_queues) {
+		blk_queue_exit(q);
+		return ERR_PTR(-EIO);
+	}
+
 	/*
 	 * Check if the hardware context is actually mapped to anything.
 	 * If not tell the caller that it should skip this queue.


* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Ming Lei @ 2019-11-17  4:12 UTC
  To: Bart Van Assche
  Cc: Jens Axboe, Sagi Grimberg, James Smart, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On Sat, Nov 16, 2019 at 05:24:05PM -0800, Bart Van Assche wrote:
> On 2019-11-15 23:17, Ming Lei wrote:
> > blk-mq now uses a static queue mapping between CPUs and hw queues, and
> > CPU hotplug may happen at any time, so the specified hw queue may become
> > inactive at any time.
> 
> Hi Ming,
> 
> I can trigger a race between blk_mq_alloc_request_hctx() and
> CPU-hotplugging by running blktests. The patch below fixes that race
> on my setup. Does this patch also fix the race(s) that you ran into?

The following problem has been triggered in my regular tests for years;
is it the same as yours?

[ 2248.751675] nvme nvme1: creating 2 I/O queues.
[ 2248.752351] BUG: unable to handle page fault for address: 0000607d064434a8
[ 2248.753348] #PF: supervisor write access in kernel mode
[ 2248.754106] #PF: error_code(0x0002) - not-present page
[ 2248.754846] PGD 0 P4D 0
[ 2248.755230] Oops: 0002 [#1] PREEMPT SMP PTI
[ 2248.755838] CPU: 7 PID: 16293 Comm: kworker/u18:3 Not tainted 5.4.0-rc7_96b95eff4a59_master+ #1
[ 2248.757089] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
[ 2248.758863] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
[ 2248.759857] RIP: 0010:blk_mq_get_request+0x2a8/0x31c
[ 2248.760654] Code: c7 83 08 01 00 00 00 00 00 00 48 c7 83 10 01 00 00 00 00 00 00 48 8b 55 18 45 84 ed 74 0c 31 c0 41 81 e5 00 08 06 00 0f 95 c0 <48> ff 44 c2 68 c7 83 d4 00 00 00 01 00 00 00 f7 45 10 00 00 06 00
[ 2248.763375] RSP: 0018:ffffc900012dbc80 EFLAGS: 00010246
[ 2248.764130] RAX: 0000000000000000 RBX: ffff888170d70000 RCX: 0000000000000017
[ 2248.765156] RDX: 0000607d06443440 RSI: 0000020bb36c554e RDI: 0000020bb3837c3f
[ 2248.766034] RBP: ffffc900012dbcc0 R08: 00000000f461df07 R09: 00000000000000a8
[ 2248.767084] R10: ffffc900012dbe50 R11: 0000000000000002 R12: 0000000000000000
[ 2248.768109] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 2248.769134] FS:  0000000000000000(0000) GS:ffff88827bd80000(0000) knlGS:0000000000000000
[ 2248.770294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2248.771125] CR2: 0000607d064434a8 CR3: 0000000272866001 CR4: 0000000000760ee0
[ 2248.772152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2248.773179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2248.774204] PKRU: 55555554
[ 2248.774603] Call Trace:
[ 2248.774983]  blk_mq_alloc_request_hctx+0xc5/0x10e
[ 2248.775674]  nvme_alloc_request+0x42/0x71
[ 2248.776263]  __nvme_submit_sync_cmd+0x49/0x1b2
[ 2248.776910]  nvmf_connect_io_queue+0x12c/0x195 [nvme_fabrics]
[ 2248.777663]  ? nvme_loop_connect_io_queues+0x2f/0x54 [nvme_loop]
[ 2248.778481]  nvme_loop_connect_io_queues+0x2f/0x54 [nvme_loop]
[ 2248.779325]  nvme_loop_reset_ctrl_work+0x62/0xd4 [nvme_loop]
[ 2248.780144]  process_one_work+0x1a8/0x2a1
[ 2248.780727]  ? process_scheduled_works+0x2c/0x2c
[ 2248.781398]  process_scheduled_works+0x27/0x2c
[ 2248.782046]  worker_thread+0x1b1/0x23f
[ 2248.782594]  kthread+0xf5/0xfa
[ 2248.783048]  ? kthread_unpark+0x62/0x62
[ 2248.783608]  ret_from_fork+0x35/0x40

> 
> Thanks,
> 
> Bart.
> 
> 
> Subject: [PATCH] blk-mq: Fix a race between blk_mq_alloc_request_hctx() and
>  CPU hot-plugging
> 
> ---
>  block/blk-mq.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 20a71dcdc339..16057aa2307f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -442,13 +442,15 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
>  		return ERR_PTR(-EINVAL);
> 
> -	if (hctx_idx >= q->nr_hw_queues)
> -		return ERR_PTR(-EIO);
> -
>  	ret = blk_queue_enter(q, flags);
>  	if (ret)
>  		return ERR_PTR(ret);
> 
> +	if (hctx_idx >= q->nr_hw_queues) {
> +		blk_queue_exit(q);
> +		return ERR_PTR(-EIO);
> +	}
> +

Not sure how this patch can make a difference since blk_queue_enter()
never checks whether the hctx is active. The problem is that the hctx
represented by 'hctx_idx' becomes inactive while blk_mq_alloc_request_hctx()
is running (all CPUs of this hctx become offline).

The problem simply is in the following code:

        cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
        alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

        rq = blk_mq_get_request(q, NULL, &alloc_data);
        blk_queue_exit(q);

'cpu' becomes 'nr_cpu_ids', and then a kernel oops is triggered in
blk_mq_get_request().
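
An untested sketch of a defensive check: cpumask_first_and() returns
nr_cpu_ids when the intersection is empty, so the allocation could at
least fail gracefully instead of oopsing:

	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	if (cpu >= nr_cpu_ids) {
		blk_queue_exit(q);
		return ERR_PTR(-EXDEV);
	}
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

But that only fails the allocation gracefully; it doesn't address the
dispatch side at all.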


Thanks,
Ming



* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Bart Van Assche @ 2019-11-18 23:27 UTC
  To: Ming Lei
  Cc: Jens Axboe, Sagi Grimberg, James Smart, linux-nvme, linux-block,
	Keith Busch, Christoph Hellwig

On 11/16/19 8:12 PM, Ming Lei wrote:
> On Sat, Nov 16, 2019 at 05:24:05PM -0800, Bart Van Assche wrote:
>> On 2019-11-15 23:17, Ming Lei wrote:
>>> blk-mq now uses a static queue mapping between CPUs and hw queues, and
>>> CPU hotplug may happen at any time, so the specified hw queue may become
>>> inactive at any time.
>>
>> I can trigger a race between blk_mq_alloc_request_hctx() and
>> CPU-hotplugging by running blktests. The patch below fixes that race
>> on my setup. Does this patch also fix the race(s) that you ran into?
> 
> The following problem has been triggered in my regular tests for years;
> is it the same as yours?
> 
> [ 2248.751675] nvme nvme1: creating 2 I/O queues.
> [ 2248.752351] BUG: unable to handle page fault for address: 0000607d064434a8
> [ 2248.753348] #PF: supervisor write access in kernel mode
> [ 2248.754106] #PF: error_code(0x0002) - not-present page
> [ 2248.754846] PGD 0 P4D 0
> [ 2248.755230] Oops: 0002 [#1] PREEMPT SMP PTI
> [ 2248.755838] CPU: 7 PID: 16293 Comm: kworker/u18:3 Not tainted 5.4.0-rc7_96b95eff4a59_master+ #1
> [ 2248.757089] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> [ 2248.758863] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
> [ 2248.759857] RIP: 0010:blk_mq_get_request+0x2a8/0x31c
> [ 2248.760654] Code: c7 83 08 01 00 00 00 00 00 00 48 c7 83 10 01 00 00 00 00 00 00 48 8b 55 18 45 84 ed 74 0c 31 c0 41 81 e5 00 08 06 00 0f 95 c0 <48> ff 44 c2 68 c7 83 d4 00 00 00 01 00 00 00 f7 45 10 00 00 06 00
> [ 2248.763375] RSP: 0018:ffffc900012dbc80 EFLAGS: 00010246
> [ 2248.764130] RAX: 0000000000000000 RBX: ffff888170d70000 RCX: 0000000000000017
> [ 2248.765156] RDX: 0000607d06443440 RSI: 0000020bb36c554e RDI: 0000020bb3837c3f
> [ 2248.766034] RBP: ffffc900012dbcc0 R08: 00000000f461df07 R09: 00000000000000a8
> [ 2248.767084] R10: ffffc900012dbe50 R11: 0000000000000002 R12: 0000000000000000
> [ 2248.768109] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 2248.769134] FS:  0000000000000000(0000) GS:ffff88827bd80000(0000) knlGS:0000000000000000
> [ 2248.770294] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 2248.771125] CR2: 0000607d064434a8 CR3: 0000000272866001 CR4: 0000000000760ee0
> [ 2248.772152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2248.773179] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 2248.774204] PKRU: 55555554
> [ 2248.774603] Call Trace:
> [ 2248.774983]  blk_mq_alloc_request_hctx+0xc5/0x10e
> [ 2248.775674]  nvme_alloc_request+0x42/0x71
> [ 2248.776263]  __nvme_submit_sync_cmd+0x49/0x1b2
> [ 2248.776910]  nvmf_connect_io_queue+0x12c/0x195 [nvme_fabrics]
> [ 2248.777663]  ? nvme_loop_connect_io_queues+0x2f/0x54 [nvme_loop]
> [ 2248.778481]  nvme_loop_connect_io_queues+0x2f/0x54 [nvme_loop]
> [ 2248.779325]  nvme_loop_reset_ctrl_work+0x62/0xd4 [nvme_loop]
> [ 2248.780144]  process_one_work+0x1a8/0x2a1
> [ 2248.780727]  ? process_scheduled_works+0x2c/0x2c
> [ 2248.781398]  process_scheduled_works+0x27/0x2c
> [ 2248.782046]  worker_thread+0x1b1/0x23f
> [ 2248.782594]  kthread+0xf5/0xfa
> [ 2248.783048]  ? kthread_unpark+0x62/0x62
> [ 2248.783608]  ret_from_fork+0x35/0x40

Hi Ming,

Thanks for sharing this call stack. What made me look at 
blk_mq_alloc_request_hctx() is a different issue, namely the following 
kernel warnings, reported from inside blk_mq_run_hw_queue():

WARNING: CPU: 0 PID: 6123 at include/linux/cpumask.h:137 cpumask_next_and+0x16/0x30
WARNING: CPU: 0 PID: 323 at include/linux/cpumask.h:137 __blk_mq_run_hw_queue+0x152/0x1b0

Bart.


* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Sagi Grimberg @ 2019-11-19  0:05 UTC
  To: Ming Lei
  Cc: Jens Axboe, James Smart, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig


> Hi Sagi,
> 
> On Fri, Nov 15, 2019 at 02:38:44PM -0800, Sagi Grimberg wrote:
>>
>>> Hi,
>>
>> Hey Ming,
>>
>>> Use blk_mq_alloc_request() for allocating NVMe loop, fc, rdma and tcp's
>>> connect request, and select the transport queue at runtime for the
>>> connect request.
>>>
>>> Then kill blk_mq_alloc_request_hctx().
>>
>> Is it really so wrong to have an API to allocate a tag that belongs to
>> a specific queue? Why must the tag allocation always correlate to the
>> running cpu? It's true that NVMe is the only consumer of this at the
>> moment, but does this mean that the interface should be removed because
>> it has one (or rather four) consumer(s)?
> 
> blk-mq now uses a static queue mapping between CPUs and hw queues, and
> CPU hotplug may happen at any time, so the specified hw queue may become
> inactive at any time.
> 
> The queue mapping from CPU to hw queue is a core model of blk-mq: whether
> a hw queue is active depends on the request's submission CPU, and we can
> always retrieve an active hw queue as long as at least one CPU is online.
> 
> An IO request is always mapped to the proper hw queue via the submission
> CPU and the queue type.
> 
> So blk_mq_alloc_request_hctx() is really an outlier in the above blk-mq
> model.
> 
> Actually the four consumers represent a single type of usage, submitting
> the connect command, and it seems nobody has explained this special usage
> before. The drivers can handle it well enough without this interface,
> just as this patchset shows, can't they?

Does removing the cpumask_and with cpu_online_mask fix your test?

This check is wrong to begin with because it may no longer hold right
after the check.
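
i.e. something like this (untested, against the body removed in patch
3/3):

-	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	cpu = cpumask_first(alloc_data.hctx->cpumask);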

This is a much simpler fix that does not create this churn local to
every driver. Also, I don't like the assumptions about tag reservations
that the drivers are making locally (that the connect will have tag 0,
for example). All this makes this look like a hack.

There is also the question of what happens when we want to make connects
parallel, which is not the case at the moment...

>> I would instead suggest simply removing the constraint that
>> blk_mq_alloc_request_hctx() will fail if the first cpu in the mask
>> is not on the cpu_online_mask. The caller of this would know and
>> be able to handle it.
> 
> Of course, this usage is very special, which is different from normal
> IO or passthrough requests.
> 
> The caller of this still needs to rely on blk-mq for dispatching this
> request:
> 
> 1) blk-mq needs to run the hw queue in round-robin style, and a different
> CPU is selected from the active CPU mask for running the hw queue.
> 
> 2) Most blk-mq drivers have switched to managed IRQs, which may be
> shut down even though there are still in-flight requests not completed
> on the hw queue. We need to fix this issue. One solution is to free
> the request and remap the bios onto a proper active hw queue according to
> the new submission CPU.
> 
> 3) a warning is triggered when dispatching a request on an inactive hw queue
> 
> If we are going to support this special usage, lots of blk-mq code needs
> to be changed to cover such a special case.

I'm starting to think we maybe need to get the connect out of the block
layer execution if it's such a big problem... It's a real shame if that is
the case...

>> To me it feels like this approach is fundamentally wrong. IMO, having
>> the driver select a different queue than the one the tag naturally
>> belongs to feels like a backwards design.
> 
> The root cause is that the connect command needs to be submitted via
> one specified queue, which is fundamentally not compatible with blk-mq's
> queue mapping model.
> 
> Given the connect command is so special for nvme, I'd suggest letting the
> driver handle it completely, since blk-mq is supposed to serve generic IO.

It's one thing to handle it locally; hacking queue_rq to queue on a
different queue than what was determined by the block layer is another...
If this model is fundamentally broken with how the block layer dispatches
requests, then we need a different solution.


* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Keith Busch @ 2019-11-19  0:34 UTC
  To: Sagi Grimberg
  Cc: Jens Axboe, James Smart, linux-nvme, Ming Lei, linux-block,
	Christoph Hellwig

On Mon, Nov 18, 2019 at 04:05:56PM -0800, Sagi Grimberg wrote:
> 
> I'm starting to think we maybe need to get the connect out of the block
> layer execution if it's such a big problem... It's a real shame if that is
> the case...

We still need timeout handling for connect commands, so bypassing the
block layer will need to figure out some other way to handle that.

This patch proposal doesn't really handle the timeouts very well either,
though: nvme_rdma_timeout() is going to end up referencing the wrong
queue rather than the one the request was submitted on. It doesn't
appear to really matter in the current code since it just resets the
entire controller, but if it ever wanted to do something queue specific...


* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Sagi Grimberg @ 2019-11-19  1:43 UTC
  To: Keith Busch
  Cc: Jens Axboe, James Smart, linux-nvme, Ming Lei, linux-block,
	Christoph Hellwig


>> I'm starting to think we maybe need to get the connect out of the block
>> layer execution if it's such a big problem... It's a real shame if that is
>> the case...
> 
> We still need timeout handling for connect commands, so bypassing the
> block layer will need to figure out some other way to handle that.

Which is why I said it'd be a shame :)


* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Ming Lei @ 2019-11-19  2:33 UTC
  To: Sagi Grimberg
  Cc: Jens Axboe, James Smart, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig

On Mon, Nov 18, 2019 at 04:05:56PM -0800, Sagi Grimberg wrote:
> 
> > Hi Sagi,
> > 
> > On Fri, Nov 15, 2019 at 02:38:44PM -0800, Sagi Grimberg wrote:
> > > 
> > > > Hi,
> > > 
> > > Hey Ming,
> > > 
> > > > Use blk_mq_alloc_request() for allocating NVMe loop, fc, rdma and tcp's
> > > > connect request, and select the transport queue at runtime for the
> > > > connect request.
> > > > 
> > > > Then kill blk_mq_alloc_request_hctx().
> > > 
> > > Is it really so wrong to have an API to allocate a tag that belongs to
> > > a specific queue? Why must the tag allocation always correlate to the
> > > running cpu? It's true that NVMe is the only consumer of this at the
> > > moment, but does this mean that the interface should be removed because
> > > it has one (or rather four) consumer(s)?
> > 
> > blk-mq now uses a static queue mapping between CPUs and hw queues, and
> > CPU hotplug may happen at any time, so the specified hw queue may become
> > inactive at any time.
> > 
> > The queue mapping from CPU to hw queue is a core model of blk-mq: whether
> > a hw queue is active depends on the request's submission CPU, and we can
> > always retrieve an active hw queue as long as at least one CPU is online.
> > 
> > An IO request is always mapped to the proper hw queue via the submission
> > CPU and the queue type.
> > 
> > So blk_mq_alloc_request_hctx() is really an outlier in the above blk-mq
> > model.
> > 
> > Actually the four consumers represent a single type of usage, submitting
> > the connect command, and it seems nobody has explained this special usage
> > before. The drivers can handle it well enough without this interface,
> > just as this patchset shows, can't they?
> 
> Does removing the cpumask_and with cpu_online_mask fix your test?

It can be worked around this way, or by returning NULL if the hctx
becomes inactive.

But there is a big problem with dispatching such a request to an inactive
hctx, as I explained.

> 
> This check is wrong to begin with because it may no longer hold right
> after the check.
> 
> This is a much simpler fix that does not create this churn local to
> every driver. Also, I don't like the assumptions about tag reservations
> that the drivers are making locally (that the connect will have tag 0,
> for example). All this makes this look like a hack.

The patch I posted can be applied to non-reserved tags too, and the
connect request can simply be identified via rq->private_rq_data.

Also, we can provide a blk_mq_rq_is_reserved() helper if you think
'rq->tag == 0' looks like a hack.
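
One possible shape for such a helper, assuming reserved tags keep
occupying the low end of the tag space as they do today (untested
sketch):

	static inline bool blk_mq_rq_is_reserved(struct request *rq)
	{
		return rq->tag >= 0 &&
		       rq->tag < rq->q->tag_set->reserved_tags;
	}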

> 
> There is also the question of what happens when we want to make connects
> parallel, which is not the case at the moment...

There are several solutions for this:

1) schedule each connect on a selected CPU, so that all hw queues can be
covered (see the sketch after this list)

2) use a single tag_set for the connect command only.
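
An untested sketch of 1), where connect_work is a hypothetical per-queue
work item:

	/* run the connect from a CPU that maps to the target hw queue */
	cpu = cpumask_first(q->queue_hw_ctx[qid - 1]->cpumask);
	queue_work_on(cpu, nvme_wq, &ctrl->queues[qid].connect_work);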


Thanks, 
Ming



* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Ming Lei @ 2019-11-19  2:38 UTC
  To: Keith Busch
  Cc: Jens Axboe, Sagi Grimberg, James Smart, linux-nvme, linux-block,
	Christoph Hellwig

On Tue, Nov 19, 2019 at 09:34:37AM +0900, Keith Busch wrote:
> On Mon, Nov 18, 2019 at 04:05:56PM -0800, Sagi Grimberg wrote:
> > 
> > I'm starting to think we maybe need to get the connect out of the block
> > layer execution if it's such a big problem... It's a real shame if that is
> > the case...
> 
> We still need timeout handling for connect commands, so bypassing the
> block layer will need to figure out some other way to handle that.
> 
> This patch proposal doesn't really handle the timeouts very well either,
> though: nvme_rdma_timeout() is going to end up referencing the wrong
> queue rather than the one the request was submitted on. It doesn't
> appear to really matter in the current code since it just resets the
> entire controller, but if it ever wanted to do something queue specific...

I am not sure it is an issue.

The timeout handler needs exactly the queue that transported the request
in order to handle the timeout, right?

Or when/why would you need the original submission queue for handling a
connect request timeout?


Thanks,
Ming



* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: James Smart @ 2019-11-19 17:56 UTC
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, linux-block, Keith Busch, linux-nvme, Christoph Hellwig

On 11/18/2019 4:05 PM, Sagi Grimberg wrote:
>
> This is a much simpler fix that does not create this churn local to
> every driver. Also, I don't like the assumptions about tag reservations
> that the drivers are making locally (that the connect will have tag 0,
> for example). All this makes this look like a hack.

Agree with Sagi on this last statement. When I reviewed the patch, it
was very non-intuitive. Why the dependency on tag 0? Why a queue number
squirrelled away on this one request only? Why change the initialization
(queue pointer) of this one specific request from its hctx, and so on.
For someone without the history, it's ugly.

>
> I'm starting to think we maybe need to get the connect out of the block
> layer execution if its such a big problem... Its a real shame if that is
> the case...

Yep. This is starting to be another case where perhaps I should be
changing nvme-fc's blk-mq hctx to nvme queue relationship in a different
manner. I'm having a very hard time with all the queue resources today's
policy is wasting on targets.

-- james




* Re: [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request
From: Ming Lei @ 2019-11-20  6:35 UTC
  To: James Smart
  Cc: Jens Axboe, Sagi Grimberg, linux-nvme, linux-block, Keith Busch,
	Christoph Hellwig

On Tue, Nov 19, 2019 at 09:56:45AM -0800, James Smart wrote:
> On 11/18/2019 4:05 PM, Sagi Grimberg wrote:
> > 
> > This is a much simpler fix that does not create this churn local to
> > every driver. Also, I don't like the assumptions about tag reservations
> > that the drivers are making locally (that the connect will have tag 0,
> > for example). All this makes this look like a hack.
> 
> Agree with Sagi on this last statement. When I reviewed the patch, it
> was very non-intuitive. Why the dependency on tag 0? Why a queue number
> squirrelled away on this one request only? Why change the initialization
> (queue pointer) of this one specific request from its hctx, and so on.
> For someone without the history, it's ugly.
> 
> > 
> > I'm starting to think we maybe need to get the connect out of the block
> > layer execution if its such a big problem... Its a real shame if that is
> > the case...
> 
> Yep. This is starting to be another case where perhaps I should be
> changing nvme-fc's blk-mq hctx to nvme queue relationship in a different
> manner. I'm having a very hard time with all the queue resources today's
> policy is wasting on targets.

Wrt. the above two points, I believe neither is an issue at all with this
driver-specific approach; see my comment:

https://lore.kernel.org/linux-block/fda43a50-a484-dde7-84a1-94ccf9346bdd@broadcom.com/T/#mb72afa6ed93bc852ca266779977634cf6214b329


Thanks,
Ming



Thread overview: 16 messages
2019-11-15 10:42 [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request Ming Lei
2019-11-15 10:42 ` [PATCH RFC 1/3] block: reuse one scheduler/flush field for private request's data Ming Lei
2019-11-15 10:42 ` [PATCH RFC 2/3] nvme: don't use blk_mq_alloc_request_hctx() for allocating connect request Ming Lei
2019-11-15 10:42 ` [PATCH RFC 3/3] blk-mq: kill blk_mq_alloc_request_hctx() Ming Lei
2019-11-15 22:38 ` [PATCH RFC 0/3] blk-mq/nvme: use blk_mq_alloc_request() for NVMe's connect request Sagi Grimberg
2019-11-16  7:17   ` Ming Lei
2019-11-17  1:24     ` Bart Van Assche
2019-11-17  4:12       ` Ming Lei
2019-11-18 23:27         ` Bart Van Assche
2019-11-19  0:05     ` Sagi Grimberg
2019-11-19  0:34       ` Keith Busch
2019-11-19  1:43         ` Sagi Grimberg
2019-11-19  2:38         ` Ming Lei
2019-11-19  2:33       ` Ming Lei
2019-11-19 17:56       ` James Smart
2019-11-20  6:35         ` Ming Lei
