linux-block.vger.kernel.org archive mirror
* [PATCH v2 0/5] implement nvmf read/write queue maps
@ 2018-12-11 23:35 Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 1/5] blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues Sagi Grimberg
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

This set implements read/write queue maps for nvmf (wired up in tcp
and rdma). We allow the user to pass in an nr_write_queues argument
that maps a separate set of queues to host write I/O (or more
correctly, non-read I/O), while read I/O is served by the set of
queues controlled by the existing nr_io_queues argument.
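
For reference, below is roughly how blk-mq routes a request to one of
the two maps once a driver sets nr_maps > 1. This is a simplified,
illustrative sketch only (the helper name is made up; the real routing
lives in blk_mq_map_queue()):

	/*
	 * Simplified sketch: reads go to the read map, everything
	 * else (writes, flushes, ...) goes to the default map.
	 */
	static enum hctx_type nvmf_hctx_type(struct blk_mq_tag_set *set,
			unsigned int op)
	{
		if (set->nr_maps > 1 && !op_is_write(op))
			return HCTX_TYPE_READ;
		return HCTX_TYPE_DEFAULT;
	}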

A patchset that restores nvme-rdma polling is in the pipe.
The polling is less trivial because:
1. we can find non-I/O completions in the cq (i.e. memreg)
2. we need to start with non-polling for a sane connect and
   then switch to polling, which is not trivial behind the
   cq API we use.

Note that read/write separation can be a very clear win for rdma, but
especially for tcp, as we minimize the risk of head-of-line blocking
for mixed workloads over a single tcp byte stream.

Changes from v1:
- simplified map_queues in nvme-tcp and nvme-rdma
- improved change logs
- collected review tags
- added nr-write-queues entry in nvme-cli documentation

Sagi Grimberg (5):
  blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
  nvme-fabrics: add missing nvmf_ctrl_options documentation
  nvme-fabrics: allow user to set nr_write_queues for separate queue
    maps
  nvme-tcp: support separate queue maps for read and write
  nvme-rdma: support separate queue maps for read and write

 block/blk-mq-rdma.c         |  8 +++----
 drivers/nvme/host/fabrics.c | 15 ++++++++++++-
 drivers/nvme/host/fabrics.h |  6 +++++
 drivers/nvme/host/rdma.c    | 28 ++++++++++++++++++++---
 drivers/nvme/host/tcp.c     | 44 ++++++++++++++++++++++++++++++++-----
 include/linux/blk-mq-rdma.h |  2 +-
 6 files changed, 88 insertions(+), 15 deletions(-)

-- 
2.17.1



* [PATCH v2 1/5] blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
@ 2018-12-11 23:35 ` Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 2/5] nvme-fabrics: add missing nvmf_ctrl_options documentation Sagi Grimberg
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

This will be used by nvme-rdma for queue map separation support.
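
For example (as patch 5/5 in this series does), a caller can now map
just an offset subset of the tag set:

	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_READ],
			ctrl->device->dev, 0);

where the map's queue_offset and nr_queues have been set up by the
driver beforehand.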

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/blk-mq-rdma.c         | 8 ++++----
 drivers/nvme/host/rdma.c    | 2 +-
 include/linux/blk-mq-rdma.h | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index a71576aff3a5..45030a81a1ed 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -29,24 +29,24 @@
  * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
  * vector, we fallback to the naive mapping.
  */
-int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
 		struct ib_device *dev, int first_vec)
 {
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
-	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+	for (queue = 0; queue < map->nr_queues; queue++) {
 		mask = ib_get_vector_affinity(dev, first_vec + queue);
 		if (!mask)
 			goto fallback;
 
 		for_each_cpu(cpu, mask)
-			set->map[0].mq_map[cpu] = queue;
+			map->mq_map[cpu] = map->queue_offset + queue;
 	}
 
 	return 0;
 
 fallback:
-	return blk_mq_map_queues(&set->map[0]);
+	return blk_mq_map_queues(map);
 }
 EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f2db848f6985..5057d5ab5aaa 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1751,7 +1751,7 @@ static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_rdma_ctrl *ctrl = set->driver_data;
 
-	return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
+	return blk_mq_rdma_map_queues(&set->map[0], ctrl->device->dev, 0);
 }
 
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
index b4ade198007d..7b6ecf9ac4c3 100644
--- a/include/linux/blk-mq-rdma.h
+++ b/include/linux/blk-mq-rdma.h
@@ -4,7 +4,7 @@
 struct blk_mq_tag_set;
 struct ib_device;
 
-int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
 		struct ib_device *dev, int first_vec);
 
 #endif /* _LINUX_BLK_MQ_RDMA_H */
-- 
2.17.1



* [PATCH v2 2/5] nvme-fabrics: add missing nvmf_ctrl_options documentation
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 1/5] blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues Sagi Grimberg
@ 2018-12-11 23:35 ` Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 3/5] nvme-fabrics: allow user to set nr_write_queues for separate queue maps Sagi Grimberg
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 524a02a67817..28dc916ef26b 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -88,6 +88,9 @@ enum {
  * @max_reconnects: maximum number of allowed reconnect attempts before removing
  *              the controller, (-1) means reconnect forever, zero means remove
  *              immediately;
+ * @disable_sqflow: disable controller sq flow control
+ * @hdr_digest: generate/verify header digest (TCP)
+ * @data_digest: generate/verify data digest (TCP)
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
-- 
2.17.1



* [PATCH v2 3/5] nvme-fabrics: allow user to set nr_write_queues for separate queue maps
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 1/5] blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 2/5] nvme-fabrics: add missing nvmf_ctrl_options documentation Sagi Grimberg
@ 2018-12-11 23:35 ` Sagi Grimberg
  2018-12-11 23:35 ` [PATCH v2 4/5] nvme-tcp: support separate queue maps for read and write Sagi Grimberg
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

This argument specifies how many additional I/O queues will be
connected in create_ctrl on top of nr_io_queues. With this
configuration, I/O that carries payload from the host to the target
will use the default hctx queue map, and I/O that involves
target-to-host transfers will use the read hctx queue map.
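
As a usage sketch (the address and NQN below are made up), a controller
with 8 read queues and 4 write queues could then be created by writing
the following to /dev/nvme-fabrics:

	transport=tcp,traddr=192.168.1.100,trsvcid=4420,nqn=testnqn,nr_io_queues=8,nr_write_queues=4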

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/fabrics.c | 15 ++++++++++++++-
 drivers/nvme/host/fabrics.h |  3 +++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 9c62c6838b76..066c3a02e08b 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -616,6 +616,7 @@ static const match_table_t opt_tokens = {
 	{ NVMF_OPT_DISABLE_SQFLOW,	"disable_sqflow"	},
 	{ NVMF_OPT_HDR_DIGEST,		"hdr_digest"		},
 	{ NVMF_OPT_DATA_DIGEST,		"data_digest"		},
+	{ NVMF_OPT_NR_WRITE_QUEUES,	"nr_write_queues=%d"	},
 	{ NVMF_OPT_ERR,			NULL			}
 };
 
@@ -837,6 +838,18 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		case NVMF_OPT_DATA_DIGEST:
 			opts->data_digest = true;
 			break;
+		case NVMF_OPT_NR_WRITE_QUEUES:
+			if (match_int(args, &token)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (token <= 0) {
+				pr_err("Invalid nr_write_queues %d\n", token);
+				ret = -EINVAL;
+				goto out;
+			}
+			opts->nr_write_queues = token;
+			break;
 		default:
 			pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 				p);
@@ -954,7 +967,7 @@ EXPORT_SYMBOL_GPL(nvmf_free_options);
 #define NVMF_ALLOWED_OPTS	(NVMF_OPT_QUEUE_SIZE | NVMF_OPT_NR_IO_QUEUES | \
 				 NVMF_OPT_KATO | NVMF_OPT_HOSTNQN | \
 				 NVMF_OPT_HOST_ID | NVMF_OPT_DUP_CONNECT |\
-				 NVMF_OPT_DISABLE_SQFLOW)
+				 NVMF_OPT_DISABLE_SQFLOW | NVMF_OPT_NR_WRITE_QUEUES)
 
 static struct nvme_ctrl *
 nvmf_create_ctrl(struct device *dev, const char *buf, size_t count)
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index 28dc916ef26b..81b8fd1c0c5d 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -61,6 +61,7 @@ enum {
 	NVMF_OPT_DISABLE_SQFLOW = 1 << 14,
 	NVMF_OPT_HDR_DIGEST	= 1 << 15,
 	NVMF_OPT_DATA_DIGEST	= 1 << 16,
+	NVMF_OPT_NR_WRITE_QUEUES = 1 << 17,
 };
 
 /**
@@ -91,6 +92,7 @@ enum {
  * @disable_sqflow: disable controller sq flow control
  * @hdr_digest: generate/verify header digest (TCP)
  * @data_digest: generate/verify data digest (TCP)
+ * @nr_write_queues: number of queues for write I/O
  */
 struct nvmf_ctrl_options {
 	unsigned		mask;
@@ -110,6 +112,7 @@ struct nvmf_ctrl_options {
 	bool			disable_sqflow;
 	bool			hdr_digest;
 	bool			data_digest;
+	unsigned int		nr_write_queues;
 };
 
 /*
-- 
2.17.1



* [PATCH v2 4/5] nvme-tcp: support separate queue maps for read and write
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
                   ` (2 preceding siblings ...)
  2018-12-11 23:35 ` [PATCH v2 3/5] nvme-fabrics: allow user to set nr_write_queues for separate queue maps Sagi Grimberg
@ 2018-12-11 23:35 ` Sagi Grimberg
  2018-12-12  7:05   ` Christoph Hellwig
  2018-12-11 23:35 ` [PATCH v2 5/5] nvme-rdma: " Sagi Grimberg
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

Allow NVMF_OPT_NR_WRITE_QUEUES and additionally allocate nvme queues
for writes. Also implement .map_queues, which applies two queue maps
for the read and write queue sets.

Note that with the separate queue map, HCTX_TYPE_READ will always
use nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.
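
As a concrete (hypothetical) example, connecting with nr_io_queues=4
and nr_write_queues=2 yields 6 I/O hw queues, which .map_queues below
lays out as:

	map[HCTX_TYPE_DEFAULT]: queue_offset=0, nr_queues=2	/* hw queues 0..1, writes */
	map[HCTX_TYPE_READ]:    queue_offset=2, nr_queues=4	/* hw queues 2..5, reads  */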

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 44 +++++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 15543358e245..61eeed758f4b 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1215,7 +1215,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
 	struct linger sol = { .l_onoff = 1, .l_linger = 0 };
-	int ret, opt, rcv_pdu_size;
+	int ret, opt, rcv_pdu_size, n;
 
 	queue->ctrl = ctrl;
 	INIT_LIST_HEAD(&queue->send_list);
@@ -1271,7 +1271,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 	}
 
 	queue->sock->sk->sk_allocation = GFP_ATOMIC;
-	queue->io_cpu = (qid == 0) ? 0 : qid - 1;
+	n = (qid ? qid - 1 : 0) % num_online_cpus();
+	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
 	queue->request = NULL;
 	queue->data_remaining = 0;
 	queue->ddgst_remaining = 0;
@@ -1433,6 +1434,7 @@ static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->driver_data = ctrl;
 		set->nr_hw_queues = nctrl->queue_count - 1;
 		set->timeout = NVME_IO_TIMEOUT;
+		set->nr_maps = 2 /* default + read */;
 	}
 
 	ret = blk_mq_alloc_tag_set(set);
@@ -1527,7 +1529,12 @@ static int nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
 
 static unsigned int nvme_tcp_nr_io_queues(struct nvme_ctrl *ctrl)
 {
-	return min(ctrl->queue_count - 1, num_online_cpus());
+	unsigned int nr_io_queues;
+
+	nr_io_queues = min(ctrl->opts->nr_io_queues, num_online_cpus());
+	nr_io_queues += min(ctrl->opts->nr_write_queues, num_online_cpus());
+
+	return nr_io_queues;
 }
 
 static int nvme_alloc_io_queues(struct nvme_ctrl *ctrl)
@@ -2052,6 +2059,29 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return BLK_STS_OK;
 }
 
+static int nvme_tcp_map_queues(struct blk_mq_tag_set *set)
+{
+	struct nvme_tcp_ctrl *ctrl = set->driver_data;
+
+	set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
+	set->map[HCTX_TYPE_READ].nr_queues = ctrl->ctrl.opts->nr_io_queues;
+	if (ctrl->ctrl.opts->nr_write_queues) {
+		/* separate read/write queues */
+		set->map[HCTX_TYPE_DEFAULT].nr_queues =
+				ctrl->ctrl.opts->nr_write_queues;
+		set->map[HCTX_TYPE_READ].queue_offset =
+				ctrl->ctrl.opts->nr_write_queues;
+	} else {
+		/* mixed read/write queues */
+		set->map[HCTX_TYPE_DEFAULT].nr_queues =
+				ctrl->ctrl.opts->nr_io_queues;
+		set->map[HCTX_TYPE_READ].queue_offset = 0;
+	}
+	blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
+	blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
+	return 0;
+}
+
 static struct blk_mq_ops nvme_tcp_mq_ops = {
 	.queue_rq	= nvme_tcp_queue_rq,
 	.complete	= nvme_complete_rq,
@@ -2059,6 +2089,7 @@ static struct blk_mq_ops nvme_tcp_mq_ops = {
 	.exit_request	= nvme_tcp_exit_request,
 	.init_hctx	= nvme_tcp_init_hctx,
 	.timeout	= nvme_tcp_timeout,
+	.map_queues	= nvme_tcp_map_queues,
 };
 
 static struct blk_mq_ops nvme_tcp_admin_mq_ops = {
@@ -2113,7 +2144,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 
 	INIT_LIST_HEAD(&ctrl->list);
 	ctrl->ctrl.opts = opts;
-	ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+	ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues + 1;
 	ctrl->ctrl.sqsize = opts->queue_size - 1;
 	ctrl->ctrl.kato = opts->kato;
 
@@ -2155,7 +2186,7 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 		goto out_free_ctrl;
 	}
 
-	ctrl->queues = kcalloc(opts->nr_io_queues + 1, sizeof(*ctrl->queues),
+	ctrl->queues = kcalloc(ctrl->ctrl.queue_count, sizeof(*ctrl->queues),
 				GFP_KERNEL);
 	if (!ctrl->queues) {
 		ret = -ENOMEM;
@@ -2206,7 +2237,8 @@ static struct nvmf_transport_ops nvme_tcp_transport = {
 	.required_opts	= NVMF_OPT_TRADDR,
 	.allowed_opts	= NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
 			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
-			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST,
+			  NVMF_OPT_HDR_DIGEST | NVMF_OPT_DATA_DIGEST |
+			  NVMF_OPT_NR_IO_QUEUES,
 	.create_ctrl	= nvme_tcp_create_ctrl,
 };
 
-- 
2.17.1



* [PATCH v2 5/5] nvme-rdma: support separate queue maps for read and write
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
                   ` (3 preceding siblings ...)
  2018-12-11 23:35 ` [PATCH v2 4/5] nvme-tcp: support separate queue maps for read and write Sagi Grimberg
@ 2018-12-11 23:35 ` Sagi Grimberg
  2018-12-12  7:05   ` Christoph Hellwig
  2018-12-11 23:35 ` [PATCH v2 nvme-cli 6/5] fabrics: pass in number of write queues Sagi Grimberg
  2018-12-12 17:23 ` [PATCH v2 0/5] implement nvmf read/write queue maps James Smart
  6 siblings, 1 reply; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

Allow NVMF_OPT_NR_WRITE_QUEUES and additionally allocate nvme queues
for writes. Also implement .map_queues, which applies two queue maps
for the read and write queue sets.

Note that with the separate queue map, HCTX_TYPE_READ will always
use nr_io_queues and HCTX_TYPE_DEFAULT will use nr_write_queues.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 5057d5ab5aaa..6a7c546b4e74 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -645,6 +645,8 @@ static int nvme_rdma_alloc_io_queues(struct nvme_rdma_ctrl *ctrl)
 	nr_io_queues = min_t(unsigned int, nr_io_queues,
 				ibdev->num_comp_vectors);
 
+	nr_io_queues += min(opts->nr_write_queues, num_online_cpus());
+
 	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
 	if (ret)
 		return ret;
@@ -714,6 +716,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->driver_data = ctrl;
 		set->nr_hw_queues = nctrl->queue_count - 1;
 		set->timeout = NVME_IO_TIMEOUT;
+		set->nr_maps = 2 /* default + read */;
 	}
 
 	ret = blk_mq_alloc_tag_set(set);
@@ -1751,7 +1754,25 @@ static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
 {
 	struct nvme_rdma_ctrl *ctrl = set->driver_data;
 
-	return blk_mq_rdma_map_queues(&set->map[0], ctrl->device->dev, 0);
+	set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
+	set->map[HCTX_TYPE_READ].nr_queues = ctrl->ctrl.opts->nr_io_queues;
+	if (ctrl->ctrl.opts->nr_write_queues) {
+		/* separate read/write queues */
+		set->map[HCTX_TYPE_DEFAULT].nr_queues =
+				ctrl->ctrl.opts->nr_write_queues;
+		set->map[HCTX_TYPE_READ].queue_offset =
+				ctrl->ctrl.opts->nr_write_queues;
+	} else {
+		/* mixed read/write queues */
+		set->map[HCTX_TYPE_DEFAULT].nr_queues =
+				ctrl->ctrl.opts->nr_io_queues;
+		set->map[HCTX_TYPE_READ].queue_offset = 0;
+	}
+	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_DEFAULT],
+			ctrl->device->dev, 0);
+	blk_mq_rdma_map_queues(&set->map[HCTX_TYPE_READ],
+			ctrl->device->dev, 0);
+	return 0;
 }
 
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
@@ -1906,7 +1927,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
 	INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
 
-	ctrl->ctrl.queue_count = opts->nr_io_queues + 1; /* +1 for admin queue */
+	ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues + 1;
 	ctrl->ctrl.sqsize = opts->queue_size - 1;
 	ctrl->ctrl.kato = opts->kato;
 
@@ -1957,7 +1978,8 @@ static struct nvmf_transport_ops nvme_rdma_transport = {
 	.module		= THIS_MODULE,
 	.required_opts	= NVMF_OPT_TRADDR,
 	.allowed_opts	= NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
-			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO,
+			  NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
+			  NVMF_OPT_NR_IO_QUEUES,
 	.create_ctrl	= nvme_rdma_create_ctrl,
 };
 
-- 
2.17.1



* [PATCH v2 nvme-cli 6/5] fabrics: pass in number of write queues
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
                   ` (4 preceding siblings ...)
  2018-12-11 23:35 ` [PATCH v2 5/5] nvme-rdma: " Sagi Grimberg
@ 2018-12-11 23:35 ` Sagi Grimberg
  2018-12-12 17:23 ` [PATCH v2 0/5] implement nvmf read/write queue maps James Smart
  6 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-11 23:35 UTC (permalink / raw)
  To: linux-nvme; +Cc: linux-block, Christoph Hellwig, Keith Busch

nr_write_queues specifies the number of additional queues that
will be connected. These queues will host write I/O (host to target
payload) while nr_io_queues will host read I/O (target to host payload).
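
For example (hypothetical target address and NQN), connecting with
8 read queues and 2 write queues:

	nvme connect -t rdma -a 192.168.1.100 -n testnqn -i 8 -W 2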

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 Documentation/nvme-connect.txt |  5 +++++
 fabrics.c                      | 11 +++++++++++
 2 files changed, 16 insertions(+)

diff --git a/Documentation/nvme-connect.txt b/Documentation/nvme-connect.txt
index af78b2a8238f..d4a0e6678475 100644
--- a/Documentation/nvme-connect.txt
+++ b/Documentation/nvme-connect.txt
@@ -16,6 +16,7 @@ SYNOPSIS
 		[--host-traddr=<traddr>   | -w <traddr>]
 		[--hostnqn=<hostnqn>      | -q <hostnqn>]
 		[--nr-io-queues=<#>       | -i <#>]
+		[--nr-write-queues=<#>    | -W <#>]
 		[--queue-size=<#>         | -Q <#>]
 		[--keep-alive-tmo=<#>     | -k <#>]
 		[--reconnect-delay=<#>    | -c <#>]
@@ -75,6 +76,10 @@ OPTIONS
 --nr-io-queues=<#>::
 	Overrides the default number of I/O queues create by the driver.
 
+-W <#>::
+--nr-write-queues=<#>::
+	Adds additional queues that will be used for write I/O.
+
 -Q <#>::
 --queue-size=<#>::
 	Overrides the default number of elements in the I/O queues created
diff --git a/fabrics.c b/fabrics.c
index 93f3410a1c72..bc6a0b7e4e21 100644
--- a/fabrics.c
+++ b/fabrics.c
@@ -53,6 +53,7 @@ static struct config {
 	char *hostnqn;
 	char *hostid;
 	int  nr_io_queues;
+	int  nr_write_queues;
 	int  queue_size;
 	int  keep_alive_tmo;
 	int  reconnect_delay;
@@ -621,6 +622,8 @@ static int build_options(char *argstr, int max_len)
 		    add_argument(&argstr, &max_len, "hostid", cfg.hostid)) ||
 	    add_int_argument(&argstr, &max_len, "nr_io_queues",
 				cfg.nr_io_queues) ||
+	    add_int_argument(&argstr, &max_len, "nr_write_queues",
+				cfg.nr_write_queues) ||
 	    add_int_argument(&argstr, &max_len, "queue_size", cfg.queue_size) ||
 	    add_int_argument(&argstr, &max_len, "keep_alive_tmo",
 				cfg.keep_alive_tmo) ||
@@ -694,6 +697,13 @@ retry:
 		p += len;
 	}
 
+	if (cfg.nr_write_queues) {
+		len = sprintf(p, ",nr_write_queues=%d", cfg.nr_write_queues);
+		if (len < 0)
+			return -EINVAL;
+		p += len;
+	}
+
 	if (cfg.host_traddr) {
 		len = sprintf(p, ",host_traddr=%s", cfg.host_traddr);
 		if (len < 0)
@@ -1009,6 +1019,7 @@ int connect(const char *desc, int argc, char **argv)
 		{"hostnqn",         'q', "LIST", CFG_STRING, &cfg.hostnqn,         required_argument, "user-defined hostnqn" },
 		{"hostid",          'I', "LIST", CFG_STRING, &cfg.hostid,      required_argument, "user-defined hostid (if default not used)"},
 		{"nr-io-queues",    'i', "LIST", CFG_INT, &cfg.nr_io_queues,    required_argument, "number of io queues to use (default is core count)" },
+		{"nr-write-queues",  'W', "LIST", CFG_INT, &cfg.nr_write_queues,    required_argument, "number of write queues to use (default 0)" },
 		{"queue-size",      'Q', "LIST", CFG_INT, &cfg.queue_size,      required_argument, "number of io queue elements to use (default 128)" },
 		{"keep-alive-tmo",  'k', "LIST", CFG_INT, &cfg.keep_alive_tmo,  required_argument, "keep alive timeout period in seconds" },
 		{"reconnect-delay", 'c', "LIST", CFG_INT, &cfg.reconnect_delay, required_argument, "reconnect timeout period in seconds" },
-- 
2.17.1



* Re: [PATCH v2 4/5] nvme-tcp: support separate queue maps for read and write
  2018-12-11 23:35 ` [PATCH v2 4/5] nvme-tcp: support separate queue maps for read and write Sagi Grimberg
@ 2018-12-12  7:05   ` Christoph Hellwig
  2018-12-12  7:10     ` Sagi Grimberg
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2018-12-12  7:05 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, linux-block, Christoph Hellwig, Keith Busch

> -	queue->io_cpu = (qid == 0) ? 0 : qid - 1;
> +	n = (qid ? qid - 1 : 0) % num_online_cpus();
> +	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);

Nitpick: can you use an if/else here?  The ? : construct just looks
like obfuscation.
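
Something like the below, presumably (an untested sketch of the
suggested rewrite, same semantics as the ? : version):

	if (qid)
		n = (qid - 1) % num_online_cpus();
	else
		n = 0;
	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);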

If nothing else pops up I can fix that up when applying.

Otherwise:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 5/5] nvme-rdma: support separate queue maps for read and write
  2018-12-11 23:35 ` [PATCH v2 5/5] nvme-rdma: " Sagi Grimberg
@ 2018-12-12  7:05   ` Christoph Hellwig
  0 siblings, 0 replies; 16+ messages in thread
From: Christoph Hellwig @ 2018-12-12  7:05 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-nvme, linux-block, Christoph Hellwig, Keith Busch

Looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v2 4/5] nvme-tcp: support separate queue maps for read and write
  2018-12-12  7:05   ` Christoph Hellwig
@ 2018-12-12  7:10     ` Sagi Grimberg
  0 siblings, 0 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-12  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, linux-block, Keith Busch


>> -	queue->io_cpu = (qid == 0) ? 0 : qid - 1;
>> +	n = (qid ? qid - 1 : 0) % num_online_cpus();
>> +	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
> 
> Nitpick: can you use an if/else here?  The ? : construct just looks
> like obfuscation.
> 
> If nothing else pops up I can fix that up when applying.

Actually I noticed that NVMF_OPT_NR_WRITE_QUEUES was added
to NVMF_ALLOWED_OPTS instead of locally to the supporting drivers...

I'll send another quick revision tomorrow...


* Re: [PATCH v2 0/5] implement nvmf read/write queue maps
  2018-12-11 23:35 [PATCH v2 0/5] implement nvmf read/write queue maps Sagi Grimberg
                   ` (5 preceding siblings ...)
  2018-12-11 23:35 ` [PATCH v2 nvme-cli 6/5] fabrics: pass in number of write queues Sagi Grimberg
@ 2018-12-12 17:23 ` James Smart
  2018-12-12 17:58   ` Sagi Grimberg
  6 siblings, 1 reply; 16+ messages in thread
From: James Smart @ 2018-12-12 17:23 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme; +Cc: linux-block, Keith Busch, Christoph Hellwig



On 12/11/2018 3:35 PM, Sagi Grimberg wrote:
> This set implements read/write queue maps for nvmf (wired up in tcp
> and rdma). We allow the user to pass in an nr_write_queues argument
> that maps a separate set of queues to host write I/O (or more
> correctly, non-read I/O), while read I/O is served by the set of
> queues controlled by the existing nr_io_queues argument.
>
> A patchset that restores nvme-rdma polling is in the pipe.
> The polling is less trivial because:
> 1. we can find non-I/O completions in the cq (i.e. memreg)
> 2. we need to start with non-polling for a sane connect and
>    then switch to polling, which is not trivial behind the
>    cq API we use.
>
> Note that read/write separation can be a very clear win for rdma, but
> especially for tcp, as we minimize the risk of head-of-line blocking
> for mixed workloads over a single tcp byte stream.

Sagi,

What other wins are there for this split?

I'm considering whether it's worthwhile for fc as well, but the HOL issue
doesn't exist with fc. What else is being resolved?

-- james



* Re: [PATCH v2 0/5] implement nvmf read/write queue maps
  2018-12-12 17:23 ` [PATCH v2 0/5] implement nvmf read/write queue maps James Smart
@ 2018-12-12 17:58   ` Sagi Grimberg
  2018-12-12 19:15     ` James Smart
  2018-12-13 12:04     ` Hannes Reinecke
  0 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2018-12-12 17:58 UTC (permalink / raw)
  To: James Smart, linux-nvme; +Cc: linux-block, Keith Busch, Christoph Hellwig


> Sagi,
> 
> What other wins are there for this split?
> 
> I'm considering whether it's worthwhile for fc as well, but the HOL issue
> doesn't exist with fc. What else is being resolved?

I've pondered it myself, which is why I didn't add it to fc as well
(would have been easy enough I think). I guess that with this the
user can limit writes compared to reads by assigning fewer queues, as
Jens suggested in his patches...


* Re: [PATCH v2 0/5] implement nvmf read/write queue maps
  2018-12-12 17:58   ` Sagi Grimberg
@ 2018-12-12 19:15     ` James Smart
  2018-12-13 12:04     ` Hannes Reinecke
  1 sibling, 0 replies; 16+ messages in thread
From: James Smart @ 2018-12-12 19:15 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme; +Cc: linux-block, Keith Busch, Christoph Hellwig



On 12/12/2018 9:58 AM, Sagi Grimberg wrote:
>
>> Sagi,
>>
>> What other wins are there for this split?
>>
>> I'm considering whether it's worthwhile for fc as well, but the HOL
>> issue doesn't exist with fc. What else is being resolved?
>
> I've pondered it myself, which is why I didn't add it to fc as well
> (would have been easy enough I think). I guess that with this the
> user can limit writes compared to reads by assigning fewer queues, as
> Jens suggested in his patches...

Not enough to interest me. I also would prefer not to require more 
queues - it only increases the oversubscription requirements on the 
target, which is already a major issue on real shared subsystems.

Thanks

-- james



* Re: [PATCH v2 0/5] implement nvmf read/write queue maps
  2018-12-12 17:58   ` Sagi Grimberg
  2018-12-12 19:15     ` James Smart
@ 2018-12-13 12:04     ` Hannes Reinecke
  2018-12-13 14:02       ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2018-12-13 12:04 UTC (permalink / raw)
  To: Sagi Grimberg, James Smart, linux-nvme
  Cc: linux-block, Keith Busch, Christoph Hellwig

On 12/12/18 6:58 PM, Sagi Grimberg wrote:
> 
>> Sagi,
>>
>> What other wins are there for this split?
>>
>> I'm considering whether it's worthwhile for fc as well, but the HOL
>> issue doesn't exist with fc. What else is being resolved?
> 
> I've pondered it myself, which is why I didn't add it to fc as well
> (would have been easy enough I think). I guess that with this the
>> user can limit writes compared to reads by assigning fewer queues, as
>> Jens suggested in his patches...
 >
Weelll ... isn't that what blkcg is for?
I do understand if one would want to implement something like this if 
there's a real benefit from the _hardware_ side, but if there isn't we 
should try to keep it in the upper layers where it belongs.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH v2 0/5] implement nvmf read/write queue maps
  2018-12-13 12:04     ` Hannes Reinecke
@ 2018-12-13 14:02       ` Jens Axboe
  2018-12-13 14:19         ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2018-12-13 14:02 UTC (permalink / raw)
  To: Hannes Reinecke, Sagi Grimberg, James Smart, linux-nvme
  Cc: linux-block, Keith Busch, Christoph Hellwig

On 12/13/18 5:04 AM, Hannes Reinecke wrote:
> On 12/12/18 6:58 PM, Sagi Grimberg wrote:
>>
>>> Sagi,
>>>
>>> What other wins are there for this split?
>>>
>>> I'm considering whether it's worthwhile for fc as well, but the HOL
>>> issue doesn't exist with fc. What else is being resolved?
>>
>> I've pondered it myself, which is why I didn't add it to fc as well
>> (would have been easy enough I think). I guess that with this the
>> user can limit writes compared to reads by assigning fewer queues, as
>> Jens suggested in his patches...
>  >
> Weelll ... isn't that what blkcg is for?
> I do understand if one would want to implement something like this if 
> there's a real benefit from the _hardware_ side, but if there isn't we 
> should try to keep it in the upper layers where it belongs.

The point is that nvme round-robins queues: if you have 1 write queue
for N read queues, you get better read servicing as writes
can't flood more than one queue.

Obviously this isn't a replacement for any kind of real throttling,
but it's cheap and easy to do.

The multiple queue maps are more useful for the polled IO, and for
future prioritized IO in general. But it is also handy for not
oversubscribing the write side.

-- 
Jens Axboe



* Re: [PATCH v2 0/5] implement nvmf read/write queue maps
  2018-12-13 14:02       ` Jens Axboe
@ 2018-12-13 14:19         ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2018-12-13 14:19 UTC (permalink / raw)
  To: Jens Axboe, Sagi Grimberg, James Smart, linux-nvme
  Cc: linux-block, Keith Busch, Christoph Hellwig

On 12/13/18 3:02 PM, Jens Axboe wrote:
> On 12/13/18 5:04 AM, Hannes Reinecke wrote:
>> On 12/12/18 6:58 PM, Sagi Grimberg wrote:
>>>
>>>> Sagi,
>>>>
>>>> What other wins are there for this split?
>>>>
>>>> I'm considering whether it's worthwhile for fc as well, but the HOL
>>>> issue doesn't exist with fc. What else is being resolved?
>>>
>>> I've pondered it myself, which is why I didn't add it to fc as well
>>> (would have been easy enough I think). I guess that with this the
>>> user can limit writes compared to reads by assigning fewer queues, as
>>> Jens suggested in his patches...
>>>
>> Weelll ... isn't that what blkcg is for?
>> I do understand if one would want to implement something like this if
>> there's a real benefit from the _hardware_ side, but if there isn't we
>> should try to keep it in the upper layers where it belongs.
> 
> The point is that nvme round-robins queues: if you have 1 write queue
> for N read queues, you get better read servicing as writes
> can't flood more than one queue.
> 
> Obviously this isn't a replacement for any kind of real throttling,
> but it's cheap and easy to do.
> 
> The multiple queue maps are more useful for the polled IO, and for
> future prioritized IO in general. But it is also handy for not
> oversubscribing the write side.
> 
But how can you tell?
It's not that the hardware tells us "Oh, incidentally, I'm far slower 
writing than reading, please use only so many queues for writing."
So any number we pick would be pretty heuristic to start with.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
