* [PATCHv3 0/3] nvme: NUMA locality for fabrics
@ 2018-11-02  9:56 Hannes Reinecke
  2018-11-02  9:56 ` [PATCH 1/3] nvme: NUMA locality information " Hannes Reinecke
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-02  9:56 UTC (permalink / raw)


Hi all,

here's a patchset to leverage NUMA locality information for fabric controllers.
This is the second attempt at doing so; after discussion with hch we came
to the conclusion that the manual configuration approach in the initial
submission would only lead to more confusion and suboptimal setups.

So here's now a version with automatic NUMA balancing, where we attempt
to split the submitting CPUs/cores evenly across the available
controllers.

With this patchset I'm seeing a performance increase from
262k IOPS to 344k IOPS, measured against a NetApp AF700.

As usual, comments and reviews are welcome.

Changes to v2:
- use 'numa_node' instead of 'node_id' (suggested by Sagi)
- rediff patches for better readability

Hannes Reinecke (3):
  nvme: NUMA locality information for fabrics
  nvme-multipath: Select paths based on NUMA locality
  nvme-multipath: automatic NUMA path balancing

 drivers/nvme/host/core.c      |  36 ++++++++++++-
 drivers/nvme/host/fc.c        |   5 +-
 drivers/nvme/host/multipath.c | 119 +++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h      |   3 ++
 drivers/nvme/host/rdma.c      |   6 ++-
 5 files changed, 161 insertions(+), 8 deletions(-)

-- 
2.16.4


* [PATCH 1/3] nvme: NUMA locality information for fabrics
  2018-11-02  9:56 [PATCHv3 0/3] nvme: NUMA locality for fabrics Hannes Reinecke
@ 2018-11-02  9:56 ` Hannes Reinecke
  2018-11-08  9:22   ` Christoph Hellwig
  2018-11-02  9:56 ` [PATCH 2/3] nvme-multipath: Select paths based on NUMA locality Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-02  9:56 UTC (permalink / raw)


From: Hannes Reinecke <hare@suse.com>

Add a new field 'numa_node' to the nvme_ctrl structure to hold the
NUMA locality information of the underlying hardware.
With that we can allocate the memory structures on the same NUMA
node as the underlying hardware.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c      | 4 +++-
 drivers/nvme/host/fc.c        | 5 +++--
 drivers/nvme/host/multipath.c | 4 ++--
 drivers/nvme/host/nvme.h      | 1 +
 drivers/nvme/host/rdma.c      | 5 +++--
 5 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2e65be8b1387..6dfcb72aa907 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2766,6 +2766,7 @@ static ssize_t  field##_show(struct device *dev,				\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
 nvme_show_int_function(cntlid);
+nvme_show_int_function(numa_node);
 
 static ssize_t nvme_sysfs_delete(struct device *dev,
 				struct device_attribute *attr, const char *buf,
@@ -2845,6 +2846,7 @@ static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_subsysnqn.attr,
 	&dev_attr_address.attr,
 	&dev_attr_state.attr,
+	&dev_attr_numa_node.attr,
 	NULL
 };
 
@@ -3055,7 +3057,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	struct gendisk *disk;
 	struct nvme_id_ns *id;
 	char disk_name[DISK_NAME_LEN];
-	int node = dev_to_node(ctrl->dev), flags = GENHD_FL_EXT_DEVT;
+	int node = ctrl->numa_node, flags = GENHD_FL_EXT_DEVT;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 0b70c8bab045..a22ff6fb82bc 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2433,7 +2433,7 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
 	ctrl->tag_set.ops = &nvme_fc_mq_ops;
 	ctrl->tag_set.queue_depth = ctrl->ctrl.opts->queue_size;
 	ctrl->tag_set.reserved_tags = 1; /* fabric connect */
-	ctrl->tag_set.numa_node = NUMA_NO_NODE;
+	ctrl->tag_set.numa_node = ctrl->ctrl.numa_node;
 	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	ctrl->tag_set.cmd_size =
 		struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
@@ -3000,6 +3000,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 
 	ctrl->ctrl.opts = opts;
 	ctrl->ctrl.nr_reconnects = 0;
+	ctrl->ctrl.numa_node = dev_to_node(lport->dev);
 	INIT_LIST_HEAD(&ctrl->ctrl_list);
 	ctrl->lport = lport;
 	ctrl->rport = rport;
@@ -3038,7 +3039,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	ctrl->admin_tag_set.ops = &nvme_fc_admin_mq_ops;
 	ctrl->admin_tag_set.queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 	ctrl->admin_tag_set.reserved_tags = 2; /* fabric connect + Keep-Alive */
-	ctrl->admin_tag_set.numa_node = NUMA_NO_NODE;
+	ctrl->admin_tag_set.numa_node = ctrl->ctrl.numa_node;
 	ctrl->admin_tag_set.cmd_size =
 		struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
 			    ctrl->lport->ops->fcprqst_priv_sz);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5e3cc8c59a39..8e03cda770c5 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -141,7 +141,7 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 		    test_bit(NVME_NS_ANA_PENDING, &ns->flags))
 			continue;
 
-		distance = node_distance(node, dev_to_node(ns->ctrl->dev));
+		distance = node_distance(node, ns->ctrl->numa_node);
 
 		switch (ns->ana_state) {
 		case NVME_ANA_OPTIMIZED:
@@ -276,7 +276,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	if (!(ctrl->subsys->cmic & (1 << 1)) || !multipath)
 		return 0;
 
-	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
+	q = blk_alloc_queue_node(GFP_KERNEL, ctrl->numa_node, NULL);
 	if (!q)
 		goto out;
 	q->queuedata = head;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cee79cb388af..f608fc11d329 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -153,6 +153,7 @@ struct nvme_ctrl {
 	struct request_queue *connect_q;
 	struct device *dev;
 	int instance;
+	int numa_node;
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d181cafedc58..4468d672ced9 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -693,7 +693,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->ops = &nvme_rdma_admin_mq_ops;
 		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 		set->reserved_tags = 2; /* connect + keep-alive */
-		set->numa_node = NUMA_NO_NODE;
+		set->numa_node = nctrl->numa_node;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 			SG_CHUNK_SIZE * sizeof(struct scatterlist);
 		set->driver_data = ctrl;
@@ -706,7 +706,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->ops = &nvme_rdma_mq_ops;
 		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = 1; /* fabric connect */
-		set->numa_node = NUMA_NO_NODE;
+		set->numa_node = nctrl->numa_node;
 		set->flags = BLK_MQ_F_SHOULD_MERGE;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 			SG_CHUNK_SIZE * sizeof(struct scatterlist);
@@ -762,6 +762,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		return error;
 
 	ctrl->device = ctrl->queues[0].device;
+	ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
 
 	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev);
 
-- 
2.16.4


* [PATCH 2/3] nvme-multipath: Select paths based on NUMA locality
  2018-11-02  9:56 [PATCHv3 0/3] nvme: NUMA locality for fabrics Hannes Reinecke
  2018-11-02  9:56 ` [PATCH 1/3] nvme: NUMA locality information " Hannes Reinecke
@ 2018-11-02  9:56 ` Hannes Reinecke
  2018-11-08  9:32   ` Christoph Hellwig
  2018-11-02  9:56 ` [PATCH 3/3] nvme-multipath: automatic NUMA path balancing Hannes Reinecke
  2018-11-16  8:12 ` [PATCHv3 0/3] nvme: NUMA locality for fabrics Christoph Hellwig
  3 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-02  9:56 UTC (permalink / raw)


This patch creates a per-controller map to hold the NUMA locality
information. With that we can route I/O to the controller which is
'nearest' to the issuing CPU and decrease the latency there.
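
The map is also exported via a new 'node_map' sysfs attribute; on a
two-node host with the controller attached to node 0 it would read
e.g. (path assuming the usual sysfs layout):

    # cat /sys/class/nvme/nvme0/node_map
    10 20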

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/nvme/host/core.c      | 32 +++++++++++++++++++++++++++++++-
 drivers/nvme/host/fc.c        |  2 +-
 drivers/nvme/host/multipath.c | 30 +++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h      |  2 ++
 drivers/nvme/host/rdma.c      |  3 ++-
 5 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6dfcb72aa907..113ddacd6127 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2204,6 +2204,16 @@ static int nvme_active_ctrls(struct nvme_subsystem *subsys)
 	return count;
 }
 
+void nvme_set_ctrl_node(struct nvme_ctrl *ctrl, int numa_node)
+{
+	ctrl->numa_node = numa_node;
+	if (numa_node == NUMA_NO_NODE)
+		return;
+	ctrl->node_map = kzalloc(num_possible_nodes() * sizeof(int),
+				 GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(nvme_set_ctrl_node);
+
 static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
 	struct nvme_subsystem *subsys, *found;
@@ -2834,6 +2844,23 @@ static ssize_t nvme_sysfs_show_address(struct device *dev,
 }
 static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
 
+static ssize_t nvme_sysfs_show_node_map(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	int node;
+	ssize_t offset = 0;
+
+	for_each_node(node)
+		offset += snprintf(buf + offset, PAGE_SIZE - offset,
+				   "%d ", ctrl->node_map[node]);
+	offset += snprintf(buf + offset, PAGE_SIZE - offset, "\n");
+
+	return offset;
+}
+static DEVICE_ATTR(node_map, S_IRUGO, nvme_sysfs_show_node_map, NULL);
+
 static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_reset_controller.attr,
 	&dev_attr_rescan_controller.attr,
@@ -2847,6 +2874,7 @@ static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_address.attr,
 	&dev_attr_state.attr,
 	&dev_attr_numa_node.attr,
+	&dev_attr_node_map.attr,
 	NULL
 };
 
@@ -2860,7 +2888,8 @@ static umode_t nvme_dev_attrs_are_visible(struct kobject *kobj,
 		return 0;
 	if (a == &dev_attr_address.attr && !ctrl->ops->get_address)
 		return 0;
-
+	if (a == &dev_attr_node_map.attr && !ctrl->node_map)
+		return 0;
 	return a->mode;
 }
 
@@ -3511,6 +3540,7 @@ static void nvme_free_ctrl(struct device *dev)
 
 	ida_simple_remove(&nvme_instance_ida, ctrl->instance);
 	kfree(ctrl->effects);
+	kfree(ctrl->node_map);
 	nvme_mpath_uninit(ctrl);
 
 	if (subsys) {
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a22ff6fb82bc..43c60ca49b3f 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3000,7 +3000,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 
 	ctrl->ctrl.opts = opts;
 	ctrl->ctrl.nr_reconnects = 0;
-	ctrl->ctrl.numa_node = dev_to_node(lport->dev);
+	nvme_set_ctrl_node(&ctrl->ctrl, dev_to_node(lport->dev));
 	INIT_LIST_HEAD(&ctrl->ctrl_list);
 	ctrl->lport = lport;
 	ctrl->rport = rport;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8e03cda770c5..6d1412af7332 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -141,7 +141,8 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 		    test_bit(NVME_NS_ANA_PENDING, &ns->flags))
 			continue;
 
-		distance = node_distance(node, ns->ctrl->numa_node);
+		distance = ns->ctrl->node_map ?
+			ns->ctrl->node_map[node] : INT_MAX;
 
 		switch (ns->ana_state) {
 		case NVME_ANA_OPTIMIZED:
@@ -258,6 +259,31 @@ static void nvme_requeue_work(struct work_struct *work)
 	}
 }
 
+void nvme_mpath_balance_subsys(struct nvme_subsystem *subsys)
+{
+	struct nvme_ctrl *ctrl;
+	int node;
+
+	mutex_lock(&subsys->lock);
+
+	/*
+	 * Reset set NUMA distance
+	 *    During creation the NUMA distance is only set
+	 *    per controller, so after connecting the other
+	 *    controllers the NUMA information on the existing
+	 *    ones is incorrect.
+	 */
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		for_each_node(node) {
+			if (!ctrl->node_map)
+				continue;
+			ctrl->node_map[node] =
+				node_distance(node, ctrl->numa_node);
+		}
+	}
+	mutex_unlock(&subsys->lock);
+}
+
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 {
 	struct request_queue *q;
@@ -548,6 +574,8 @@ int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
 	int error;
 
+	nvme_mpath_balance_subsys(ctrl->subsys);
+
 	if (!nvme_ctrl_use_ana(ctrl))
 		return 0;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f608fc11d329..aebf78b2946e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -154,6 +154,7 @@ struct nvme_ctrl {
 	struct device *dev;
 	int instance;
 	int numa_node;
+	int *node_map;
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
@@ -438,6 +439,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
+void nvme_set_ctrl_node(struct nvme_ctrl *ctrl, int node);
 
 #define NVME_QID_ANY -1
 struct request *nvme_alloc_request(struct request_queue *q,
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4468d672ced9..85520b8d4bea 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -762,7 +762,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		return error;
 
 	ctrl->device = ctrl->queues[0].device;
-	ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
+	nvme_set_ctrl_node(&ctrl->ctrl,
+			   dev_to_node(ctrl->device->dev->dma_device));
 
 	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev);
 
-- 
2.16.4


* [PATCH 3/3] nvme-multipath: automatic NUMA path balancing
  2018-11-02  9:56 [PATCHv3 0/3] nvme: NUMA locality for fabrics Hannes Reinecke
  2018-11-02  9:56 ` [PATCH 1/3] nvme: NUMA locality information " Hannes Reinecke
  2018-11-02  9:56 ` [PATCH 2/3] nvme-multipath: Select paths based on NUMA locality Hannes Reinecke
@ 2018-11-02  9:56 ` Hannes Reinecke
  2018-11-08  9:36   ` Christoph Hellwig
  2018-11-16  8:12 ` [PATCHv3 0/3] nvme: NUMA locality for fabrics Christoph Hellwig
  3 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-02  9:56 UTC (permalink / raw)


In order to utilize both paths on dual-ported HBAs we cannot rely
on the NUMA affinity alone, but rather have to distribute the
locality information to get the best possible result.
This patch implements a two-pass algorithm for assigning NUMA
locality information:
1. Distribute the existing locality information so that no node has
more than one 'local' controller
2. Assign a 'local' controller to each of the remaining nodes,
so that the overall weight (i.e. the sum of all locality information)
per ctrl is minimal.
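
As an illustration (with made-up but typical distance values), consider
a host with two NUMA nodes and two controllers that are both attached
to node 0. After resetting the distances both controllers start with

    node_map = { LOCAL_DISTANCE, REMOTE_DISTANCE } = { 10, 20 }

Pass 1 finds two 'local' controllers on node 0 and moves the second one
to the otherwise unassigned node 1, giving { 10, 20 } and { 20, 10 },
so I/O issued from node 0 is routed to the first controller and I/O
from node 1 to the second. Pass 2 then has nothing left to do, as every
node already has exactly one 'local' controller.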

Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/nvme/host/multipath.c | 89 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 6d1412af7332..4944ffdf6831 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -259,15 +259,60 @@ static void nvme_requeue_work(struct work_struct *work)
 	}
 }
 
+void nvme_mpath_distribute_paths(struct nvme_subsystem *subsys, int num_ctrls,
+				 struct nvme_ctrl *ctrl, int numa_node)
+{
+	int node;
+	int found_node = NUMA_NO_NODE;
+	int max = LOCAL_DISTANCE * num_ctrls;
+
+	for_each_node(node) {
+		struct nvme_ctrl *c;
+		int sum = 0;
+
+		list_for_each_entry(c, &subsys->ctrls, subsys_entry)
+			sum += c->node_map[node];
+		if (sum > max) {
+			max = sum;
+			found_node = node;
+		}
+	}
+	if (found_node != NUMA_NO_NODE) {
+		ctrl->node_map[found_node] = LOCAL_DISTANCE;
+		ctrl->node_map[numa_node] = REMOTE_DISTANCE;
+	}
+}
+
+void nvme_mpath_balance_node(struct nvme_subsystem *subsys,
+			     int num_ctrls, int numa_node)
+{
+	struct nvme_ctrl *found = NULL, *ctrl;
+	int max = LOCAL_DISTANCE * num_ctrls, node;
+
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		int sum = 0;
+
+		for_each_node(node)
+			sum += ctrl->node_map[node];
+		if (sum > max) {
+			max = sum;
+			found = ctrl;
+		}
+	}
+	if (found)
+		found->node_map[numa_node] = LOCAL_DISTANCE;
+}
+
 void nvme_mpath_balance_subsys(struct nvme_subsystem *subsys)
 {
 	struct nvme_ctrl *ctrl;
+	int num_ctrls = 0;
 	int node;
 
 	mutex_lock(&subsys->lock);
 
 	/*
-	 * Reset set NUMA distance
+	 * 1. Reset set NUMA distance
 	 *    During creation the NUMA distance is only set
 	 *    per controller, so after connecting the other
 	 *    controllers the NUMA information on the existing
@@ -280,7 +325,49 @@ void nvme_mpath_balance_subsys(struct nvme_subsystem *subsys)
 			ctrl->node_map[node] =
 				node_distance(node, ctrl->numa_node);
 		}
+		num_ctrls++;
+	}
+
+	/*
+	 * 2. Distribute optimal paths:
+	 *    Only one primary paths per node.
+	 *    Additional primary paths are moved to unassigned nodes.
+	 */
+	for_each_node(node) {
+		bool optimal = false;
+
+		list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+			if (ctrl->node_map[node] == LOCAL_DISTANCE) {
+				if (!optimal) {
+					optimal = true;
+					continue;
+				}
+				nvme_mpath_distribute_paths(subsys, num_ctrls,
+							    ctrl, node);
+			}
+		}
+	}
+
+	/*
+	 * 3. Balance unassigned nodes:
+	 *    Each unassigned node should have one primary path;
+	 *    the primary path is assigned to the ctrl with the
+	 *    minimal weight (ie the sum of distances over all nodes)
+	 */
+	for_each_node(node) {
+		bool optimal = false;
+
+		list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+			if (ctrl->node_map[node] == LOCAL_DISTANCE) {
+				optimal = true;
+				break;
+			}
+		}
+		if (optimal)
+			continue;
+		nvme_mpath_balance_node(subsys, num_ctrls, node);
 	}
+
 	mutex_unlock(&subsys->lock);
 }
 
-- 
2.16.4


* [PATCH 1/3] nvme: NUMA locality information for fabrics
  2018-11-02  9:56 ` [PATCH 1/3] nvme: NUMA locality information " Hannes Reinecke
@ 2018-11-08  9:22   ` Christoph Hellwig
  2018-11-08  9:35     ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-08  9:22 UTC (permalink / raw)


This patch looks good, but the description seems extremely misleading.

I've rewritten it as below for inclusion in nvme-4.21, let me know
what you think:

---
From 01d2c22c9d6cd2f0cb2ac2ba5cf98c2ac2d8624e Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.com>
Date: Fri, 2 Nov 2018 10:56:39 +0100
Subject: nvme: add a numa_node field to struct nvme_ctrl

Instead of directly poking into the struct device add a new numa_node
field to struct nvme_ctrl.  This allows fabrics drivers where ctrl->dev
is a virtual device to support NUMA affinity as well.

Also expose the field as a sysfs attribute, and populate it for the
RDMA and FC transports.
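
The new attribute shows up next to the other controller attributes,
e.g. (path assuming the usual sysfs layout):

    # cat /sys/class/nvme/nvme0/numa_node
    0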

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c      | 4 +++-
 drivers/nvme/host/fc.c        | 5 +++--
 drivers/nvme/host/multipath.c | 4 ++--
 drivers/nvme/host/nvme.h      | 1 +
 drivers/nvme/host/rdma.c      | 5 +++--
 5 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 41e00404f7a5..80b5a8c60b91 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2764,6 +2764,7 @@ static ssize_t  field##_show(struct device *dev,				\
 static DEVICE_ATTR(field, S_IRUGO, field##_show, NULL);
 
 nvme_show_int_function(cntlid);
+nvme_show_int_function(numa_node);
 
 static ssize_t nvme_sysfs_delete(struct device *dev,
 				struct device_attribute *attr, const char *buf,
@@ -2843,6 +2844,7 @@ static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_subsysnqn.attr,
 	&dev_attr_address.attr,
 	&dev_attr_state.attr,
+	&dev_attr_numa_node.attr,
 	NULL
 };
 
@@ -3053,7 +3055,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	struct gendisk *disk;
 	struct nvme_id_ns *id;
 	char disk_name[DISK_NAME_LEN];
-	int node = dev_to_node(ctrl->dev), flags = GENHD_FL_EXT_DEVT;
+	int node = ctrl->numa_node, flags = GENHD_FL_EXT_DEVT;
 
 	ns = kzalloc_node(sizeof(*ns), GFP_KERNEL, node);
 	if (!ns)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 0b70c8bab045..a22ff6fb82bc 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2433,7 +2433,7 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
 	ctrl->tag_set.ops = &nvme_fc_mq_ops;
 	ctrl->tag_set.queue_depth = ctrl->ctrl.opts->queue_size;
 	ctrl->tag_set.reserved_tags = 1; /* fabric connect */
-	ctrl->tag_set.numa_node = NUMA_NO_NODE;
+	ctrl->tag_set.numa_node = ctrl->ctrl.numa_node;
 	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	ctrl->tag_set.cmd_size =
 		struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
@@ -3000,6 +3000,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 
 	ctrl->ctrl.opts = opts;
 	ctrl->ctrl.nr_reconnects = 0;
+	ctrl->ctrl.numa_node = dev_to_node(lport->dev);
 	INIT_LIST_HEAD(&ctrl->ctrl_list);
 	ctrl->lport = lport;
 	ctrl->rport = rport;
@@ -3038,7 +3039,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	ctrl->admin_tag_set.ops = &nvme_fc_admin_mq_ops;
 	ctrl->admin_tag_set.queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 	ctrl->admin_tag_set.reserved_tags = 2; /* fabric connect + Keep-Alive */
-	ctrl->admin_tag_set.numa_node = NUMA_NO_NODE;
+	ctrl->admin_tag_set.numa_node = ctrl->ctrl.numa_node;
 	ctrl->admin_tag_set.cmd_size =
 		struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
 			    ctrl->lport->ops->fcprqst_priv_sz);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5e3cc8c59a39..8e03cda770c5 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -141,7 +141,7 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 		    test_bit(NVME_NS_ANA_PENDING, &ns->flags))
 			continue;
 
-		distance = node_distance(node, dev_to_node(ns->ctrl->dev));
+		distance = node_distance(node, ns->ctrl->numa_node);
 
 		switch (ns->ana_state) {
 		case NVME_ANA_OPTIMIZED:
@@ -276,7 +276,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
 	if (!(ctrl->subsys->cmic & (1 << 1)) || !multipath)
 		return 0;
 
-	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
+	q = blk_alloc_queue_node(GFP_KERNEL, ctrl->numa_node, NULL);
 	if (!q)
 		goto out;
 	q->queuedata = head;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cee79cb388af..f608fc11d329 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -153,6 +153,7 @@ struct nvme_ctrl {
 	struct request_queue *connect_q;
 	struct device *dev;
 	int instance;
+	int numa_node;
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d181cafedc58..4468d672ced9 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -693,7 +693,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->ops = &nvme_rdma_admin_mq_ops;
 		set->queue_depth = NVME_AQ_MQ_TAG_DEPTH;
 		set->reserved_tags = 2; /* connect + keep-alive */
-		set->numa_node = NUMA_NO_NODE;
+		set->numa_node = nctrl->numa_node;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 			SG_CHUNK_SIZE * sizeof(struct scatterlist);
 		set->driver_data = ctrl;
@@ -706,7 +706,7 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->ops = &nvme_rdma_mq_ops;
 		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = 1; /* fabric connect */
-		set->numa_node = NUMA_NO_NODE;
+		set->numa_node = nctrl->numa_node;
 		set->flags = BLK_MQ_F_SHOULD_MERGE;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 			SG_CHUNK_SIZE * sizeof(struct scatterlist);
@@ -762,6 +762,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		return error;
 
 	ctrl->device = ctrl->queues[0].device;
+	ctrl->ctrl.numa_node = dev_to_node(ctrl->device->dev->dma_device);
 
 	ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev);
 
-- 
2.19.1


* [PATCH 2/3] nvme-multipath: Select paths based on NUMA locality
  2018-11-02  9:56 ` [PATCH 2/3] nvme-multipath: Select paths based on NUMA locality Hannes Reinecke
@ 2018-11-08  9:32   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-08  9:32 UTC (permalink / raw)


On Fri, Nov 02, 2018 at 10:56:40AM +0100, Hannes Reinecke wrote:
> This patch creates a per-controller map to hold the NUMA locality
> information. With that we can route I/O to the controller which is
> 'nearest' to the issuing CPU and decrease the latency there.

Well, that isn't really true.  The only thing it does is that it
caches the result of 'node_distance(node, ns->ctrl->numa_node)'
for each [node, ctrl] tuple, and adds a sysfs file to export that.

Please update the description.  This also means that on its own the
patch isn't all that useful.

> +void nvme_set_ctrl_node(struct nvme_ctrl *ctrl, int numa_node)
> +{
> +	ctrl->numa_node = numa_node;
> +	if (numa_node == NUMA_NO_NODE)
> +		return;

I don't think adding special cases for this degenerate case is a good
idea.  Or is there a really good reason for keeping the !ctrl->node_map
case around?

> +	ctrl->node_map = kzalloc(num_possible_nodes() * sizeof(int),
> +				 GFP_KERNEL);

This should use kcalloc, and needs to check the return value.
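
E.g. something like this (untested sketch; with nvme_set_ctrl_node
returning void the failure case just skips the allocation and relies
on the !ctrl->node_map fallback below):

	ctrl->node_map = kcalloc(num_possible_nodes(), sizeof(int),
				 GFP_KERNEL);
	if (!ctrl->node_map)
		return;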

> -		distance = node_distance(node, ns->ctrl->numa_node);
> +		distance = ns->ctrl->node_map ?
> +			ns->ctrl->node_map[node] : INT_MAX;

If we have to keep the special case please make this a proper if/else.
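
I.e. something like:

		if (ns->ctrl->node_map)
			distance = ns->ctrl->node_map[node];
		else
			distance = INT_MAX;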

> +void nvme_mpath_balance_subsys(struct nvme_subsystem *subsys)

This name seems a little odd.  Why not
nvme_mpath_calculate_node_distances or something similar that
describes what it actually does?

> +{
> +	struct nvme_ctrl *ctrl;
> +	int node;
> +
> +	mutex_lock(&subsys->lock);
> +
> +	/*
> +	 * Reset set NUMA distance
> +	 *    During creation the NUMA distance is only set
> +	 *    per controller, so after connecting the other
> +	 *    controllers the NUMA information on the existing
> +	 *    ones is incorrect.
> +	 */

Please keep the comment above the function.

> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> +		for_each_node(node) {
> +			if (!ctrl->node_map)
> +				continue;

If we want to keep the special case the !ctrl->node_map check should
be moved outside the inner loop.


* [PATCH 1/3] nvme: NUMA locality information for fabrics
  2018-11-08  9:22   ` Christoph Hellwig
@ 2018-11-08  9:35     ` Hannes Reinecke
  0 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-08  9:35 UTC (permalink / raw)


On 11/8/18 10:22 AM, Christoph Hellwig wrote:
> This patch looks good, but the dscription seems extremely misleading.
> 
> I've rewritten it as below for inclusion in nvme-4.21, let me know
> what you think:
> 
> ---
> From 01d2c22c9d6cd2f0cb2ac2ba5cf98c2ac2d8624e Mon Sep 17 00:00:00 2001
> From: Hannes Reinecke <hare@suse.com>
> Date: Fri, 2 Nov 2018 10:56:39 +0100
> Subject: nvme: add a numa_node field to struct nvme_ctrl
> 
> Instead of directly poking into the struct device add a new numa_node
> field to struct nvme_ctrl.  This allows fabrics drivers where ctrl->dev
> is a virtual device to support NUMA affinity as well.
> 
> Also expose the field as a sysfs attribute, and populate it for the
> RDMA and FC transports.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/nvme/host/core.c      | 4 +++-
>   drivers/nvme/host/fc.c        | 5 +++--
>   drivers/nvme/host/multipath.c | 4 ++--
>   drivers/nvme/host/nvme.h      | 1 +
>   drivers/nvme/host/rdma.c      | 5 +++--
>   5 files changed, 12 insertions(+), 7 deletions(-)
Works for me.

Cheers,

Hannes


* [PATCH 3/3] nvme-multipath: automatic NUMA path balancing
  2018-11-02  9:56 ` [PATCH 3/3] nvme-multipath: automatic NUMA path balancing Hannes Reinecke
@ 2018-11-08  9:36   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-08  9:36 UTC (permalink / raw)


> +void nvme_mpath_distribute_paths(struct nvme_subsystem *subsys, int num_ctrls,
> +				 struct nvme_ctrl *ctrl, int numa_node)
> +{

This function needs a comment describing it, as it isn't exactly
obvious.

> +	int node;
> +	int found_node = NUMA_NO_NODE;
> +	int max = LOCAL_DISTANCE * num_ctrls;
> +
> +	for_each_node(node) {
> +		struct nvme_ctrl *c;
> +		int sum = 0;
> +
> +		list_for_each_entry(c, &subsys->ctrls, subsys_entry)
> +			sum += c->node_map[node];
> +		if (sum > max) {
> +			max = sum;
> +			found_node = node;
> +		}
> +	}

E.g. I really don't get the LOCAL_DISTANCE magic at all.

> +void nvme_mpath_balance_node(struct nvme_subsystem *subsys,
> +			     int num_ctrls, int numa_node)
> +{
> +	struct nvme_ctrl *found = NULL, *ctrl;
> +	int max = LOCAL_DISTANCE * num_ctrls, node;

Same here.

>  void nvme_mpath_balance_subsys(struct nvme_subsystem *subsys)
>  {
>  	struct nvme_ctrl *ctrl;
> +	int num_ctrls = 0;
>  	int node;
>  
>  	mutex_lock(&subsys->lock);
>  
>  	/*
> -	 * Reset set NUMA distance
> +	 * 1. Reset set NUMA distance
>  	 *    During creation the NUMA distance is only set
>  	 *    per controller, so after connecting the other
>  	 *    controllers the NUMA information on the existing
> @@ -280,7 +325,49 @@ void nvme_mpath_balance_subsys(struct nvme_subsystem *subsys)

Ok, this function and the comments make a whole lot more sense
with the new patch.  I think your intention might be much more clear
if you merge the two patches.

> +	/*
> +	 * 2. Distribute optimal paths:
> +	 *    Only one primary paths per node.
> +	 *    Additional primary paths are moved to unassigned nodes.
> +	 */

Btw, what do you mean by 'primary' path? We don't really use that
terminology anywhere else.


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-02  9:56 [PATCHv3 0/3] nvme: NUMA locality for fabrics Hannes Reinecke
                   ` (2 preceding siblings ...)
  2018-11-02  9:56 ` [PATCH 3/3] nvme-multipath: automatic NUMA path balancing Hannes Reinecke
@ 2018-11-16  8:12 ` Christoph Hellwig
  2018-11-16  8:21   ` Hannes Reinecke
  3 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-16  8:12 UTC (permalink / raw)


Before we get into all the other craziness on the list, can I get
a respin of this one that is actually useful first?

On Fri, Nov 02, 2018 at 10:56:38AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> here's a patchset to leverage NUMA locality information for fabric controllers.
> This is the second attempt at doing so; after discussion with hch we came
> to the conclusion that the manual configuration approach in the initial
> submission would only lead to more confusion and suboptimal setups.
> 
> So here's now a version with automatic NUMA balancing, where we attempt
> to split the submitting CPUs/cores evenly across the available
> controllers.
> 
> With this patchset I'm seeing a performance increase from
> 262k IOPS to 344k IOPS, measured against a NetApp AF700.
> 
> As usual, comments and reviews are welcome.
> 
> Changes to v2:
> - use 'numa_node' instead of 'node_id' (suggested by Sagi)
> - rediff patches for better readability
> 
> Hannes Reinecke (3):
>   nvme: NUMA locality information for fabrics
>   nvme-multipath: Select paths based on NUMA locality
>   nvme-multipath: automatic NUMA path balancing
> 
>  drivers/nvme/host/core.c      |  36 ++++++++++++-
>  drivers/nvme/host/fc.c        |   5 +-
>  drivers/nvme/host/multipath.c | 119 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/nvme/host/nvme.h      |   3 ++
>  drivers/nvme/host/rdma.c      |   6 ++-
>  5 files changed, 161 insertions(+), 8 deletions(-)
> 
> -- 
> 2.16.4
---end quoted text---


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-16  8:12 ` [PATCHv3 0/3] nvme: NUMA locality for fabrics Christoph Hellwig
@ 2018-11-16  8:21   ` Hannes Reinecke
  2018-11-16  8:23     ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-16  8:21 UTC (permalink / raw)


On 11/16/18 9:12 AM, Christoph Hellwig wrote:
> Before we get into all the other craziness on the list, can I get
> a respin of this one that is actually useful first?
> 
On Fri, Nov 02, 2018 at 10:56:38AM +0100, Hannes Reinecke wrote:
>> Hi all,
>>
>> here's a patchset to leverage NUMA locality information for fabric controllers.
>> This is the second attempt at doing so; after discussion with hch we came
>> to the conclusion that the manual configuration approach in the initial
>> submission would only lead to more confusion and suboptimal setups.
>>
>> So here's now a version with automatic NUMA balancing, where we attempt
>> to split the submitting CPUs/cores evenly across the available
>> controllers.
>>
>> With this patchset I'm seeing a performance increase from
>> 262k IOPS to 344k IOPS, measured against a NetApp AF700.
>>
>> As usual, comments and reviews are welcome.
>>
>> Changes to v2:
>> - use 'numa_node' instead of 'node_id' (suggested by Sagi)
>> - rediff patches for better readability
>>
>> Hannes Reinecke (3):
>>    nvme: NUMA locality information for fabrics
>>    nvme-multipath: Select paths based on NUMA locality
>>    nvme-multipath: automatic NUMA path balancing
>>
>>   drivers/nvme/host/core.c      |  36 ++++++++++++-
>>   drivers/nvme/host/fc.c        |   5 +-
>>   drivers/nvme/host/multipath.c | 119 +++++++++++++++++++++++++++++++++++++++++-
>>   drivers/nvme/host/nvme.h      |   3 ++
>>   drivers/nvme/host/rdma.c      |   6 ++-
>>   5 files changed, 161 insertions(+), 8 deletions(-)
>>
>> -- 
>> 2.16.4
> ---end quoted text---
> 
If we can agree that we'll go for the round-robin scheduler I guess this 
patchset can be retracted (for now).
There still might be some gain to be had, but it can be implemented as 
an additional I/O policy.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-16  8:21   ` Hannes Reinecke
@ 2018-11-16  8:23     ` Christoph Hellwig
  2018-11-19 22:31       ` Sagi Grimberg
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-16  8:23 UTC (permalink / raw)


On Fri, Nov 16, 2018 at 09:21:46AM +0100, Hannes Reinecke wrote:
> If we can agree that we'll go for the round-robin scheduler I guess this 
> patchset can be retracted (for now).
> There still might be some gain to be had, but it can be implemented as an 
> additional I/O policy.

I don't think we can.  Round-robin is fundamentally the wrong thing
to do as you keep touching different cache lines.  If a customer really
insists on round robin they are probably better off in a SCSI world.


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-16  8:23     ` Christoph Hellwig
@ 2018-11-19 22:31       ` Sagi Grimberg
  2018-11-20  6:12         ` Hannes Reinecke
  0 siblings, 1 reply; 19+ messages in thread
From: Sagi Grimberg @ 2018-11-19 22:31 UTC (permalink / raw)



>> If we can agree that we'll go for the round-robin scheduler I guess this
>> patchset can be retracted (for now).
>> There still might be some gain to be had, but it can be implemented as an
>> additional I/O policy.
> 
> I don't think we can.  Round-robin is fundamentally the wrong thing
> to do as you keep touching different cache lines.  If a customer really
> insists on round robin they are probably better off in a SCSI world.

While we often think that link aggregation is a thing of the past as
adapters keep getting faster, I guess that in some cases it can buy us
some benefits. Personally, as long as the round-robin policy stays ANA
aware I don't see the harm in offering it.

And I agree that it's often the wrong thing to do, which is why we
should never let it be the default policy.

What I'm struggling with a bit is the implementation, which is somewhat
cumbersome IMO with the constant clearing of current_path. In this
policy there is no current_path really...


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-19 22:31       ` Sagi Grimberg
@ 2018-11-20  6:12         ` Hannes Reinecke
  2018-11-20  9:41           ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-20  6:12 UTC (permalink / raw)


On 11/19/18 11:31 PM, Sagi Grimberg wrote:
> 
>>> If we can agree that we'll go for the round-robin scheduler I guess this
>>> patchset can be retracted (for now).
>>> There still might be some gain to be had, but it can be implemented as
>>> an additional I/O policy.
>>
>> I don't think we can.  Round-robin is fundamentally the wrong thing
>> to do as you keep touching different cache lines.  If a customer really
>> insists on round robin they are probably better off in a SCSI world.
> 
> While we often think that link aggregation is a thing of the past as
> adapters keep getting faster, I guess that in some cases it can buy us
> some benefits. Personally, as long as the round-robin policy stays ANA
> aware I don't see the harm in offering it.
> 
Fully agreed here.
It all comes down to the link latency.
If the link latency is the main bottleneck multipathing will benefit 
from round-robin (or any I/O scheduler, for that matter).
And this is not just relevant for 'legacy' hardware; I've seen a 
performance benefit on a 32G FC setup, which is pretty much state of the 
art currently.

> And I agree that it's often the wrong thing to do, which is why we
> should never let it be the default policy.
> 
... Which is why I made it optional ...

> What I'm struggling with a bit is the implementation, which is somewhat
> cumbersome IMO with the constant clearing of current_path. In this
> policy there is no current_path really...

The 'current_path' setting serves neatly as a marker for where to start
the search in the next round.
Without setting it I would have had to introduce yet another variable;
with this approach I can model it without having to add additional state.

Cheers,

Hannes


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-20  6:12         ` Hannes Reinecke
@ 2018-11-20  9:41           ` Christoph Hellwig
  2018-11-20 15:47             ` Keith Busch
                               ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-20  9:41 UTC (permalink / raw)


On Tue, Nov 20, 2018 at 07:12:47AM +0100, Hannes Reinecke wrote:
> Fully agreed here.
> It all comes down to the link latency.
> If the link latency is the main bottleneck multipathing will benefit from 
> round-robin (or any I/O scheduler, for that matter).

It still makes a lot more sense to try to have queues with an affinity
to a given path rather than doing round robin IFF you care about latency.

If your latency sucks anyway round robin makes sense.  But why do you
use nvme on such a horrible interconnect anyway?

> And this is not just relevant for 'legacy' hardware; I've seen a 
> performance benefit on a 32G FC setup, which is pretty much state of the 
> art currently.

Well, FC is legacy no matter which link speed.


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-20  9:41           ` Christoph Hellwig
@ 2018-11-20 15:47             ` Keith Busch
  2018-11-20 19:27               ` James Smart
  2018-11-20 16:21             ` Hannes Reinecke
  2018-11-20 18:12             ` James Smart
  2 siblings, 1 reply; 19+ messages in thread
From: Keith Busch @ 2018-11-20 15:47 UTC (permalink / raw)


On Tue, Nov 20, 2018 at 01:41:40AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 20, 2018 at 07:12:47AM +0100, Hannes Reinecke wrote:
> > Fully agreed here.
> > It all comes down to the link latency.
> > If the link latency is the main bottleneck multipathing will benefit from 
> > round-robin (or any I/O scheduler, for that matter).
> 
> It still makes a lot more sense to try to have queues with an affinity
> to a given path rather than doing round robin IFF you care about latency.
> 
> If your latency sucks anyway round robin makes sense.  But why do you
> use nvme on such a horrible interconnect anyway?
> 
> > And this is not just relevant for 'legacy' hardware; I've seen a 
> > performance benefit on a 32G FC setup, which is pretty much state of the 
> > art currently.
> 
> Well, FC is legacy no matter which link speed.

Fabrics has a required Identify Controller field called "MAXCMD" which
is a per-queue limiter. A sufficiently shallow value with a high queue
depth workload could justify round-robin'ing your queues.
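
(As a made-up example: with MAXCMD=128 a single queue caps you at 128
outstanding commands, so a workload that wants to keep 512 commands in
flight needs at least four queues - or paths - to avoid throttling.)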


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-20  9:41           ` Christoph Hellwig
  2018-11-20 15:47             ` Keith Busch
@ 2018-11-20 16:21             ` Hannes Reinecke
  2018-11-20 18:12             ` James Smart
  2 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2018-11-20 16:21 UTC (permalink / raw)


On 11/20/18 10:41 AM, Christoph Hellwig wrote:
On Tue, Nov 20, 2018 at 07:12:47AM +0100, Hannes Reinecke wrote:
>> Fully agreed here.
>> It all comes down to the link latency.
>> If the link latency is the main bottleneck multipathing will benefit from
>> round-robin (or any I/O scheduler, for that matter).
> 
> It still makes a lot more sense to try to have queues with an affinity
> to a given path rather than doing round robin IFF you care about latency.
> 
That's what I tried initially, but the implementation and the 
distribution algorithm turned out to be far from simple.

> If your latency sucks anyway round robin makes sense.  But why do you
> use nvme on such a horrible interconnect anyway?
> 
Hmm.
Because the target implements it?
And I didn't control the target implementation in this case?

>> And this is not just relevant for 'legacy' hardware; I've seen a
>> performance benefit on a 32G FC setup, which is pretty much state of the
>> art currently.
> 
> Well, FC is legacy no matter which link speed.
> 
Oh, for crying out loud.

If you can't be bothered to support FC-NVMe, fine, you don't have to.
But others might, and might not be so fortunate as to have a choice 
about _what_ they want to support.

Hannes


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-20  9:41           ` Christoph Hellwig
  2018-11-20 15:47             ` Keith Busch
  2018-11-20 16:21             ` Hannes Reinecke
@ 2018-11-20 18:12             ` James Smart
  2 siblings, 0 replies; 19+ messages in thread
From: James Smart @ 2018-11-20 18:12 UTC (permalink / raw)




On 11/20/2018 1:41 AM, Christoph Hellwig wrote:
On Tue, Nov 20, 2018 at 07:12:47AM +0100, Hannes Reinecke wrote:
>> Fully agreed here.
>> It all comes down to the link latency.
>> If the link latency is the main bottleneck multipathing will benefit from
>> round-robin (or any I/O scheduler, for that matter).
> It still makes a lot more sense to try to have queues with an affinity
> to a given path rather than doing round robin IFF you care about latency.
>
> If your latency sucks anyway round robin makes sense.  But why do you
> use nvme on such a horrible interconnect anyway?

you really need to resurvey what enterprises are using and what is being 
achieved nowadays. I hope this prejudice doesn't show up in maintainership.


>
>> And this is not just relevant for 'legacy' hardware; I've seen a
>> performance benefit on a 32G FC setup, which is pretty much state of the
>> art currently.
> Well, FC is legacy no matter which link speed.
>

legacy - but still supporting the new and proven enterprise storage 
technologies; real and large interoperable fabrics; lots of mgmt tools; 
huge existing device base; and features that other technologies do not 
have without special add-ons. Sometimes the 
grass-is-always-greener-over-there attitude doesn't pan out...

Thanks

-- james


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-20 15:47             ` Keith Busch
@ 2018-11-20 19:27               ` James Smart
  2018-11-21  8:36                 ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: James Smart @ 2018-11-20 19:27 UTC (permalink / raw)




On 11/20/2018 7:47 AM, Keith Busch wrote:
> On Tue, Nov 20, 2018 at 01:41:40AM -0800, Christoph Hellwig wrote:
>> On Tue, Nov 20, 2018 at 07:12:47AM +0100, Hannes Reinecke wrote:
>>> Fully agreed here.
>>> It all comes down to the link latency.
>>> If the link latency is the main bottleneck multipathing will benefit from
>>> round-robin (or any I/O scheduler, for that matter).
>> It still makes a lot more sense to try to have queues with an affinity
>> to a given path rather than doing round robin IFF you care about latency.
>>
>> If your latency sucks anyway round robin makes sense.  But why do you
>> use nvme on such a horrible interconnect anyway?
>>
>>> And this is not just relevant for 'legacy' hardware; I've seen a
>>> performance benefit on a 32G FC setup, which is pretty much state of the
>>> art currently.
>> Well, FC is legacy no matter which link speed.

Christoph,

your statement on affinity seems to have some very specific implied 
boundaries which I would like to understand.  There's obviously some 
design principles and queue expectations that don't match the fabric 
implementations that are being put together, and they are not specific 
to FC. Do you mind enumerating some of your thoughts?

From statements such as "spend all our time optimizing every last cycle 
and cache line" you are most concerned about the absolute lowest 
latency for a single io as seen on a single cpu. As such, that drives 
for a highly constrained fabric and environment and ignores a lot of 
the "sharing" aspects. Yes, it can be supported, but there are much 
more general cases out there as well.

- What are the latencies that are meaningful? Is it solely single 
digit us? 10-30us? 1ms? 50ms? 100ms? How do things change if 
the latency changes?

- At what point does it become more important to get commands to the 
subsystem (via a different queue or queues on different controllers) so 
they can be worked on in parallel vs the latency of a single io? 
How is this need communicated to the nvme layer so we can make the 
right choices? Queue counts, queue size, and MAXCMD limits (thanks 
Keith) may cause throttles that increase this need. To that end - 
what are your expectations for queue size or MAXCMD limits vs the 
latencies vs load from a single cpu or a set of cpus?

- Must every cpu get a queue? What if the controller won't support a 
queue per cpu? How do things change if only 1 in 16 or fewer of the 
cpus get queues? Today, if fewer than cpu count queues are 
supported - aren't queues for the different controllers likely mapping 
to the same cpus? I don't think there's awareness of what queues on 
other controllers are present, so that there could be redundant paths 
mapped to cpus that aren't bound already. And if such logic were added, 
how does that affect the multipathing choices?

- What if application load is driven only by specific cpus - if a "cpu 
was given" to a specific task (constrained app, VMs, or containers; 
how would we know that?), how does that map if multiple cpus are sharing 
a queue? Will multipathing and load choices be made system-wide, 
specific to a cpu set, or per single cpu?

- What if a cpu is finally out of cycles due to load? How can we find 
out if the io must be limited to a cpu, or whether affinity can be 
subverted so that other idle cpus can share in the processing 
load for the queue(s) for that cpu?

If we're completely focused on cpu affinity with a queue, what happens 
when not all queues are equal? There is lots of talk of specific queues 
providing specific functionality that other queues wouldn't support. 
What if queues 1-M do writes with dedup, queues N-P do writes with 
compression, and Q-Z do accelerated writes to a local cache? How do you 
see the current linux implementation migrating to something like that?

-- james


* [PATCHv3 0/3] nvme: NUMA locality for fabrics
  2018-11-20 19:27               ` James Smart
@ 2018-11-21  8:36                 ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2018-11-21  8:36 UTC (permalink / raw)


On Tue, Nov 20, 2018 at 11:27:04AM -0800, James Smart wrote:
> - What are the latencies that are meaningful? Is it solely single digit
> us? 10-30us? 1ms? 50ms? 100ms? How do things change if the
> latency changes?

Well, we did spend a lot of effort on making sure sub-10us latencies
work.  For real life setups with multiple cpus in use the 100us order
of magnitude is probably more interesting.

> - At what point does it become more important to get commands to the
> subsystem (via a different queue or queues on different controllers) so
> they can be worked on in parallel vs the latency of a single io?
> How is this need communicated to the nvme layer so we can make the right
> choices? Queue counts, queue size, and MAXCMD limits (thanks Keith)
> may cause throttles that increase this need. To that end - what are
> your expectations for queue size or MAXCMD limits vs the latencies vs load
> from a single cpu or a set of cpus?

If you fully load the queue you are of course going to see massive
additional latency.  That's why you'll see typical NVMe PCIe devices
(or RDMA setups) massively overprovisioned in number of queues and/or
queue depth.

> - Must every cpu get a queue? What if the controller won't support a
> queue per cpu? How do things change if only 1 in 16 or fewer of the cpus
> get queues? Today, if fewer than cpu count queues are supported -
> aren't queues for the different controllers likely mapping to the same cpus?
> I don't think there's awareness of what queues on other controllers are
> present, so that there could be redundant paths mapped to cpus that aren't
> bound already. And if such logic were added, how does that affect the
> multipathing choices?

Less than cpu count queues are perfectly supported, we'll just start
sharing queues.  In general as long as you share queues between cores
on the same socket things still work reasonably fine.  Once you start
sharing a queue between sockets you are going to be in a world of pain.

> - What if application load is driven only by specific cpus - if a "cpu was
> given" to a specific task (constrained app, VMs, or containers; how would
> we know that?), how does that map if multiple cpus are sharing a queue?
> Will multipathing and load choices be made system-wide, specific to a cpu
> set, or per single cpu?

Right now we do multipathing decisions per node (aka per socket for
today's typical systems).

> - What if a cpu is finally out of cycles due to load? How can we find out
> if the io must be limited to a cpu, or whether affinity can be subverted so
> that other idle cpus can share in the processing load for the
> queue(s) for that cpu?

For the traditional interrupt driven model we really need to process
the interrupts on the submitting cpu to throttle.  For an interesting
model that moves the I/O completion load to another thread and thus a
potentially different cpu, look at the aio poll patches Jens just posted.

> If we're completely focused on cpu affinity with a queue, what happens when
> not all queues are equal? There is lots of talk of specific queues
> providing specific functionality that other queues wouldn't support. What
> if queues 1-M do writes with dedup, queues N-P do writes with compression,
> and Q-Z do accelerated writes to a local cache? How do you see the
> current linux implementation migrating to something like that?

Very unlikely.  We have support for a few queue types now in the for-4.21
block tree (default, reads, polling), but the above just sounds way too
magic.

