* [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
@ 2017-04-02 13:41 ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

This patch set aims to automatically find the optimal
queue <-> irq assignments for multi-queue storage ULPs
(demonstrated on nvme-rdma) based on the underlying rdma
device irq affinity settings.

The first two patches modify the mlx5 core driver to use the
generic API to allocate an array of irq vectors with automatic
affinity settings, instead of open-coding the same logic (and
doing it slightly worse).

Then, in order to obtain an affinity map for a given completion
vector, we expose a new RDMA core API, and implement it in mlx5.

The third part adds an rdma-based queue mapping helper to
blk-mq that maps the tagset hctx's according to the device
affinity mappings.
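
To make this concrete: a ULP ends up wiring the mapping roughly
like the sketch below (illustrative only; ulp_ctrl and its ibdev
field are made-up names, the real nvme-rdma hookup is in patch 6):

	#include <linux/blk-mq-rdma.h>

	static int ulp_map_queues(struct blk_mq_tag_set *set)
	{
		struct ulp_ctrl *ctrl = set->driver_data;

		/* map each hctx to the CPUs with irq affinity for its vector */
		return blk_mq_rdma_map_queues(set, ctrl->ibdev, 0);
	}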

I'd happily convert some more drivers, but I'll need volunteers
to test as I don't have access to any other devices.

I cc'd netdev (and Saeed + Or) as this is where most of the
mlx5 core action takes place, so Saeed, I would love to hear
your feedback.

Any feedback is welcome.

Sagi Grimberg (6):
  mlx5: convert to generic pci_alloc_irq_vectors
  mlx5: move affinity hints assignments to generic code
  RDMA/core: expose affinity mappings per completion vector
  mlx5: support ->get_vector_affinity
  block: Add rdma affinity based queue mapping helper
  nvme-rdma: use intelligent affinity based queue mappings

 block/Kconfig                                      |   5 +
 block/Makefile                                     |   1 +
 block/blk-mq-rdma.c                                |  56 +++++++++++
 drivers/infiniband/hw/mlx5/main.c                  |  10 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     | 106 +++------------------
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |   1 -
 drivers/nvme/host/rdma.c                           |  13 +++
 include/linux/blk-mq-rdma.h                        |  10 ++
 include/linux/mlx5/driver.h                        |   2 -
 include/rdma/ib_verbs.h                            |  24 +++++
 14 files changed, 138 insertions(+), 108 deletions(-)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h

-- 
2.7.4

* [PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-02 13:41   ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

Now that we have generic code to allocate an array of irq
vectors, correctly spread their affinity, correctly handle cpu
hotplug events and more, we're much better off using it.
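
For reference, the conversion boils down to this pattern (a rough
sketch with error handling elided; min_vecs stands in for
MLX5_EQ_VEC_COMP_BASE + 1 here):

	/* before: driver-managed msix_entry array */
	for (i = 0; i < nvec; i++)
		priv->msix_arr[i].entry = i;
	nvec = pci_enable_msix_range(pdev, priv->msix_arr, min_vecs, nvec);
	irq = priv->msix_arr[vecidx].vector;

	/* after: the PCI core owns the vector table */
	nvec = pci_alloc_irq_vectors(pdev, min_vecs, nvec, PCI_IRQ_MSIX);
	irq = pci_irq_vector(pdev, vecidx);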

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |  9 ++----
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     | 33 ++++++++--------------
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  1 -
 include/linux/mlx5/driver.h                        |  1 -
 7 files changed, 17 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8ef64c4db2c2..eec0d172761e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -389,7 +389,7 @@ static void mlx5e_enable_async_events(struct mlx5e_priv *priv)
 static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
 {
 	clear_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLED, &priv->state);
-	synchronize_irq(mlx5_get_msix_vec(priv->mdev, MLX5_EQ_VEC_ASYNC));
+	synchronize_irq(pci_irq_vector(priv->mdev->pdev, MLX5_EQ_VEC_ASYNC));
 }
 
 static inline int mlx5e_get_wqe_mtt_sz(void)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ea5d8d37a75c..e2c33c493b89 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -575,7 +575,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 		 name, pci_name(dev->pdev));
 
 	eq->eqn = MLX5_GET(create_eq_out, out, eq_number);
-	eq->irqn = priv->msix_arr[vecidx].vector;
+	eq->irqn = pci_irq_vector(dev->pdev, vecidx);
 	eq->dev = dev;
 	eq->doorbell = priv->uar->map + MLX5_EQ_DOORBEL_OFFSET;
 	err = request_irq(eq->irqn, handler, 0,
@@ -610,7 +610,7 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 	return 0;
 
 err_irq:
-	free_irq(priv->msix_arr[vecidx].vector, eq);
+	free_irq(eq->irqn, eq);
 
 err_eq:
 	mlx5_cmd_destroy_eq(dev, eq->eqn);
@@ -651,11 +651,6 @@ int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
 }
 EXPORT_SYMBOL_GPL(mlx5_destroy_unmap_eq);
 
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx)
-{
-	return dev->priv.msix_arr[MLX5_EQ_VEC_ASYNC].vector;
-}
-
 int mlx5_eq_init(struct mlx5_core_dev *dev)
 {
 	int err;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index fcd5bc7e31db..6bf5d70b4117 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -1596,7 +1596,7 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, int vport_num)
 	/* Mark this vport as disabled to discard new events */
 	vport->enabled = false;
 
-	synchronize_irq(mlx5_get_msix_vec(esw->dev, MLX5_EQ_VEC_ASYNC));
+	synchronize_irq(pci_irq_vector(esw->dev->pdev, MLX5_EQ_VEC_ASYNC));
 	/* Wait for current already scheduled events to complete */
 	flush_workqueue(esw->work_queue);
 	/* Disable events from this vport */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index d0515391d33b..8b38d5cfd4c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -80,7 +80,7 @@ static void trigger_cmd_completions(struct mlx5_core_dev *dev)
 	u64 vector;
 
 	/* wait for pending handlers to complete */
-	synchronize_irq(dev->priv.msix_arr[MLX5_EQ_VEC_CMD].vector);
+	synchronize_irq(pci_irq_vector(dev->pdev, MLX5_EQ_VEC_CMD));
 	spin_lock_irqsave(&dev->cmd.alloc_lock, flags);
 	vector = ~dev->cmd.bitmask & ((1ul << (1 << dev->cmd.log_sz)) - 1);
 	if (!vector)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index e2bd600d19de..7c8672cbb369 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -308,13 +308,12 @@ static void release_bar(struct pci_dev *pdev)
 	pci_release_regions(pdev);
 }
 
-static int mlx5_enable_msix(struct mlx5_core_dev *dev)
+static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	struct mlx5_eq_table *table = &priv->eq_table;
 	int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
 	int nvec;
-	int i;
 
 	nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
 	       MLX5_EQ_VEC_COMP_BASE;
@@ -322,17 +321,13 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
 	if (nvec <= MLX5_EQ_VEC_COMP_BASE)
 		return -ENOMEM;
 
-	priv->msix_arr = kcalloc(nvec, sizeof(*priv->msix_arr), GFP_KERNEL);
-
 	priv->irq_info = kcalloc(nvec, sizeof(*priv->irq_info), GFP_KERNEL);
-	if (!priv->msix_arr || !priv->irq_info)
+	if (!priv->irq_info)
 		goto err_free_msix;
 
-	for (i = 0; i < nvec; i++)
-		priv->msix_arr[i].entry = i;
-
-	nvec = pci_enable_msix_range(dev->pdev, priv->msix_arr,
-				     MLX5_EQ_VEC_COMP_BASE + 1, nvec);
+	nvec = pci_alloc_irq_vectors(dev->pdev,
+			MLX5_EQ_VEC_COMP_BASE + 1, nvec,
+			PCI_IRQ_MSIX);
 	if (nvec < 0)
 		return nvec;
 
@@ -342,7 +337,6 @@ static int mlx5_enable_msix(struct mlx5_core_dev *dev)
 
 err_free_msix:
 	kfree(priv->irq_info);
-	kfree(priv->msix_arr);
 	return -ENOMEM;
 }
 
@@ -350,9 +344,8 @@ static void mlx5_disable_msix(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 
-	pci_disable_msix(dev->pdev);
+	pci_free_irq_vectors(dev->pdev);
 	kfree(priv->irq_info);
-	kfree(priv->msix_arr);
 }
 
 struct mlx5_reg_host_endianess {
@@ -610,8 +603,7 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
 static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
 	struct mlx5_priv *priv  = &mdev->priv;
-	struct msix_entry *msix = priv->msix_arr;
-	int irq                 = msix[i + MLX5_EQ_VEC_COMP_BASE].vector;
+	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
 	int err;
 
 	if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
@@ -639,8 +631,7 @@ static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
 static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
 {
 	struct mlx5_priv *priv  = &mdev->priv;
-	struct msix_entry *msix = priv->msix_arr;
-	int irq                 = msix[i + MLX5_EQ_VEC_COMP_BASE].vector;
+	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
 
 	irq_set_affinity_hint(irq, NULL);
 	free_cpumask_var(priv->irq_info[i].mask);
@@ -763,8 +754,8 @@ static int alloc_comp_eqs(struct mlx5_core_dev *dev)
 		}
 
 #ifdef CONFIG_RFS_ACCEL
-		irq_cpu_rmap_add(dev->rmap,
-				 dev->priv.msix_arr[i + MLX5_EQ_VEC_COMP_BASE].vector);
+		irq_cpu_rmap_add(dev->rmap, pci_irq_vector(dev->pdev,
+				 MLX5_EQ_VEC_COMP_BASE + i));
 #endif
 		snprintf(name, MLX5_MAX_IRQ_NAME, "mlx5_comp%d", i);
 		err = mlx5_create_map_eq(dev, eq,
@@ -1101,9 +1092,9 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_stop_poll;
 	}
 
-	err = mlx5_enable_msix(dev);
+	err = mlx5_alloc_irq_vectors(dev);
 	if (err) {
-		dev_err(&pdev->dev, "enable msix failed\n");
+		dev_err(&pdev->dev, "alloc irq vectors failed\n");
 		goto err_cleanup_once;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index b3dabe6e8836..42bfcf20d875 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -109,7 +109,6 @@ int mlx5_destroy_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
 					u32 element_id);
 int mlx5_wait_for_vf_pages(struct mlx5_core_dev *dev);
 u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev);
-u32 mlx5_get_msix_vec(struct mlx5_core_dev *dev, int vecidx);
 struct mlx5_eq *mlx5_eqn2eq(struct mlx5_core_dev *dev, int eqn);
 void mlx5_cq_tasklet_cb(unsigned long data);
 
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2fcff6b4503f..a9891df94ce0 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -589,7 +589,6 @@ struct mlx5_port_module_event_stats {
 struct mlx5_priv {
 	char			name[MLX5_MAX_NAME_LEN];
 	struct mlx5_eq_table	eq_table;
-	struct msix_entry	*msix_arr;
 	struct mlx5_irq_info	*irq_info;
 
 	/* pages stuff */
-- 
2.7.4

* [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-02 13:41   ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

The generic API takes care of spreading affinity similarly to
what mlx5 open-coded (and even handles asymmetric
configurations better). Ask the generic API to spread affinity
for us, and feed it the pre_vectors that do not participate in
affinity settings (which is an improvement over what we had
before).

The affinity assignments should match what mlx5 tried to do
earlier, except that we no longer set affinity on the dedicated
async, cmd and pages vectors.
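
The heart of the change is simply passing an irq_affinity
description so the PCI core skips the control vectors when
spreading (sketch taken from the diff below):

	struct irq_affinity irqdesc = {
		/* async, cmd and pages vectors get no spread affinity */
		.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
	};

	nvec = pci_alloc_irq_vectors_affinity(pdev,
			MLX5_EQ_VEC_COMP_BASE + 1, nvec,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &irqdesc);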

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c    | 81 ++---------------------
 include/linux/mlx5/driver.h                       |  1 -
 3 files changed, 6 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index eec0d172761e..2bab0e1ceb94 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1375,7 +1375,8 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
 
 static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
 {
-	return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
+	return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
+			MLX5_EQ_VEC_COMP_BASE + ix));
 }
 
 static int mlx5e_open_tx_cqs(struct mlx5e_channel *c,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 7c8672cbb369..8624a7451064 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -312,6 +312,7 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 {
 	struct mlx5_priv *priv = &dev->priv;
 	struct mlx5_eq_table *table = &priv->eq_table;
+	struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };
 	int num_eqs = 1 << MLX5_CAP_GEN(dev, log_max_eq);
 	int nvec;
 
@@ -325,9 +326,10 @@ static int mlx5_alloc_irq_vectors(struct mlx5_core_dev *dev)
 	if (!priv->irq_info)
 		goto err_free_msix;
 
-	nvec = pci_alloc_irq_vectors(dev->pdev,
+	nvec = pci_alloc_irq_vectors_affinity(dev->pdev,
 			MLX5_EQ_VEC_COMP_BASE + 1, nvec,
-			PCI_IRQ_MSIX);
+			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
+			&irqdesc);
 	if (nvec < 0)
 		return nvec;
 
@@ -600,71 +602,6 @@ u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev)
 	return (u64)timer_l | (u64)timer_h1 << 32;
 }
 
-static int mlx5_irq_set_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
-	struct mlx5_priv *priv  = &mdev->priv;
-	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-	int err;
-
-	if (!zalloc_cpumask_var(&priv->irq_info[i].mask, GFP_KERNEL)) {
-		mlx5_core_warn(mdev, "zalloc_cpumask_var failed");
-		return -ENOMEM;
-	}
-
-	cpumask_set_cpu(cpumask_local_spread(i, priv->numa_node),
-			priv->irq_info[i].mask);
-
-	err = irq_set_affinity_hint(irq, priv->irq_info[i].mask);
-	if (err) {
-		mlx5_core_warn(mdev, "irq_set_affinity_hint failed,irq 0x%.4x",
-			       irq);
-		goto err_clear_mask;
-	}
-
-	return 0;
-
-err_clear_mask:
-	free_cpumask_var(priv->irq_info[i].mask);
-	return err;
-}
-
-static void mlx5_irq_clear_affinity_hint(struct mlx5_core_dev *mdev, int i)
-{
-	struct mlx5_priv *priv  = &mdev->priv;
-	int irq = pci_irq_vector(mdev->pdev, MLX5_EQ_VEC_COMP_BASE + i);
-
-	irq_set_affinity_hint(irq, NULL);
-	free_cpumask_var(priv->irq_info[i].mask);
-}
-
-static int mlx5_irq_set_affinity_hints(struct mlx5_core_dev *mdev)
-{
-	int err;
-	int i;
-
-	for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++) {
-		err = mlx5_irq_set_affinity_hint(mdev, i);
-		if (err)
-			goto err_out;
-	}
-
-	return 0;
-
-err_out:
-	for (i--; i >= 0; i--)
-		mlx5_irq_clear_affinity_hint(mdev, i);
-
-	return err;
-}
-
-static void mlx5_irq_clear_affinity_hints(struct mlx5_core_dev *mdev)
-{
-	int i;
-
-	for (i = 0; i < mdev->priv.eq_table.num_comp_vectors; i++)
-		mlx5_irq_clear_affinity_hint(mdev, i);
-}
-
 int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,
 		    unsigned int *irqn)
 {
@@ -1116,12 +1053,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 		goto err_stop_eqs;
 	}
 
-	err = mlx5_irq_set_affinity_hints(dev);
-	if (err) {
-		dev_err(&pdev->dev, "Failed to alloc affinity hint cpumask\n");
-		goto err_affinity_hints;
-	}
-
 	err = mlx5_init_fs(dev);
 	if (err) {
 		dev_err(&pdev->dev, "Failed to init flow steering\n");
@@ -1165,9 +1096,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_cleanup_fs(dev);
 
 err_fs:
-	mlx5_irq_clear_affinity_hints(dev);
-
-err_affinity_hints:
 	free_comp_eqs(dev);
 
 err_stop_eqs:
@@ -1234,7 +1162,6 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, struct mlx5_priv *priv,
 	mlx5_eswitch_detach(dev->priv.eswitch);
 #endif
 	mlx5_cleanup_fs(dev);
-	mlx5_irq_clear_affinity_hints(dev);
 	free_comp_eqs(dev);
 	mlx5_stop_eqs(dev);
 	mlx5_put_uars_page(dev, priv->uar);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index a9891df94ce0..4c7cf4dfb024 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -527,7 +527,6 @@ struct mlx5_core_sriov {
 };
 
 struct mlx5_irq_info {
-	cpumask_var_t mask;
 	char name[MLX5_MAX_IRQ_NAME];
 };
 
-- 
2.7.4

* [PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-02 13:41   ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

This will allow ULPs to intelligently place threads based on
completion vector cpu affinity mappings. In case the driver
does not expose a get_vector_affinity callout, return NULL so
the caller can maintain fallback logic.
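
A caller would typically use it along these lines (hypothetical
helper; the cpumask_local_spread() fallback is just one possible
policy):

	static int ulp_pick_cpu(struct ib_device *ibdev, int vec)
	{
		const struct cpumask *mask;

		mask = ib_get_vector_affinity(ibdev, vec);
		if (!mask)	/* no affinity info from the driver */
			return cpumask_local_spread(vec, NUMA_NO_NODE);

		return cpumask_first(mask);
	}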

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 include/rdma/ib_verbs.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0f1813c13687..d44b62791c64 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2150,6 +2150,8 @@ struct ib_device {
 	 */
 	int (*get_port_immutable)(struct ib_device *, u8, struct ib_port_immutable *);
 	void (*get_dev_fw_str)(struct ib_device *, char *str, size_t str_len);
+	const struct cpumask *(*get_vector_affinity)(struct ib_device *ibdev,
+						     int comp_vector);
 };
 
 struct ib_client {
@@ -3377,4 +3379,26 @@ void ib_drain_qp(struct ib_qp *qp);
 
 int ib_resolve_eth_dmac(struct ib_device *device,
 			struct ib_ah_attr *ah_attr);
+
+/**
+ * ib_get_vector_affinity - Get the affinity mappings of a given completion
+ *   vector
+ * @device:         the rdma device
+ * @comp_vector:    index of completion vector
+ *
+ * Returns NULL if the device driver does not implement get_vector_affinity
+ * or if @comp_vector is out of range, otherwise the cpu map of the given
+ * completion vector.
+ */
+static inline const struct cpumask *
+ib_get_vector_affinity(struct ib_device *device, int comp_vector)
+{
+	if (comp_vector >= device->num_comp_vectors ||
+	    !device->get_vector_affinity)
+		return NULL;
+
+	return device->get_vector_affinity(device, comp_vector);
+
+}
+
 #endif /* IB_VERBS_H */
-- 
2.7.4

* [PATCH rfc 4/6] mlx5: support ->get_vector_affinity
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-02 13:41   ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

Simply refer to the generic affinity mask helper.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/infiniband/hw/mlx5/main.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 4dc0a8785fe0..b12bc2294895 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -3319,6 +3319,15 @@ static int mlx5_ib_get_hw_stats(struct ib_device *ibdev,
 	return port->q_cnts.num_counters;
 }
 
+const struct cpumask *mlx5_ib_get_vector_affinity(struct ib_device *ibdev,
+		int comp_vector)
+{
+	struct mlx5_ib_dev *dev = to_mdev(ibdev);
+
+	return pci_irq_get_affinity(dev->mdev->pdev,
+			MLX5_EQ_VEC_COMP_BASE + comp_vector);
+}
+
 static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 {
 	struct mlx5_ib_dev *dev;
@@ -3449,6 +3458,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
 	dev->ib_dev.check_mr_status	= mlx5_ib_check_mr_status;
 	dev->ib_dev.get_port_immutable  = mlx5_port_immutable;
 	dev->ib_dev.get_dev_fw_str      = get_dev_fw_str;
+	dev->ib_dev.get_vector_affinity	= mlx5_ib_get_vector_affinity;
 	if (mlx5_core_is_pf(mdev)) {
 		dev->ib_dev.get_vf_config	= mlx5_ib_get_vf_config;
 		dev->ib_dev.set_vf_link_state	= mlx5_ib_set_vf_link_state;
-- 
2.7.4

* [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-02 13:41   ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

Like pci and virtio, we add an rdma helper for affinity
spreading. This achieves optimal mq affinity assignments
according to the underlying rdma device affinity maps.
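
As a made-up example, on a machine with 8 CPUs and 4 completion
vectors whose irq affinity pairs up adjacent CPUs, the resulting
set->mq_map would look like:

	CPU:	0  1  2  3  4  5  6  7
	hctx:	0  0  1  1  2  2  3  3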

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 block/Kconfig               |  5 ++++
 block/Makefile              |  1 +
 block/blk-mq-rdma.c         | 56 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq-rdma.h | 10 ++++++++
 4 files changed, 72 insertions(+)
 create mode 100644 block/blk-mq-rdma.c
 create mode 100644 include/linux/blk-mq-rdma.h

diff --git a/block/Kconfig b/block/Kconfig
index 89cd28f8d051..3ab42bbb06d5 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
 	depends on BLOCK && VIRTIO
 	default y
 
+config BLK_MQ_RDMA
+	bool
+	depends on BLOCK && INFINIBAND
+	default y
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 081bb680789b..4498603dbc83 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
 obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
 obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
 obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
+obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
 obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
 obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
new file mode 100644
index 000000000000..d402f7c93528
--- /dev/null
+++ b/block/blk-mq-rdma.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (c) 2017 Sagi Grimberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
+#include <rdma/ib_verbs.h>
+#include <linux/module.h>
+#include "blk-mq.h"
+
+/**
+ * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
+ * @set:	tagset to provide the mapping for
+ * @dev:	rdma device associated with @set.
+ * @first_vec:	first interrupt vector to use for queues (usually 0)
+ *
+ * This function assumes the rdma device @dev has at least as many available
+ * interrupt vectors as @set has queues.  It will then query its affinity
+ * masks and build a queue mapping that maps each queue to the CPUs that
+ * have irq affinity for the corresponding vector.
+ *
+ * In case either the driver passed a @dev with fewer vectors than
+ * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
+ * vector, we fall back to the naive mapping.
+ */
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+		struct ib_device *dev, int first_vec)
+{
+	const struct cpumask *mask;
+	unsigned int queue, cpu;
+
+	if (set->nr_hw_queues > dev->num_comp_vectors)
+		goto fallback;
+
+	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		mask = ib_get_vector_affinity(dev, first_vec + queue);
+		if (!mask)
+			goto fallback;
+
+		for_each_cpu(cpu, mask)
+			set->mq_map[cpu] = queue;
+	}
+
+	return 0;
+fallback:
+	return blk_mq_map_queues(set);
+}
+EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);
diff --git a/include/linux/blk-mq-rdma.h b/include/linux/blk-mq-rdma.h
new file mode 100644
index 000000000000..b4ade198007d
--- /dev/null
+++ b/include/linux/blk-mq-rdma.h
@@ -0,0 +1,10 @@
+#ifndef _LINUX_BLK_MQ_RDMA_H
+#define _LINUX_BLK_MQ_RDMA_H
+
+struct blk_mq_tag_set;
+struct ib_device;
+
+int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
+		struct ib_device *dev, int first_vec);
+
+#endif /* _LINUX_BLK_MQ_RDMA_H */
-- 
2.7.4

* [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-02 13:41   ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-02 13:41 UTC (permalink / raw)
  To: linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

Use the generic block layer affinity mapping helper. Also,
limit nr_hw_queues to the rdma device's number of irq vectors,
as we don't really need more.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/rdma.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 4aae363943e3..81ee5b1207c8 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -19,6 +19,7 @@
 #include <linux/string.h>
 #include <linux/atomic.h>
 #include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
 #include <linux/types.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -645,10 +646,14 @@ static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)
 static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
 {
 	struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
+	struct ib_device *ibdev = ctrl->device->dev;
 	unsigned int nr_io_queues;
 	int i, ret;
 
 	nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
+	nr_io_queues = min_t(unsigned int, nr_io_queues,
+				ibdev->num_comp_vectors);
+
 	ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
 	if (ret)
 		return ret;
@@ -1523,6 +1528,13 @@ static void nvme_rdma_complete_rq(struct request *rq)
 	nvme_complete_rq(rq);
 }
 
+static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
+{
+	struct nvme_rdma_ctrl *ctrl = set->driver_data;
+
+	return blk_mq_rdma_map_queues(set, ctrl->device->dev, 0);
+}
+
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
 	.complete	= nvme_rdma_complete_rq,
@@ -1532,6 +1544,7 @@ static const struct blk_mq_ops nvme_rdma_mq_ops = {
 	.init_hctx	= nvme_rdma_init_hctx,
 	.poll		= nvme_rdma_poll,
 	.timeout	= nvme_rdma_timeout,
+	.map_queues	= nvme_rdma_map_queues,
 };
 
 static const struct blk_mq_ops nvme_rdma_admin_mq_ops = {
-- 
2.7.4

* Re: [PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-04  6:27     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-04  6:32     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig

> @@ -1375,7 +1375,8 @@ static void mlx5e_close_cq(struct mlx5e_cq *cq)
>  
>  static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
>  {
> -	return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
> +	return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
> +			MLX5_EQ_VEC_COMP_BASE + ix));

This looks ok for now, but if we look at the callers we'd probably
want to make direct use of pci_irq_get_node and pci_irq_get_affinity for
the uses directly in mlx5e_open_channel as well as the stored away
->cpu field.  But maybe that should be left for another patch after
this one.
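
Something along these lines, I mean (a hypothetical sketch of such
a follow-up, not part of this series):

	/* allocate the channel on the node serviced by its irq vector */
	int node = pci_irq_get_node(priv->mdev->pdev,
				    MLX5_EQ_VEC_COMP_BASE + ix);

	c = kzalloc_node(sizeof(*c), GFP_KERNEL, node);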

> +	struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };

I usually move assignments inside structures onto a separate line to make
it more readable, e.g.

	struct irq_affinity irqdesc = {
		.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
	};

Otherwise this looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-04  6:32     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:32 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 4/6] mlx5: support ->get_vector_affinity
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-04  6:33     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-04  6:33     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig

On Sun, Apr 02, 2017 at 04:41:31PM +0300, Sagi Grimberg wrote:
> Like pci and virtio, we add a rdma helper for affinity
> spreading. This achieves optimal mq affinity assignments
> according to the underlying rdma device affinity maps.
> 
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  block/Kconfig               |  5 ++++
>  block/Makefile              |  1 +
>  block/blk-mq-rdma.c         | 56 +++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-mq-rdma.h | 10 ++++++++
>  4 files changed, 72 insertions(+)
>  create mode 100644 block/blk-mq-rdma.c
>  create mode 100644 include/linux/blk-mq-rdma.h
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 89cd28f8d051..3ab42bbb06d5 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -206,4 +206,9 @@ config BLK_MQ_VIRTIO
>  	depends on BLOCK && VIRTIO
>  	default y
>  
> +config BLK_MQ_RDMA
> +	bool
> +	depends on BLOCK && INFINIBAND
> +	default y
> +
>  source block/Kconfig.iosched
> diff --git a/block/Makefile b/block/Makefile
> index 081bb680789b..4498603dbc83 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_BLK_CMDLINE_PARSER)	+= cmdline-parser.o
>  obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o blk-integrity.o t10-pi.o
>  obj-$(CONFIG_BLK_MQ_PCI)	+= blk-mq-pci.o
>  obj-$(CONFIG_BLK_MQ_VIRTIO)	+= blk-mq-virtio.o
> +obj-$(CONFIG_BLK_MQ_RDMA)	+= blk-mq-rdma.o
>  obj-$(CONFIG_BLK_DEV_ZONED)	+= blk-zoned.o
>  obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
>  obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> new file mode 100644
> index 000000000000..d402f7c93528
> --- /dev/null
> +++ b/block/blk-mq-rdma.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (c) 2017 Sagi Grimberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +#include <linux/blk-mq.h>
> +#include <linux/blk-mq-rdma.h>
> +#include <rdma/ib_verbs.h>
> +#include <linux/module.h>
> +#include "blk-mq.h"
> +
> +/**
> + * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
> + * @set:	tagset to provide the mapping for
> + * @dev:	rdma device associated with @set.
> + * @first_vec:	first interrupt vectors to use for queues (usually 0)
> + *
> + * This function assumes the rdma device @dev has at least as many available
> + * interrupt vectors as @set has queues.  It will then query its affinity mask
> + * and build a queue mapping that maps a queue to the CPUs that have irq affinity
> + * for the corresponding vector.
> + *
> + * In case either the driver passed a @dev with fewer vectors than
> + * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
> + * vector, we fall back to the naive mapping.
> + */
> +int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
> +		struct ib_device *dev, int first_vec)
> +{
> +	const struct cpumask *mask;
> +	unsigned int queue, cpu;
> +
> +	if (set->nr_hw_queues > dev->num_comp_vectors)
> +		goto fallback;

maybe print a warning here?
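
Something along these lines, perhaps (wording illustrative; assumes
the ib_device's embedded struct device is usable for dev_warn):

	if (set->nr_hw_queues > dev->num_comp_vectors) {
		/* illustrative wording; &dev->dev is the embedded struct device */
		dev_warn(&dev->dev,
			 "nr_hw_queues %u > num_comp_vectors %d, using naive queue mapping\n",
			 set->nr_hw_queues, dev->num_comp_vectors);
		goto fallback;
	}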

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-04  6:34     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04  6:34 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed,
	Or Gerlitz, Christoph Hellwig

On Sun, Apr 02, 2017 at 04:41:32PM +0300, Sagi Grimberg wrote:
> Use the geneic block layer affinity mapping helper. Also,

          generic

>  	nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
> +	nr_io_queues = min_t(unsigned int, nr_io_queues,
> +				ibdev->num_comp_vectors);
> +

Add a comment here?
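
For instance (wording illustrative only):

	/*
	 * we map queues according to the device irq vectors, so
	 * don't ask for more queues than there are completion
	 * vectors available.
	 */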

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
@ 2017-04-04  7:46     ` Max Gurtovoy
  0 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2017-04-04  7:46 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig


> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> new file mode 100644
> index 000000000000..d402f7c93528
> --- /dev/null
> +++ b/block/blk-mq-rdma.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (c) 2017 Sagi Grimberg.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */

Shouldn't you include <linux/kobject.h> and <linux/blkdev.h>, like
commit 8ec2ef2b66ea2f did for blk-mq-pci.c?

> +#include <linux/blk-mq.h>
> +#include <linux/blk-mq-rdma.h>
> +#include <rdma/ib_verbs.h>
> +#include <linux/module.h>
> +#include "blk-mq.h"

Is this include needed ?


> +
> +/**
> + * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
> + * @set:	tagset to provide the mapping for
> + * @dev:	rdma device associated with @set.
> + * @first_vec:	first interrupt vectors to use for queues (usually 0)
> + *
> + * This function assumes the rdma device @dev has at least as many available
> + * interrupt vectors as @set has queues.  It will then query its affinity mask
> + * and build a queue mapping that maps a queue to the CPUs that have irq affinity
> + * for the corresponding vector.
> + *
> + * In case either the driver passed a @dev with fewer vectors than
> + * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
> + * vector, we fall back to the naive mapping.
> + */
> +int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
> +		struct ib_device *dev, int first_vec)
> +{
> +	const struct cpumask *mask;
> +	unsigned int queue, cpu;
> +
> +	if (set->nr_hw_queues > dev->num_comp_vectors)
> +		goto fallback;
> +
> +	for (queue = 0; queue < set->nr_hw_queues; queue++) {
> +		mask = ib_get_vector_affinity(dev, first_vec + queue);
> +		if (!mask)
> +			goto fallback;

Christoph,
we could use a fallback in blk-mq-pci.c as well, in case
pci_irq_get_affinity fails, right?

> +
> +		for_each_cpu(cpu, mask)
> +			set->mq_map[cpu] = queue;
> +	}
> +
> +	return 0;
> +fallback:
> +	return blk_mq_map_queues(set);
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_rdma_map_queues);

Otherwise, Looks good.

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
@ 2017-04-04  7:51   ` Max Gurtovoy
  0 siblings, 0 replies; 52+ messages in thread
From: Max Gurtovoy @ 2017-04-04  7:51 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

>
> Any feedback is welcome.

Hi Sagi,

The patchset looks good, and of course we can add support for more
drivers in the future.
Have you run any performance testing with the nvmf initiator?


>
> Sagi Grimberg (6):
>   mlx5: convert to generic pci_alloc_irq_vectors
>   mlx5: move affinity hints assignments to generic code
>   RDMA/core: expose affinity mappings per completion vector
>   mlx5: support ->get_vector_affinity
>   block: Add rdma affinity based queue mapping helper
>   nvme-rdma: use intelligent affinity based queue mappings
>
>  block/Kconfig                                      |   5 +
>  block/Makefile                                     |   1 +
>  block/blk-mq-rdma.c                                |  56 +++++++++++
>  drivers/infiniband/hw/mlx5/main.c                  |  10 ++
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |   5 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c       |   9 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/health.c   |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/main.c     | 106 +++------------------
>  .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |   1 -
>  drivers/nvme/host/rdma.c                           |  13 +++
>  include/linux/blk-mq-rdma.h                        |  10 ++
>  include/linux/mlx5/driver.h                        |   2 -
>  include/rdma/ib_verbs.h                            |  24 +++++
>  14 files changed, 138 insertions(+), 108 deletions(-)
>  create mode 100644 block/blk-mq-rdma.c
>  create mode 100644 include/linux/blk-mq-rdma.h
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
  2017-04-04  7:46     ` Max Gurtovoy
@ 2017-04-04 13:09       ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-04 13:09 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Sagi Grimberg, linux-rdma, linux-nvme, linux-block, netdev,
	Saeed Mahameed, Or Gerlitz, Christoph Hellwig

On Tue, Apr 04, 2017 at 10:46:54AM +0300, Max Gurtovoy wrote:
>> +	if (set->nr_hw_queues > dev->num_comp_vectors)
>> +		goto fallback;
>> +
>> +	for (queue = 0; queue < set->nr_hw_queues; queue++) {
>> +		mask = ib_get_vector_affinity(dev, first_vec + queue);
>> +		if (!mask)
>> +			goto fallback;
>
> Christoph,
> we could use a fallback in blk-mq-pci.c as well, in case
> pci_irq_get_affinity fails, right?

For PCI it shouldn't fail as the driver calling pci_irq_get_affinity
knows how it set up the interrupts.  So I don't think it's necessary there.
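
For reference, blk_mq_pci_map_queues simply errors out in that case
instead of falling back (roughly the current code):

	int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
	{
		const struct cpumask *mask;
		unsigned int queue, cpu;

		for (queue = 0; queue < set->nr_hw_queues; queue++) {
			mask = pci_irq_get_affinity(pdev, queue);
			if (!mask)
				return -EINVAL;

			for_each_cpu(cpu, mask)
				set->mq_map[cpu] = queue;
		}

		return 0;
	}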

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
  2017-04-02 13:41   ` Sagi Grimberg
@ 2017-04-05 14:17     ` Jens Axboe
  -1 siblings, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2017-04-05 14:17 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

On 04/02/2017 07:41 AM, Sagi Grimberg wrote:
> Like pci and virtio, we add a rdma helper for affinity
> spreading. This achieves optimal mq affinity assignments
> according to the underlying rdma device affinity maps.

Reviewed-by: Jens Axboe <axboe@fb.com>

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code
  2017-04-04  6:32     ` Christoph Hellwig
@ 2017-04-06  8:29       ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-06  8:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed, Or Gerlitz

>>  static int mlx5e_get_cpu(struct mlx5e_priv *priv, int ix)
>>  {
>> -	return cpumask_first(priv->mdev->priv.irq_info[ix].mask);
>> +	return cpumask_first(pci_irq_get_affinity(priv->mdev->pdev,
>> +			MLX5_EQ_VEC_COMP_BASE + ix));
>
> This looks ok for now, but if we look at the callers we'd probably
> want to use pci_irq_get_node and pci_irq_get_affinity directly in
> mlx5e_open_channel, as well as for the stored-away ->cpu field.  But
> maybe that should be left for another patch after this one.

It's small enough to fold in.

>> +	struct irq_affinity irqdesc = { .pre_vectors = MLX5_EQ_VEC_COMP_BASE, };
>
> I usually move assignments inside structures onto a separate line to make
> it more readable, e.g.
>
> 	struct irq_affinity irqdesc = {
> 		.pre_vectors = MLX5_EQ_VEC_COMP_BASE,
> 	};

Will do.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings
  2017-04-04  6:34     ` Christoph Hellwig
@ 2017-04-06  8:30       ` Sagi Grimberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-06  8:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma, linux-nvme, linux-block, netdev, Saeed Mahameed, Or Gerlitz


>> Use the geneic block layer affinity mapping helper. Also,
>
>           generic
>
>>  	nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
>> +	nr_io_queues = min_t(unsigned int, nr_io_queues,
>> +				ibdev->num_comp_vectors);
>> +
>
> Add a comment here?

Will do

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
@ 2017-04-06  8:34     ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-06  8:34 UTC (permalink / raw)
  To: Max Gurtovoy, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

> Hi Sagi,

Hey Max,

> the patchset looks good and of course we can add support for more
> drivers in the future.
> have you run some performance testing with the nvmf initiator ?

I'm limited by the target machine in terms of IOPs, but the host shows
a ~10% decrease in cpu usage, and latency improves slightly as well.
The improvement is more apparent depending on which cpu I run my IO
thread on (without the series, the mismatch between comp_vectors and
queue mappings leaves some queues with irq vectors mapped to a core on
a different numa node).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper
@ 2017-04-06  9:23       ` Sagi Grimberg
  0 siblings, 0 replies; 52+ messages in thread
From: Sagi Grimberg @ 2017-04-06  9:23 UTC (permalink / raw)
  To: Max Gurtovoy, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

> Shouldn't you include <linux/kobject.h> and <linux/blkdev.h>, like
> commit 8ec2ef2b66ea2f did for blk-mq-pci.c?

Not really. We can lose these from blk-mq-pci.c as well.

>> +#include <linux/blk-mq.h>
>> +#include <linux/blk-mq-rdma.h>
>> +#include <rdma/ib_verbs.h>
>> +#include <linux/module.h>
>> +#include "blk-mq.h"
>
> Is this include needed ?

You're right, I can just keep:

+#include <linux/blk-mq.h>
+#include <linux/blk-mq-rdma.h>
+#include <rdma/ib_verbs.h>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
  2017-04-02 13:41 ` Sagi Grimberg
@ 2017-04-10 18:05   ` Steve Wise
  -1 siblings, 0 replies; 52+ messages in thread
From: Steve Wise @ 2017-04-10 18:05 UTC (permalink / raw)
  To: Sagi Grimberg, linux-rdma, linux-nvme, linux-block
  Cc: netdev, Saeed Mahameed, Or Gerlitz, Christoph Hellwig

On 4/2/2017 8:41 AM, Sagi Grimberg wrote:
> This patch set is aiming to automatically find the optimal
> queue <-> irq multi-queue assignments in storage ULPs (demonstrated
> on nvme-rdma) based on the underlying rdma device irq affinity
> settings.
>
> First two patches modify mlx5 core driver to use generic API
> to allocate array of irq vectors with automatic affinity
> settings instead of open-coding exactly what it does (and
> slightly worse).
>
> Then, in order to obtain an affinity map for a given completion
> vector, we expose a new RDMA core API, and implement it in mlx5.
>
> The third part is addition of a rdma-based queue mapping helper
> to blk-mq that maps the tagset hctx's according to the device
> affinity mappings.
>
> I'd happily convert some more drivers, but I'll need volunteers
> to test as I don't have access to any other devices.

I'll test cxgb4 if you convert it. :)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma
  2017-04-10 18:05   ` Steve Wise
@ 2017-04-12  6:34     ` Christoph Hellwig
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Hellwig @ 2017-04-12  6:34 UTC (permalink / raw)
  To: Steve Wise
  Cc: Sagi Grimberg, linux-rdma, linux-nvme, linux-block, netdev,
	Saeed Mahameed, Or Gerlitz, Christoph Hellwig

On Mon, Apr 10, 2017 at 01:05:50PM -0500, Steve Wise wrote:
> I'll test cxgb4 if you convert it. :)

That will take a lot of work.  The problem with cxgb4 is that it
allocates all the interrupts at device enable time, but only hands
them out to ULDs when they attach, while this scheme assumes a way to
map out queues / vectors at initialization time.
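
The helpers expect the driver to spread its vectors up front, along
the lines of (generic sketch with placeholder vector counts, not
cxgb4 code):

	/* nr_non_queue_vectors, min_vecs, max_vecs are placeholder names */
	struct irq_affinity desc = {
		.pre_vectors = nr_non_queue_vectors,
	};
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, min_vecs, max_vecs,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &desc);

so that the affinity masks already exist by the time a ULD attaches.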

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2017-04-12  6:34 UTC | newest]

Thread overview:
2017-04-02 13:41 [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma Sagi Grimberg
2017-04-02 13:41 ` [PATCH rfc 1/6] mlx5: convert to generic pci_alloc_irq_vectors Sagi Grimberg
2017-04-04  6:27   ` Christoph Hellwig
2017-04-02 13:41 ` [PATCH rfc 2/6] mlx5: move affinity hints assignments to generic code Sagi Grimberg
2017-04-04  6:32   ` Christoph Hellwig
2017-04-06  8:29     ` Sagi Grimberg
2017-04-02 13:41 ` [PATCH rfc 3/6] RDMA/core: expose affinity mappings per completion vector Sagi Grimberg
2017-04-04  6:32   ` Christoph Hellwig
2017-04-02 13:41 ` [PATCH rfc 4/6] mlx5: support ->get_vector_affinity Sagi Grimberg
2017-04-04  6:33   ` Christoph Hellwig
2017-04-02 13:41 ` [PATCH rfc 5/6] block: Add rdma affinity based queue mapping helper Sagi Grimberg
2017-04-04  6:33   ` Christoph Hellwig
2017-04-04  7:46   ` Max Gurtovoy
2017-04-04 13:09     ` Christoph Hellwig
2017-04-06  9:23     ` Sagi Grimberg
2017-04-05 14:17   ` Jens Axboe
2017-04-02 13:41 ` [PATCH rfc 6/6] nvme-rdma: use intelligent affinity based queue mappings Sagi Grimberg
2017-04-04  6:34   ` Christoph Hellwig
2017-04-06  8:30     ` Sagi Grimberg
2017-04-04  7:51 ` [PATCH rfc 0/6] Automatic affinity settings for nvme over rdma Max Gurtovoy
2017-04-06  8:34   ` Sagi Grimberg
2017-04-10 18:05 ` Steve Wise
2017-04-12  6:34   ` Christoph Hellwig
