* [PATCH net-next 0/2] net/smc: Spread workload over multiple cores
@ 2022-01-26 13:01 Tony Lu
  2022-01-26 13:01 ` [PATCH net-next 1/2] net/smc: Introduce smc_ib_cq to bind link and cq Tony Lu
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Tony Lu @ 2022-01-26 13:01 UTC (permalink / raw)
  To: kgraul, kuba, davem; +Cc: netdev, linux-s390, linux-rdma

Currently, SMC creates one CQ per IB device and shares this CQ among
all the QPs of its links. Meanwhile, this CQ is always bound to the
first completion vector, and the IRQ affinity of that vector is bound
to a single CPU core.

┌────────┐    ┌──────────────┐   ┌──────────────┐
│ SMC IB │    ├────┐         │   │              │
│ DEVICE │ ┌─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│   ┌────┤ │  ├────┘         │   │              │
│   │ CQ ├─┘  └──────────────┘   └──────────────┘
│   │    ├─┐  ┌──────────────┐   ┌──────────────┐
│   └────┤ │  ├────┐         │   │              │
│        │ └─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │    ├────┘         │   │              │
└────────┘    └──────────────┘   └──────────────┘

In this model, when the number of connections exceeds
SMC_RMBS_PER_LGR_MAX, SMC creates multiple link groups and
corresponding QPs. All connections share a limited number of QPs and
one CQ (for both the recv and send sides). Since a completion vector is
generally bound to a fixed CPU core, performance is limited by that
single core, especially in large-scale scenarios with many threads and
lots of connections.

Running an nginx + wrk test with 8 threads and 800 connections on an
8-core host, the softirq load on CPU 0 limits scalability:

04:18:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:18:55 PM  all    5.81    0.00   19.42    0.00    2.94   10.21    0.00    0.00    0.00   61.63
04:18:55 PM    0    0.00    0.00    0.00    0.00   16.80   82.78    0.00    0.00    0.00    0.41
<snip>

Nowadays, RDMA devices provide more than one completion vector; for
example, mlx5 exposes 8 and eRDMA exposes 4 by default. This removes
the limitation of a single vector and a single CPU core.
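
The vector count in question is visible to any verbs consumer directly
on struct ib_device; a trivial sketch (the helper name is made up for
illustration):

#include <rdma/ib_verbs.h>

/* Hypothetical debug helper: report how many completion vectors (and
 * thus how many distinct IRQs/cores) this device's CQs can be spread
 * over.
 */
static int smc_dbg_num_comp_vectors(struct ib_device *ibdev)
{
	pr_info("%s: %d completion vectors\n",
		dev_name(&ibdev->dev), ibdev->num_comp_vectors);
	return ibdev->num_comp_vectors;
}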

To enhance scalability and take advantage of multi-core resources, we
can spread CQs over different CPU cores and introduce a more flexible
mapping. This leads to a new model whose main difference is that
multiple CQs are created per IB device; the maximum number of CQs is
limited by the device's capability (num_comp_vectors). With multiple
link groups, each link group's QP can bind to the least used CQ, and
the CQs are bound to different completion vectors and CPU cores. This
spreads the softirq handlers (the wr tx/rx tasklets) over different
cores; see the sketch after the diagram below.

                        ┌──────────────┐   ┌──────────────┐
┌────────┐  ┌───────┐   ├────┐         │   │              │
│        ├─▶│ CQ 0  ├──▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │  └───────┘   ├────┘         │   │              │
│ SMC IB │  ┌───────┐   └──────────────┘   └──────────────┘
│ DEVICE ├─▶│ CQ 1  │─┐                                    
│        │  └───────┘ │ ┌──────────────┐   ┌──────────────┐
│        │  ┌───────┐ │ ├────┐         │   │              │
│        ├─▶│ CQ n  │ └▶│ QP │ SMC LINK├──▶│SMC Link Group│
└────────┘  └───────┘   ├────┘         │   │              │
                        └──────────────┘   └──────────────┘
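
As a concrete illustration of the "least used CQ" policy, the selection
in patch 2/2 boils down to a linear scan for the least-loaded entry; a
simplified sketch, using the smc_ib_cq type introduced by this series:

/* Simplified from patch 2/2: pick the entry of the per-device send or
 * recv CQ array with the fewest QPs currently attached, so a new link
 * lands on the least busy completion vector. 'num' is bounded by
 * num_comp_vectors.
 */
static struct smc_ib_cq *least_used_cq(struct smc_ib_cq *cqs, int num)
{
	struct smc_ib_cq *cq = &cqs[0];
	int i;

	for (i = 1; i < num; i++)
		if (cqs[i].load < cq->load)
			cq = &cqs[i];

	cq->load++;	/* account the new QP on this CQ */
	return cq;
}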

After spreading the CQs (4 link groups) over four CPU cores, the
softirq load spreads to different cores:

04:26:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:26:26 PM  all   10.70    0.00   35.80    0.00    7.64   26.62    0.00    0.00    0.00   19.24
04:26:26 PM    0    0.00    0.00    0.00    0.00   16.33   50.00    0.00    0.00    0.00   33.67
04:26:26 PM    1    0.00    0.00    0.00    0.00   15.46   69.07    0.00    0.00    0.00   15.46
04:26:26 PM    2    0.00    0.00    0.00    0.00   13.13   39.39    0.00    0.00    0.00   47.47
04:26:26 PM    3    0.00    0.00    0.00    0.00   13.27   55.10    0.00    0.00    0.00   31.63
<snip>

Here is the benchmark with this patch set:

Test environment:
- CPU: Intel Xeon Platinum, 8 cores; memory: 32 GiB; NIC: Mellanox CX4.
- nginx + wrk HTTP benchmark.
- nginx: access_log disabled, keepalive_timeout and keepalive_requests
  increased, long-lived connections, return 200 directly.
- wrk: 8 threads and 100, 200, 400 connections.

Benchmark result:

Conns/QPS         100        200        400
w/o patch   338502.49  359216.66  398167.16
w/  patch   677247.40  694193.70  812502.69
Ratio        +100.07%    +93.25%   +104.06%

This patch set shows a nearly 1x increase in QPS.

The benchmarks with 100, 200 and 400 connections use 1, 1 and 2 link
groups respectively. With one link group, send/recv work is already
spread over two cores, because the send and recv CQs are bound to
different completion vectors. With more than one link group, the load
spreads over more cores.

RFC Link: https://lore.kernel.org/netdev/YeRaSdg8TcNJsGBB@TonyMac-Alibaba/T/

These two patches are split out of the previous RFC; the
netlink-related patch is moved to the next patch set.

Tony Lu (2):
  net/smc: Introduce smc_ib_cq to bind link and cq
  net/smc: Multiple CQs per IB devices

 net/smc/smc_core.h |   2 +
 net/smc/smc_ib.c   | 132 ++++++++++++++++++++++++++++++++++++---------
 net/smc/smc_ib.h   |  15 ++++--
 net/smc/smc_wr.c   |  44 +++++++++------
 4 files changed, 148 insertions(+), 45 deletions(-)

-- 
2.32.0.3.g01195cf9f



* [PATCH net-next 1/2] net/smc: Introduce smc_ib_cq to bind link and cq
  2022-01-26 13:01 [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Tony Lu
@ 2022-01-26 13:01 ` Tony Lu
  2022-01-26 13:01 ` [PATCH net-next 2/2] net/smc: Multiple CQs per IB devices Tony Lu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Tony Lu @ 2022-01-26 13:01 UTC (permalink / raw)
  To: kgraul, kuba, davem; +Cc: netdev, linux-s390, linux-rdma

This patch introduces struct smc_ib_cq as an intermediate layer between
smc_link and ib_cq. Every smc_link reaches its ib_cq through its own
smc_ib_cq, which decouples smc_link from smc_ib_device. This allows a
more flexible mapping and prepares for multiple-CQ support.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_core.h |  2 ++
 net/smc/smc_ib.c   | 85 +++++++++++++++++++++++++++++++++-------------
 net/smc/smc_ib.h   | 13 ++++---
 net/smc/smc_wr.c   | 34 +++++++++----------
 4 files changed, 89 insertions(+), 45 deletions(-)

diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 521c64a3d8d3..fd10cad8fb77 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -86,6 +86,8 @@ struct smc_link {
 	struct ib_pd		*roce_pd;	/* IB protection domain,
 						 * unique for every RoCE QP
 						 */
+	struct smc_ib_cq	*smcibcq_recv;	/* cq for recv */
+	struct smc_ib_cq	*smcibcq_send;	/* cq for send */
 	struct ib_qp		*roce_qp;	/* IB queue pair */
 	struct ib_qp_attr	qp_attr;	/* IB queue pair attributes */
 
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index a3e2d3b89568..0d98cf440adc 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -133,7 +133,7 @@ int smc_ib_ready_link(struct smc_link *lnk)
 	if (rc)
 		goto out;
 	smc_wr_remember_qp_attr(lnk);
-	rc = ib_req_notify_cq(lnk->smcibdev->roce_cq_recv,
+	rc = ib_req_notify_cq(lnk->smcibcq_recv->ib_cq,
 			      IB_CQ_SOLICITED_MASK);
 	if (rc)
 		goto out;
@@ -651,6 +651,8 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 	if (lnk->roce_qp)
 		ib_destroy_qp(lnk->roce_qp);
 	lnk->roce_qp = NULL;
+	lnk->smcibcq_send = NULL;
+	lnk->smcibcq_recv = NULL;
 }
 
 /* create a queue pair within the protection domain for a link */
@@ -660,8 +662,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	struct ib_qp_init_attr qp_attr = {
 		.event_handler = smc_ib_qp_event_handler,
 		.qp_context = lnk,
-		.send_cq = lnk->smcibdev->roce_cq_send,
-		.recv_cq = lnk->smcibdev->roce_cq_recv,
+		.send_cq = lnk->smcibdev->ib_cq_send->ib_cq,
+		.recv_cq = lnk->smcibdev->ib_cq_recv->ib_cq,
 		.srq = NULL,
 		.cap = {
 				/* include unsolicited rdma_writes as well,
@@ -679,10 +681,13 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
-	if (IS_ERR(lnk->roce_qp))
+	if (IS_ERR(lnk->roce_qp)) {
 		lnk->roce_qp = NULL;
-	else
+	} else {
+		lnk->smcibcq_send = lnk->smcibdev->ib_cq_send;
+		lnk->smcibcq_recv = lnk->smcibdev->ib_cq_recv;
 		smc_wr_remember_qp_attr(lnk);
+	}
 	return rc;
 }
 
@@ -799,10 +804,21 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
 	buf_slot->sgt[lnk->link_idx].sgl->dma_address = 0;
 }
 
+static void smc_ib_cleanup_cq(struct smc_ib_device *smcibdev)
+{
+	ib_destroy_cq(smcibdev->ib_cq_send->ib_cq);
+	kfree(smcibdev->ib_cq_send);
+	smcibdev->ib_cq_send = NULL;
+
+	ib_destroy_cq(smcibdev->ib_cq_recv->ib_cq);
+	kfree(smcibdev->ib_cq_recv);
+	smcibdev->ib_cq_recv = NULL;
+}
+
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 {
-	struct ib_cq_init_attr cqattr =	{
-		.cqe = SMC_MAX_CQE, .comp_vector = 0 };
+	struct ib_cq_init_attr cqattr =	{ .cqe = SMC_MAX_CQE };
+	struct smc_ib_cq *smcibcq_send, *smcibcq_recv;
 	int cqe_size_order, smc_order;
 	long rc;
 
@@ -815,28 +831,50 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 	smc_order = MAX_ORDER - cqe_size_order - 1;
 	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
 		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
-	smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
-					      smc_wr_tx_cq_handler, NULL,
-					      smcibdev, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_send);
-	if (IS_ERR(smcibdev->roce_cq_send)) {
-		smcibdev->roce_cq_send = NULL;
+
+	smcibcq_send = kzalloc(sizeof(*smcibcq_send), GFP_KERNEL);
+	if (!smcibcq_send) {
+		rc = -ENOMEM;
+		goto out;
+	}
+	smcibcq_send->smcibdev = smcibdev;
+	smcibcq_send->is_send = 1;
+	cqattr.comp_vector = 0;
+	smcibcq_send->ib_cq = ib_create_cq(smcibdev->ibdev,
+					   smc_wr_tx_cq_handler, NULL,
+					   smcibcq_send, &cqattr);
+	rc = PTR_ERR_OR_ZERO(smcibdev->ib_cq_send);
+	if (IS_ERR(smcibdev->ib_cq_send)) {
+		smcibdev->ib_cq_send = NULL;
 		goto out;
 	}
-	smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
-					      smc_wr_rx_cq_handler, NULL,
-					      smcibdev, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_recv);
-	if (IS_ERR(smcibdev->roce_cq_recv)) {
-		smcibdev->roce_cq_recv = NULL;
-		goto err;
+	smcibdev->ib_cq_send = smcibcq_send;
+
+	smcibcq_recv = kzalloc(sizeof(*smcibcq_recv), GFP_KERNEL);
+	if (!smcibcq_recv) {
+		rc = -ENOMEM;
+		goto err_send;
+	}
+	smcibcq_recv->smcibdev = smcibdev;
+	cqattr.comp_vector = 1;
+	smcibcq_recv->ib_cq = ib_create_cq(smcibdev->ibdev,
+					   smc_wr_rx_cq_handler, NULL,
+					   smcibcq_recv, &cqattr);
+	rc = PTR_ERR_OR_ZERO(smcibdev->ib_cq_recv);
+	if (IS_ERR(smcibdev->ib_cq_recv)) {
+		smcibdev->ib_cq_recv = NULL;
+		goto err_recv;
 	}
+	smcibdev->ib_cq_recv = smcibcq_recv;
 	smc_wr_add_dev(smcibdev);
 	smcibdev->initialized = 1;
 	goto out;
 
-err:
-	ib_destroy_cq(smcibdev->roce_cq_send);
+err_recv:
+	kfree(smcibcq_recv);
+	ib_destroy_cq(smcibcq_send->ib_cq);
+err_send:
+	kfree(smcibcq_send);
 out:
 	mutex_unlock(&smcibdev->mutex);
 	return rc;
@@ -848,8 +886,7 @@ static void smc_ib_cleanup_per_ibdev(struct smc_ib_device *smcibdev)
 	if (!smcibdev->initialized)
 		goto out;
 	smcibdev->initialized = 0;
-	ib_destroy_cq(smcibdev->roce_cq_recv);
-	ib_destroy_cq(smcibdev->roce_cq_send);
+	smc_ib_cleanup_cq(smcibdev);
 	smc_wr_remove_dev(smcibdev);
 out:
 	mutex_unlock(&smcibdev->mutex);
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 5d8b49c57f50..1dc567599977 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -32,15 +32,20 @@ struct smc_ib_devices {			/* list of smc ib devices definition */
 extern struct smc_ib_devices	smc_ib_devices; /* list of smc ib devices */
 extern struct smc_lgr_list smc_lgr_list; /* list of linkgroups */
 
+struct smc_ib_cq {		/* ib_cq wrapper for smc */
+	struct smc_ib_device	*smcibdev;	/* parent ib device */
+	struct ib_cq		*ib_cq;		/* real ib_cq for link */
+	struct tasklet_struct	tasklet;	/* tasklet for wr */
+	bool			is_send;	/* send for recv cq */
+};
+
 struct smc_ib_device {				/* ib-device infos for smc */
 	struct list_head	list;
 	struct ib_device	*ibdev;
 	struct ib_port_attr	pattr[SMC_MAX_PORTS];	/* ib dev. port attrs */
 	struct ib_event_handler	event_handler;	/* global ib_event handler */
-	struct ib_cq		*roce_cq_send;	/* send completion queue */
-	struct ib_cq		*roce_cq_recv;	/* recv completion queue */
-	struct tasklet_struct	send_tasklet;	/* called by send cq handler */
-	struct tasklet_struct	recv_tasklet;	/* called by recv cq handler */
+	struct smc_ib_cq	*ib_cq_send;	/* send completion queue */
+	struct smc_ib_cq	*ib_cq_recv;	/* recv completion queue */
 	char			mac[SMC_MAX_PORTS][ETH_ALEN];
 						/* mac address per port*/
 	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 24be1d03fef9..ddb0ba67a851 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -135,7 +135,7 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
 
 static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
 {
-	struct smc_ib_device *dev = from_tasklet(dev, t, send_tasklet);
+	struct smc_ib_cq *smcibcq = from_tasklet(smcibcq, t, tasklet);
 	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
 	int i = 0, rc;
 	int polled = 0;
@@ -144,9 +144,9 @@ static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
 	polled++;
 	do {
 		memset(&wc, 0, sizeof(wc));
-		rc = ib_poll_cq(dev->roce_cq_send, SMC_WR_MAX_POLL_CQE, wc);
+		rc = ib_poll_cq(smcibcq->ib_cq, SMC_WR_MAX_POLL_CQE, wc);
 		if (polled == 1) {
-			ib_req_notify_cq(dev->roce_cq_send,
+			ib_req_notify_cq(smcibcq->ib_cq,
 					 IB_CQ_NEXT_COMP |
 					 IB_CQ_REPORT_MISSED_EVENTS);
 		}
@@ -161,9 +161,9 @@ static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
 
 void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
 {
-	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
+	struct smc_ib_cq *smcibcq = (struct smc_ib_cq *)cq_context;
 
-	tasklet_schedule(&dev->send_tasklet);
+	tasklet_schedule(&smcibcq->tasklet);
 }
 
 /*---------------------------- request submission ---------------------------*/
@@ -306,7 +306,7 @@ int smc_wr_tx_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv)
 	struct smc_wr_tx_pend *pend;
 	int rc;
 
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
+	ib_req_notify_cq(link->smcibcq_send->ib_cq,
 			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	pend = container_of(priv, struct smc_wr_tx_pend, priv);
 	rc = ib_post_send(link->roce_qp, &link->wr_tx_ibs[pend->idx], NULL);
@@ -323,7 +323,7 @@ int smc_wr_tx_v2_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv,
 	int rc;
 
 	link->wr_tx_v2_ib->sg_list[0].length = len;
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
+	ib_req_notify_cq(link->smcibcq_send->ib_cq,
 			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	rc = ib_post_send(link->roce_qp, link->wr_tx_v2_ib, NULL);
 	if (rc) {
@@ -367,7 +367,7 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr)
 {
 	int rc;
 
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
+	ib_req_notify_cq(link->smcibcq_send->ib_cq,
 			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	link->wr_reg_state = POSTED;
 	link->wr_reg.wr.wr_id = (u64)(uintptr_t)mr;
@@ -476,7 +476,7 @@ static inline void smc_wr_rx_process_cqes(struct ib_wc wc[], int num)
 
 static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
 {
-	struct smc_ib_device *dev = from_tasklet(dev, t, recv_tasklet);
+	struct smc_ib_cq *smcibcq = from_tasklet(smcibcq, t, tasklet);
 	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
 	int polled = 0;
 	int rc;
@@ -485,9 +485,9 @@ static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
 	polled++;
 	do {
 		memset(&wc, 0, sizeof(wc));
-		rc = ib_poll_cq(dev->roce_cq_recv, SMC_WR_MAX_POLL_CQE, wc);
+		rc = ib_poll_cq(smcibcq->ib_cq, SMC_WR_MAX_POLL_CQE, wc);
 		if (polled == 1) {
-			ib_req_notify_cq(dev->roce_cq_recv,
+			ib_req_notify_cq(smcibcq->ib_cq,
 					 IB_CQ_SOLICITED_MASK
 					 | IB_CQ_REPORT_MISSED_EVENTS);
 		}
@@ -501,9 +501,9 @@ static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
 
 void smc_wr_rx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
 {
-	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
+	struct smc_ib_cq *smcibcq = (struct smc_ib_cq *)cq_context;
 
-	tasklet_schedule(&dev->recv_tasklet);
+	tasklet_schedule(&smcibcq->tasklet);
 }
 
 int smc_wr_rx_post_init(struct smc_link *link)
@@ -830,14 +830,14 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 
 void smc_wr_remove_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_kill(&smcibdev->recv_tasklet);
-	tasklet_kill(&smcibdev->send_tasklet);
+	tasklet_kill(&smcibdev->ib_cq_recv->tasklet);
+	tasklet_kill(&smcibdev->ib_cq_send->tasklet);
 }
 
 void smc_wr_add_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_setup(&smcibdev->recv_tasklet, smc_wr_rx_tasklet_fn);
-	tasklet_setup(&smcibdev->send_tasklet, smc_wr_tx_tasklet_fn);
+	tasklet_setup(&smcibdev->ib_cq_recv->tasklet, smc_wr_rx_tasklet_fn);
+	tasklet_setup(&smcibdev->ib_cq_send->tasklet, smc_wr_tx_tasklet_fn);
 }
 
 int smc_wr_create_link(struct smc_link *lnk)
-- 
2.32.0.3.g01195cf9f



* [PATCH net-next 2/2] net/smc: Multiple CQs per IB devices
  2022-01-26 13:01 [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Tony Lu
  2022-01-26 13:01 ` [PATCH net-next 1/2] net/smc: Introduce smc_ib_cq to bind link and cq Tony Lu
@ 2022-01-26 13:01 ` Tony Lu
  2022-01-26 15:29 ` [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Jason Gunthorpe
  2022-01-27 14:59 ` Karsten Graul
  3 siblings, 0 replies; 8+ messages in thread
From: Tony Lu @ 2022-01-26 13:01 UTC (permalink / raw)
  To: kgraul, kuba, davem; +Cc: netdev, linux-s390, linux-rdma

This allows multiple CQs per IB device, instead of the single CQ used
so far.

During IB device setup, ibdev->num_comp_vectors send/recv CQs and the
corresponding tasklets are initialized, similar to the queues of net
devices.

Every smc_link has its own send and recv CQ, which is always assigned
from the least used CQs of the current IB device.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_ib.c | 139 +++++++++++++++++++++++++++++++----------------
 net/smc/smc_ib.h |   6 +-
 net/smc/smc_wr.c |  18 ++++--
 3 files changed, 111 insertions(+), 52 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0d98cf440adc..5d2fce0a7796 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -625,6 +625,36 @@ int smcr_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+static struct smc_ib_cq *smc_ib_get_least_used_cq(struct smc_ib_device *smcibdev,
+						  bool is_send)
+{
+	struct smc_ib_cq *smcibcq, *cq;
+	int min, i;
+
+	if (is_send)
+		smcibcq = smcibdev->smcibcq_send;
+	else
+		smcibcq = smcibdev->smcibcq_recv;
+
+	cq = smcibcq;
+	min = cq->load;
+
+	for (i = 0; i < smcibdev->num_cq_peer; i++) {
+		if (smcibcq[i].load < min) {
+			cq = &smcibcq[i];
+			min = cq->load;
+		}
+	}
+
+	cq->load++;
+	return cq;
+}
+
+static void smc_ib_put_cq(struct smc_ib_cq *smcibcq)
+{
+	smcibcq->load--;
+}
+
 static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
 {
 	struct smc_link *lnk = (struct smc_link *)priv;
@@ -648,8 +678,11 @@ static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
 
 void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 {
-	if (lnk->roce_qp)
+	if (lnk->roce_qp) {
 		ib_destroy_qp(lnk->roce_qp);
+		smc_ib_put_cq(lnk->smcibcq_send);
+		smc_ib_put_cq(lnk->smcibcq_recv);
+	}
 	lnk->roce_qp = NULL;
 	lnk->smcibcq_send = NULL;
 	lnk->smcibcq_recv = NULL;
@@ -658,12 +691,16 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 /* create a queue pair within the protection domain for a link */
 int smc_ib_create_queue_pair(struct smc_link *lnk)
 {
+	struct smc_ib_cq *smcibcq_send = smc_ib_get_least_used_cq(lnk->smcibdev,
+								  true);
+	struct smc_ib_cq *smcibcq_recv = smc_ib_get_least_used_cq(lnk->smcibdev,
+								  false);
 	int sges_per_buf = (lnk->lgr->smc_version == SMC_V2) ? 2 : 1;
 	struct ib_qp_init_attr qp_attr = {
 		.event_handler = smc_ib_qp_event_handler,
 		.qp_context = lnk,
-		.send_cq = lnk->smcibdev->ib_cq_send->ib_cq,
-		.recv_cq = lnk->smcibdev->ib_cq_recv->ib_cq,
+		.send_cq = smcibcq_send->ib_cq,
+		.recv_cq = smcibcq_recv->ib_cq,
 		.srq = NULL,
 		.cap = {
 				/* include unsolicited rdma_writes as well,
@@ -684,8 +721,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	if (IS_ERR(lnk->roce_qp)) {
 		lnk->roce_qp = NULL;
 	} else {
-		lnk->smcibcq_send = lnk->smcibdev->ib_cq_send;
-		lnk->smcibcq_recv = lnk->smcibdev->ib_cq_recv;
+		lnk->smcibcq_send = smcibcq_send;
+		lnk->smcibcq_recv = smcibcq_recv;
 		smc_wr_remember_qp_attr(lnk);
 	}
 	return rc;
@@ -806,20 +843,26 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
 
 static void smc_ib_cleanup_cq(struct smc_ib_device *smcibdev)
 {
-	ib_destroy_cq(smcibdev->ib_cq_send->ib_cq);
-	kfree(smcibdev->ib_cq_send);
-	smcibdev->ib_cq_send = NULL;
+	int i;
+
+	for (i = 0; i < smcibdev->num_cq_peer; i++) {
+		if (smcibdev->smcibcq_send[i].ib_cq)
+			ib_destroy_cq(smcibdev->smcibcq_send[i].ib_cq);
+
+		if (smcibdev->smcibcq_recv[i].ib_cq)
+			ib_destroy_cq(smcibdev->smcibcq_recv[i].ib_cq);
+	}
 
-	ib_destroy_cq(smcibdev->ib_cq_recv->ib_cq);
-	kfree(smcibdev->ib_cq_recv);
-	smcibdev->ib_cq_recv = NULL;
+	kfree(smcibdev->smcibcq_send);
+	kfree(smcibdev->smcibcq_recv);
 }
 
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 {
 	struct ib_cq_init_attr cqattr =	{ .cqe = SMC_MAX_CQE };
-	struct smc_ib_cq *smcibcq_send, *smcibcq_recv;
 	int cqe_size_order, smc_order;
+	struct smc_ib_cq *smcibcq;
+	int i, num_cq_peer;
 	long rc;
 
 	mutex_lock(&smcibdev->mutex);
@@ -832,49 +875,53 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
 		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
 
-	smcibcq_send = kzalloc(sizeof(*smcibcq_send), GFP_KERNEL);
-	if (!smcibcq_send) {
+	num_cq_peer = min_t(int, smcibdev->ibdev->num_comp_vectors,
+			    num_online_cpus());
+	smcibdev->num_cq_peer = num_cq_peer;
+	smcibdev->smcibcq_send = kcalloc(num_cq_peer, sizeof(*smcibcq),
+					 GFP_KERNEL);
+	if (!smcibdev->smcibcq_send) {
 		rc = -ENOMEM;
-		goto out;
-	}
-	smcibcq_send->smcibdev = smcibdev;
-	smcibcq_send->is_send = 1;
-	cqattr.comp_vector = 0;
-	smcibcq_send->ib_cq = ib_create_cq(smcibdev->ibdev,
-					   smc_wr_tx_cq_handler, NULL,
-					   smcibcq_send, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->ib_cq_send);
-	if (IS_ERR(smcibdev->ib_cq_send)) {
-		smcibdev->ib_cq_send = NULL;
-		goto out;
+		goto err;
 	}
-	smcibdev->ib_cq_send = smcibcq_send;
-
-	smcibcq_recv = kzalloc(sizeof(*smcibcq_recv), GFP_KERNEL);
-	if (!smcibcq_recv) {
+	smcibdev->smcibcq_recv = kcalloc(num_cq_peer, sizeof(*smcibcq),
+					 GFP_KERNEL);
+	if (!smcibdev->smcibcq_recv) {
 		rc = -ENOMEM;
-		goto err_send;
+		goto err;
 	}
-	smcibcq_recv->smcibdev = smcibdev;
-	cqattr.comp_vector = 1;
-	smcibcq_recv->ib_cq = ib_create_cq(smcibdev->ibdev,
-					   smc_wr_rx_cq_handler, NULL,
-					   smcibcq_recv, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->ib_cq_recv);
-	if (IS_ERR(smcibdev->ib_cq_recv)) {
-		smcibdev->ib_cq_recv = NULL;
-		goto err_recv;
+
+	/* initialize CQs */
+	for (i = 0; i < num_cq_peer; i++) {
+		/* initialize send CQ */
+		smcibcq = &smcibdev->smcibcq_send[i];
+		smcibcq->smcibdev = smcibdev;
+		smcibcq->is_send = 1;
+		cqattr.comp_vector = i;
+		smcibcq->ib_cq = ib_create_cq(smcibdev->ibdev,
+					      smc_wr_tx_cq_handler, NULL,
+					      smcibcq, &cqattr);
+		rc = PTR_ERR_OR_ZERO(smcibcq->ib_cq);
+		if (IS_ERR(smcibcq->ib_cq))
+			goto err;
+
+		/* initialize recv CQ */
+		smcibcq = &smcibdev->smcibcq_recv[i];
+		smcibcq->smcibdev = smcibdev;
+		cqattr.comp_vector = num_cq_peer - 1 - i; /* reverse to spread snd/rcv */
+		smcibcq->ib_cq = ib_create_cq(smcibdev->ibdev,
+					      smc_wr_rx_cq_handler, NULL,
+					      smcibcq, &cqattr);
+		rc = PTR_ERR_OR_ZERO(smcibcq->ib_cq);
+		if (IS_ERR(smcibcq->ib_cq))
+			goto err;
 	}
-	smcibdev->ib_cq_recv = smcibcq_recv;
 	smc_wr_add_dev(smcibdev);
 	smcibdev->initialized = 1;
 	goto out;
 
-err_recv:
-	kfree(smcibcq_recv);
-	ib_destroy_cq(smcibcq_send->ib_cq);
-err_send:
-	kfree(smcibcq_send);
+err:
+	smc_ib_cleanup_cq(smcibdev);
 out:
 	mutex_unlock(&smcibdev->mutex);
 	return rc;
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 1dc567599977..d303b0717c3f 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -37,6 +37,7 @@ struct smc_ib_cq {		/* ib_cq wrapper for smc */
 	struct ib_cq		*ib_cq;		/* real ib_cq for link */
 	struct tasklet_struct	tasklet;	/* tasklet for wr */
 	bool			is_send;	/* send for recv cq */
+	int			load;		/* load of current cq */
 };
 
 struct smc_ib_device {				/* ib-device infos for smc */
@@ -44,8 +45,9 @@ struct smc_ib_device {				/* ib-device infos for smc */
 	struct ib_device	*ibdev;
 	struct ib_port_attr	pattr[SMC_MAX_PORTS];	/* ib dev. port attrs */
 	struct ib_event_handler	event_handler;	/* global ib_event handler */
-	struct smc_ib_cq	*ib_cq_send;	/* send completion queue */
-	struct smc_ib_cq	*ib_cq_recv;	/* recv completion queue */
+	int			num_cq_peer;	/* num of snd/rcv cq peer */
+	struct smc_ib_cq	*smcibcq_send;	/* send cqs */
+	struct smc_ib_cq	*smcibcq_recv;	/* recv cqs */
 	char			mac[SMC_MAX_PORTS][ETH_ALEN];
 						/* mac address per port*/
 	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index ddb0ba67a851..24014c9924b1 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -830,14 +830,24 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 
 void smc_wr_remove_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_kill(&smcibdev->ib_cq_recv->tasklet);
-	tasklet_kill(&smcibdev->ib_cq_send->tasklet);
+	int i;
+
+	for (i = 0; i < smcibdev->num_cq_peer; i++) {
+		tasklet_kill(&smcibdev->smcibcq_send[i].tasklet);
+		tasklet_kill(&smcibdev->smcibcq_recv[i].tasklet);
+	}
 }
 
 void smc_wr_add_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_setup(&smcibdev->ib_cq_recv->tasklet, smc_wr_rx_tasklet_fn);
-	tasklet_setup(&smcibdev->ib_cq_send->tasklet, smc_wr_tx_tasklet_fn);
+	int i;
+
+	for (i = 0; i < smcibdev->num_cq_peer; i++) {
+		tasklet_setup(&smcibdev->smcibcq_send[i].tasklet,
+			      smc_wr_tx_tasklet_fn);
+		tasklet_setup(&smcibdev->smcibcq_recv[i].tasklet,
+			      smc_wr_rx_tasklet_fn);
+	}
 }
 
 int smc_wr_create_link(struct smc_link *lnk)
-- 
2.32.0.3.g01195cf9f



* Re: [PATCH net-next 0/2] net/smc: Spread workload over multiple cores
  2022-01-26 13:01 [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Tony Lu
  2022-01-26 13:01 ` [PATCH net-next 1/2] net/smc: Introduce smc_ib_cq to bind link and cq Tony Lu
  2022-01-26 13:01 ` [PATCH net-next 2/2] net/smc: Multiple CQs per IB devices Tony Lu
@ 2022-01-26 15:29 ` Jason Gunthorpe
  2022-01-27  3:19   ` Tony Lu
  2022-01-27 14:59 ` Karsten Graul
  3 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2022-01-26 15:29 UTC (permalink / raw)
  To: Tony Lu; +Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> Currently, SMC creates one CQ per IB device, and shares this cq among
> all the QPs of links. Meanwhile, this CQ is always binded to the first
> completion vector, the IRQ affinity of this vector binds to some CPU
> core.

As we said in the RFC discussion, this should be updated to use the
proper core APIs, not re-implement them in a driver like this.

Jason


* Re: [PATCH net-next 0/2] net/smc: Spread workload over multiple cores
  2022-01-26 15:29 ` [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Jason Gunthorpe
@ 2022-01-27  3:19   ` Tony Lu
  2022-01-27  6:18     ` Leon Romanovsky
  0 siblings, 1 reply; 8+ messages in thread
From: Tony Lu @ 2022-01-27  3:19 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: kgraul, kuba, davem, netdev, linux-s390, linux-rdma

On Wed, Jan 26, 2022 at 11:29:16AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> > Currently, SMC creates one CQ per IB device, and shares this cq among
> > all the QPs of links. Meanwhile, this CQ is always binded to the first
> > completion vector, the IRQ affinity of this vector binds to some CPU
> > core.
> 
> As we said in the RFC discussion this should be updated to use the
> proper core APIS, not re-implement them in a driver like this.

Thanks for your advice. As I replied in the RFC, I will start to do that
after a clear plan is determined.

Glad to hear your advice. 

Tony Lu


* Re: [PATCH net-next 0/2] net/smc: Spread workload over multiple cores
  2022-01-27  3:19   ` Tony Lu
@ 2022-01-27  6:18     ` Leon Romanovsky
  2022-01-27  8:05       ` Tony Lu
  0 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2022-01-27  6:18 UTC (permalink / raw)
  To: Tony Lu
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Jan 27, 2022 at 11:19:10AM +0800, Tony Lu wrote:
> On Wed, Jan 26, 2022 at 11:29:16AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> > > Currently, SMC creates one CQ per IB device, and shares this cq among
> > > all the QPs of links. Meanwhile, this CQ is always binded to the first
> > > completion vector, the IRQ affinity of this vector binds to some CPU
> > > core.
> > 
> > As we said in the RFC discussion this should be updated to use the
> > proper core APIS, not re-implement them in a driver like this.
> 
> Thanks for your advice. As I replied in the RFC, I will start to do that
> after a clear plan is determined.
> 
> Glad to hear your advice. 

Please do the right thing from the beginning.

You are improving code from 2017 to align it with core code that has
existed since 2020.

Thanks

> 
> Tony Lu
> 


* Re: [PATCH net-next 0/2] net/smc: Spread workload over multiple cores
  2022-01-27  6:18     ` Leon Romanovsky
@ 2022-01-27  8:05       ` Tony Lu
  0 siblings, 0 replies; 8+ messages in thread
From: Tony Lu @ 2022-01-27  8:05 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Jan 27, 2022 at 08:18:48AM +0200, Leon Romanovsky wrote:
> On Thu, Jan 27, 2022 at 11:19:10AM +0800, Tony Lu wrote:
> > On Wed, Jan 26, 2022 at 11:29:16AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> > > > Currently, SMC creates one CQ per IB device, and shares this cq among
> > > > all the QPs of links. Meanwhile, this CQ is always binded to the first
> > > > completion vector, the IRQ affinity of this vector binds to some CPU
> > > > core.
> > > 
> > > As we said in the RFC discussion this should be updated to use the
> > > proper core APIS, not re-implement them in a driver like this.
> > 
> > Thanks for your advice. As I replied in the RFC, I will start to do that
> > after a clear plan is determined.
> > 
> > Glad to hear your advice. 
> 
> Please do right thing from the beginning.
> 
> You are improving code from 2017 to be aligned with core code that
> exists from 2020.

Thanks for your reply. This patch set isn't a brand-new feature; it
just adjusts and recombines existing code and logic, aiming to solve an
existing issue in the real world. So I am fixing it now.

The other thing is to align the code with the new API. I will do that
before a full discussion with Karsten.

Thank you,
Tony Lu
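
For reference, the core API in question is presumably the RDMA core's
CQ allocation and pooling helpers (ib_alloc_cq() and ib_cq_pool_get() /
ib_cq_pool_put(), merged around v5.8), which already spread CQs over
completion vectors and deliver completions through ib_cqe callbacks. A
minimal sketch of the pool variant, with made-up wrapper names and not
what the posted series does:

#include <rdma/ib_verbs.h>

/* Hypothetical wrappers: let the RDMA core pick a completion vector and
 * share pooled CQs. A negative comp_vector_hint asks the core to choose;
 * nr_cqe passed to ib_cq_pool_put() must match the value used on get.
 */
static struct ib_cq *example_get_cq(struct ib_device *ibdev,
				    unsigned int nr_cqe)
{
	return ib_cq_pool_get(ibdev, nr_cqe, -1, IB_POLL_SOFTIRQ);
}

static void example_put_cq(struct ib_cq *cq, unsigned int nr_cqe)
{
	ib_cq_pool_put(cq, nr_cqe);
}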


* Re: [PATCH net-next 0/2] net/smc: Spread workload over multiple cores
  2022-01-26 13:01 [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Tony Lu
                   ` (2 preceding siblings ...)
  2022-01-26 15:29 ` [PATCH net-next 0/2] net/smc: Spread workload over multiple cores Jason Gunthorpe
@ 2022-01-27 14:59 ` Karsten Graul
  3 siblings, 0 replies; 8+ messages in thread
From: Karsten Graul @ 2022-01-27 14:59 UTC (permalink / raw)
  To: Tony Lu, kuba, davem; +Cc: netdev, linux-s390, linux-rdma

On 26/01/2022 14:01, Tony Lu wrote:
> Currently, SMC creates one CQ per IB device, and shares this cq among
> all the QPs of links. Meanwhile, this CQ is always binded to the first
> completion vector, the IRQ affinity of this vector binds to some CPU
> core. 

As discussed in the RFC thread, please come back with the complete fix.

Thanks for the work you are putting in here!

And thanks for the feedback from the rdma side!

