* [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
@ 2022-01-14  5:48 Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 1/6] net/smc: Spread CQs to differents completion vectors Tony Lu
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

Currently, SMC creates one CQ per IB device and shares this CQ among
all the QPs of its links. Meanwhile, this CQ is always bound to the first
completion vector, and the IRQ affinity of that vector ties it to a single
CPU core.

┌────────┐    ┌──────────────┐   ┌──────────────┐
│ SMC IB │    ├────┐         │   │              │
│ DEVICE │ ┌─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│   ┌────┤ │  ├────┘         │   │              │
│   │ CQ ├─┘  └──────────────┘   └──────────────┘
│   │    ├─┐  ┌──────────────┐   ┌──────────────┐
│   └────┤ │  ├────┐         │   │              │
│        │ └─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │    ├────┘         │   │              │
└────────┘    └──────────────┘   └──────────────┘

In this model, when the number of connections exceeds SMC_RMBS_PER_LGR_MAX,
SMC creates multiple link groups and corresponding QPs. All the connections
share a limited number of QPs and one CQ (for both recv and send sides).
Since one completion vector is generally bound to a fixed CPU core, the
performance is limited by that single core, especially in large-scale
scenarios with many threads and lots of connections.

Running an nginx + wrk test with 8 threads and 800 connections on an
8-core host, the softirq load on CPU 0 limits scalability:

04:18:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:18:55 PM  all    5.81    0.00   19.42    0.00    2.94   10.21    0.00    0.00    0.00   61.63
04:18:55 PM    0    0.00    0.00    0.00    0.00   16.80   82.78    0.00    0.00    0.00    0.41
<snip>
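
For reference, the per-CPU softirq numbers above come from mpstat; a rough
sketch of reproducing this view and checking which core a device's
completion vector IRQs are pinned to looks like the following (the "mlx5"
match is only an example, the IRQ naming is device and driver specific):

        # per-CPU utilization, 1 second interval
        mpstat -P ALL 1

        # completion vector IRQs of the RDMA device and their CPU affinity
        grep mlx5 /proc/interrupts
        cat /proc/irq/<irq-number>/smp_affinity_list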

Nowadays, RDMA devices have more than one completion vector; for example,
mlx5 has 8 and eRDMA has 4 completion vectors by default. This removes the
limitation of a single vector and a single CPU core.

To enhance scalability and take advantage of multi-core resources, we
can spread CQs across different CPU cores and introduce a more flexible
mapping. This leads to a new model whose main difference is that multiple
CQs are created per IB device, with the maximum number of CQs limited by
the ibdev's capability (num_comp_vectors). In the scenario of multiple
link groups, a link group's QP can bind to the least used CQ, and the CQs
are bound to different completion vectors and CPU cores. This way the
softirq handling (the tasklets for wr tx/rx) is spread across different
cores.

                        ┌──────────────┐   ┌──────────────┐
┌────────┐  ┌───────┐   ├────┐         │   │              │
│        ├─▶│ CQ 0  ├──▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │  └───────┘   ├────┘         │   │              │
│ SMC IB │  ┌───────┐   └──────────────┘   └──────────────┘
│ DEVICE ├─▶│ CQ 1  │─┐                                    
│        │  └───────┘ │ ┌──────────────┐   ┌──────────────┐
│        │  ┌───────┐ │ ├────┐         │   │              │
│        ├─▶│ CQ n  │ └▶│ QP │ SMC LINK├──▶│SMC Link Group│
└────────┘  └───────┘   ├────┘         │   │              │
                        └──────────────┘   └──────────────┘

After spreading the work of one CQ (4 link groups) over four CPU cores,
the softirq load is distributed across different cores:

04:26:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:26:26 PM  all   10.70    0.00   35.80    0.00    7.64   26.62    0.00    0.00    0.00   19.24
04:26:26 PM    0    0.00    0.00    0.00    0.00   16.33   50.00    0.00    0.00    0.00   33.67
04:26:26 PM    1    0.00    0.00    0.00    0.00   15.46   69.07    0.00    0.00    0.00   15.46
04:26:26 PM    2    0.00    0.00    0.00    0.00   13.13   39.39    0.00    0.00    0.00   47.47
04:26:26 PM    3    0.00    0.00    0.00    0.00   13.27   55.10    0.00    0.00    0.00   31.63
<snip>

Here is the benchmark with this patch (prototype of new model):

Test environment:
- CPU: Intel Xeon Platinum, 8 cores; memory: 32 GiB; NIC: Mellanox CX4.
- nginx + wrk HTTP benchmark.
- nginx: access_log disabled, keepalive_timeout and keepalive_requests
  increased, long-lived connections.
- wrk: 8 threads and 100, 200, 400 connections (see the sketch below).
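
The exact nginx directives and wrk command line are not reproduced here;
a setup along the following lines can be used, where the values are only
illustrative assumptions, not the exact ones from the runs above:

        # nginx.conf fragment (example values)
        #     access_log off;
        #     keepalive_timeout 600s;
        #     keepalive_requests 1000000;

        # wrk run, e.g. for the 400-connection case
        wrk -t8 -c400 -d60s http://<server-address>/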

Benchmark result:

Connections          100        200        400
QPS w/o patch  338502.49  359216.66  398167.16
QPS w/  patch  677247.40  694193.70  812502.69
Ratio           +100.07%    +93.25%   +104.06%

These prototype patches show nearly a 1x increase in QPS.

The benchmarks with 100, 200, 400 connections use 1, 1, 2 link groups
respectively. With one link group, send/recv work is spread over two
cores. With more than one link group, it spreads over more cores.
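
To observe how many link groups and links are actually in use during a
run, the smcr and smcss tools from the smc-tools package can help,
assuming an smc-tools version that already provides the smcr command:

        # list SMC-R link groups and their links (assumed smc-tools >= 1.7)
        smcr linkgroup show

        # list active SMC sockets
        smcss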

If the application's connection count does not exceed the per-link-group
limit (SMC_RMBS_PER_LGR_MAX, 255), only one link group is used and the
usable CPU resources stay restricted. This series therefore introduces a
tunable way to lower the hard limit on the number of connections per link
group. It eases the restriction of fewer CQs (cores) and reduces
contention for shared resources such as link-level CDC slots. The right
choice depends on the application scenario, so this series provides a
userspace knob: users can choose to share link groups to save resources,
or to create more link groups for fewer limitations.

Patches 1-4 introduce multiple CQs support.
- Patch 1 spreads the CQs over two cores; it helps when there are fewer
  connections.
- Patches 2, 3 and 4 introduce multiple CQs support, involving a new
  medium to tie link and ibdev together, and load balancing between
  different completion vectors and CQs.
- The policy for spreading CQs is still being worked out and tested for
  the highest performance, such as splitting recv/send CQs, joining them
  together, or binding recv/recv (send/send) CQs to the same vector, and
  so on. Glad to hear your advice.

Patch 5 provides a medium for userspace control knobs.
- It mainly provides two knobs to adjust the buffer sizes of SMC sockets.
  We found that too small a buffer makes SMC wait a long time for buffer
  space and limits performance.
- It introduces a sysctl framework, just for tuning; netlink would also
  work. Sysctl is easier to compose as a patch and needs no userspace
  example. I am glad to hear your advice on the userspace control
  interface.

Patch 6 introduces a tunable knob to decrease the number of connections
per link group, which increases parallelism as mentioned above; an
example of tuning the knobs from patches 5 and 6 is sketched below.
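
For illustration, with the sysctl names introduced in patches 5 and 6
(net.smc.wmem_default, net.smc.rmem_default, net.smc.max_lgr_conns), the
tuning could look like the following; the values are only examples:

        # enlarge the default SMC send/receive buffer sizes (example values)
        sysctl -w net.smc.wmem_default=262144
        sysctl -w net.smc.rmem_default=262144

        # lower the per-link-group connection limit so that more link groups
        # (and therefore more CQs and cores) are used; range 1..255
        sysctl -w net.smc.max_lgr_conns=64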

These patches are still being improved; I am very glad to hear your advice.

Thank you.

Tony Lu (6):
  net/smc: Spread CQs to differents completion vectors
  net/smc: Prepare for multiple CQs per IB devices
  net/smc: Introduce smc_ib_cq to bind link and cq
  net/smc: Multiple CQs per IB devices
  net/smc: Unbind buffer size from clcsock and make it tunable
  net/smc: Introduce tunable linkgroup max connections

 Documentation/networking/smc-sysctl.rst |  20 +++
 include/net/netns/smc.h                 |   6 +
 net/smc/Makefile                        |   2 +-
 net/smc/af_smc.c                        |  18 ++-
 net/smc/smc_core.c                      |   2 +-
 net/smc/smc_core.h                      |   2 +
 net/smc/smc_ib.c                        | 178 ++++++++++++++++++++----
 net/smc/smc_ib.h                        |  17 ++-
 net/smc/smc_sysctl.c                    |  92 ++++++++++++
 net/smc/smc_sysctl.h                    |  22 +++
 net/smc/smc_wr.c                        |  42 +++---
 11 files changed, 350 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/networking/smc-sysctl.rst
 create mode 100644 net/smc/smc_sysctl.c
 create mode 100644 net/smc/smc_sysctl.h

-- 
2.32.0.3.g01195cf9f



* [RFC PATCH net-next 1/6] net/smc: Spread CQs to differents completion vectors
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
@ 2022-01-14  5:48 ` Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 2/6] net/smc: Prepare for multiple CQs per IB devices Tony Lu
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

This spreads the recv/send CQs over different completion vectors. It
removes the limitation of a single vector, which is bound to a single CPU.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_ib.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index a3e2d3b89568..d1f337522bd5 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -823,6 +823,9 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 		smcibdev->roce_cq_send = NULL;
 		goto out;
 	}
+	/* spread to different completion vector */
+	if (smcibdev->ibdev->num_comp_vectors > 1)
+		cqattr.comp_vector = 1;
 	smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
 					      smc_wr_rx_cq_handler, NULL,
 					      smcibdev, &cqattr);
-- 
2.32.0.3.g01195cf9f



* [RFC PATCH net-next 2/6] net/smc: Prepare for multiple CQs per IB devices
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 1/6] net/smc: Spread CQs to differents completion vectors Tony Lu
@ 2022-01-14  5:48 ` Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 3/6] net/smc: Introduce smc_ib_cq to bind link and cq Tony Lu
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

This introduces a helper that tracks the load of completion vectors.
During IB device setup, it helps pick the least used vector of the
current device. Only one CQ per direction and two vectors are needed for
now, so it has no practical effect yet. This prepares for multiple CQs
support.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_ib.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
 net/smc/smc_ib.h |  1 +
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index d1f337522bd5..9a162810ed8c 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -625,6 +625,28 @@ int smcr_nl_get_device(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+static int smc_ib_get_least_used_vector(struct smc_ib_device *smcibdev)
+{
+	int min = smcibdev->vector_load[0];
+	int i, index = 0;
+
+	/* use it from the beginning of vectors */
+	for (i = 0; i < smcibdev->ibdev->num_comp_vectors; i++) {
+		if (smcibdev->vector_load[i] < min) {
+			index = i;
+			min = smcibdev->vector_load[i];
+		}
+	}
+
+	smcibdev->vector_load[index]++;
+	return index;
+}
+
+static void smc_ib_put_vector(struct smc_ib_device *smcibdev, int index)
+{
+	smcibdev->vector_load[index]--;
+}
+
 static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
 {
 	struct smc_link *lnk = (struct smc_link *)priv;
@@ -801,8 +823,8 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
 
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 {
-	struct ib_cq_init_attr cqattr =	{
-		.cqe = SMC_MAX_CQE, .comp_vector = 0 };
+	struct ib_cq_init_attr cqattr =	{ .cqe = SMC_MAX_CQE };
+	int cq_send_vector, cq_recv_vector;
 	int cqe_size_order, smc_order;
 	long rc;
 
@@ -815,31 +837,35 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 	smc_order = MAX_ORDER - cqe_size_order - 1;
 	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
 		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
+	cq_send_vector = smc_ib_get_least_used_vector(smcibdev);
+	cqattr.comp_vector = cq_send_vector;
 	smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
 					      smc_wr_tx_cq_handler, NULL,
 					      smcibdev, &cqattr);
 	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_send);
 	if (IS_ERR(smcibdev->roce_cq_send)) {
 		smcibdev->roce_cq_send = NULL;
-		goto out;
+		goto err_send;
 	}
-	/* spread to different completion vector */
-	if (smcibdev->ibdev->num_comp_vectors > 1)
-		cqattr.comp_vector = 1;
+	cq_recv_vector = smc_ib_get_least_used_vector(smcibdev);
+	cqattr.comp_vector = cq_recv_vector;
 	smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
 					      smc_wr_rx_cq_handler, NULL,
 					      smcibdev, &cqattr);
 	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_recv);
 	if (IS_ERR(smcibdev->roce_cq_recv)) {
 		smcibdev->roce_cq_recv = NULL;
-		goto err;
+		goto err_recv;
 	}
 	smc_wr_add_dev(smcibdev);
 	smcibdev->initialized = 1;
 	goto out;
 
-err:
+err_recv:
+	smc_ib_put_vector(smcibdev, cq_recv_vector);
 	ib_destroy_cq(smcibdev->roce_cq_send);
+err_send:
+	smc_ib_put_vector(smcibdev, cq_send_vector);
 out:
 	mutex_unlock(&smcibdev->mutex);
 	return rc;
@@ -928,6 +954,11 @@ static int smc_ib_add_dev(struct ib_device *ibdev)
 	INIT_IB_EVENT_HANDLER(&smcibdev->event_handler, smcibdev->ibdev,
 			      smc_ib_global_event_handler);
 	ib_register_event_handler(&smcibdev->event_handler);
+	/* vector's load per ib device */
+	smcibdev->vector_load = kcalloc(ibdev->num_comp_vectors,
+					sizeof(int), GFP_KERNEL);
+	if (!smcibdev->vector_load)
+		return -ENOMEM;
 
 	/* trigger reading of the port attributes */
 	port_cnt = smcibdev->ibdev->phys_port_cnt;
@@ -968,6 +999,7 @@ static void smc_ib_remove_dev(struct ib_device *ibdev, void *client_data)
 	smc_ib_cleanup_per_ibdev(smcibdev);
 	ib_unregister_event_handler(&smcibdev->event_handler);
 	cancel_work_sync(&smcibdev->port_event_work);
+	kfree(smcibdev->vector_load);
 	kfree(smcibdev);
 }
 
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 5d8b49c57f50..a748b74e56e6 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -57,6 +57,7 @@ struct smc_ib_device {				/* ib-device infos for smc */
 	atomic_t		lnk_cnt_by_port[SMC_MAX_PORTS];
 						/* number of links per port */
 	int			ndev_ifidx[SMC_MAX_PORTS]; /* ndev if indexes */
+	int			*vector_load;	/* load of all completion vectors */
 };
 
 static inline __be32 smc_ib_gid_to_ipv4(u8 gid[SMC_GID_SIZE])
-- 
2.32.0.3.g01195cf9f



* [RFC PATCH net-next 3/6] net/smc: Introduce smc_ib_cq to bind link and cq
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 1/6] net/smc: Spread CQs to differents completion vectors Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 2/6] net/smc: Prepare for multiple CQs per IB devices Tony Lu
@ 2022-01-14  5:48 ` Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 4/6] net/smc: Multiple CQs per IB devices Tony Lu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

This patch introduces struct smc_ib_cq as a medium between smc_link and
ib_cq. Every smc_link can reach its ib_cq through its own pointers, which
unbinds smc_link from smc_ib_device. This allows flexible mapping and
prepares for multiple CQs support.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_core.h |  2 ++
 net/smc/smc_ib.c   | 52 +++++++++++++++++++++++++++++++++-------------
 net/smc/smc_ib.h   | 14 +++++++++----
 net/smc/smc_wr.c   | 34 +++++++++++++++---------------
 4 files changed, 67 insertions(+), 35 deletions(-)

diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 521c64a3d8d3..fd10cad8fb77 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -86,6 +86,8 @@ struct smc_link {
 	struct ib_pd		*roce_pd;	/* IB protection domain,
 						 * unique for every RoCE QP
 						 */
+	struct smc_ib_cq	*smcibcq_recv;	/* cq for recv */
+	struct smc_ib_cq	*smcibcq_send;	/* cq for send */
 	struct ib_qp		*roce_qp;	/* IB queue pair */
 	struct ib_qp_attr	qp_attr;	/* IB queue pair attributes */
 
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 9a162810ed8c..b08b9af4c156 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -133,7 +133,7 @@ int smc_ib_ready_link(struct smc_link *lnk)
 	if (rc)
 		goto out;
 	smc_wr_remember_qp_attr(lnk);
-	rc = ib_req_notify_cq(lnk->smcibdev->roce_cq_recv,
+	rc = ib_req_notify_cq(lnk->smcibcq_recv->roce_cq,
 			      IB_CQ_SOLICITED_MASK);
 	if (rc)
 		goto out;
@@ -672,6 +672,8 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 {
 	if (lnk->roce_qp)
 		ib_destroy_qp(lnk->roce_qp);
+	lnk->smcibcq_send = NULL;
+	lnk->smcibcq_recv = NULL;
 	lnk->roce_qp = NULL;
 }
 
@@ -682,8 +684,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	struct ib_qp_init_attr qp_attr = {
 		.event_handler = smc_ib_qp_event_handler,
 		.qp_context = lnk,
-		.send_cq = lnk->smcibdev->roce_cq_send,
-		.recv_cq = lnk->smcibdev->roce_cq_recv,
+		.send_cq = lnk->smcibdev->roce_cq_send->roce_cq,
+		.recv_cq = lnk->smcibdev->roce_cq_recv->roce_cq,
 		.srq = NULL,
 		.cap = {
 				/* include unsolicited rdma_writes as well,
@@ -701,10 +703,13 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
-	if (IS_ERR(lnk->roce_qp))
+	if (IS_ERR(lnk->roce_qp)) {
 		lnk->roce_qp = NULL;
-	else
+	} else {
+		lnk->smcibcq_send = lnk->smcibdev->roce_cq_send;
+		lnk->smcibcq_recv = lnk->smcibdev->roce_cq_recv;
 		smc_wr_remember_qp_attr(lnk);
+	}
 	return rc;
 }
 
@@ -824,6 +829,7 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 {
 	struct ib_cq_init_attr cqattr =	{ .cqe = SMC_MAX_CQE };
+	struct smc_ib_cq *smcibcq_send, *smcibcq_recv;
 	int cq_send_vector, cq_recv_vector;
 	int cqe_size_order, smc_order;
 	long rc;
@@ -837,34 +843,52 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 	smc_order = MAX_ORDER - cqe_size_order - 1;
 	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
 		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
+	smcibcq_send = kmalloc(sizeof(*smcibcq_send), GFP_KERNEL);
+	if (!smcibcq_send) {
+		rc = -ENOMEM;
+		goto out;
+	}
 	cq_send_vector = smc_ib_get_least_used_vector(smcibdev);
+	smcibcq_send->smcibdev = smcibdev;
+	smcibcq_send->is_send = 1;
 	cqattr.comp_vector = cq_send_vector;
-	smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
-					      smc_wr_tx_cq_handler, NULL,
-					      smcibdev, &cqattr);
+	smcibcq_send->roce_cq = ib_create_cq(smcibdev->ibdev,
+					     smc_wr_tx_cq_handler, NULL,
+					     smcibcq_send, &cqattr);
 	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_send);
 	if (IS_ERR(smcibdev->roce_cq_send)) {
 		smcibdev->roce_cq_send = NULL;
 		goto err_send;
 	}
+	smcibdev->roce_cq_send = smcibcq_send;
+	smcibcq_recv = kmalloc(sizeof(*smcibcq_recv), GFP_KERNEL);
+	if (!smcibcq_recv) {
+		rc = -ENOMEM;
+		goto err_send;
+	}
 	cq_recv_vector = smc_ib_get_least_used_vector(smcibdev);
+	smcibcq_recv->smcibdev = smcibdev;
+	smcibcq_recv->is_send = 0;
 	cqattr.comp_vector = cq_recv_vector;
-	smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
-					      smc_wr_rx_cq_handler, NULL,
-					      smcibdev, &cqattr);
+	smcibcq_recv->roce_cq = ib_create_cq(smcibdev->ibdev,
+					     smc_wr_rx_cq_handler, NULL,
+					     smcibcq_recv, &cqattr);
 	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_recv);
 	if (IS_ERR(smcibdev->roce_cq_recv)) {
 		smcibdev->roce_cq_recv = NULL;
 		goto err_recv;
 	}
+	smcibdev->roce_cq_recv = smcibcq_recv;
 	smc_wr_add_dev(smcibdev);
 	smcibdev->initialized = 1;
 	goto out;
 
 err_recv:
+	kfree(smcibcq_recv);
 	smc_ib_put_vector(smcibdev, cq_recv_vector);
-	ib_destroy_cq(smcibdev->roce_cq_send);
+	ib_destroy_cq(smcibcq_send->roce_cq);
 err_send:
+	kfree(smcibcq_send);
 	smc_ib_put_vector(smcibdev, cq_send_vector);
 out:
 	mutex_unlock(&smcibdev->mutex);
@@ -877,8 +901,8 @@ static void smc_ib_cleanup_per_ibdev(struct smc_ib_device *smcibdev)
 	if (!smcibdev->initialized)
 		goto out;
 	smcibdev->initialized = 0;
-	ib_destroy_cq(smcibdev->roce_cq_recv);
-	ib_destroy_cq(smcibdev->roce_cq_send);
+	ib_destroy_cq(smcibdev->roce_cq_recv->roce_cq);
+	ib_destroy_cq(smcibdev->roce_cq_send->roce_cq);
 	smc_wr_remove_dev(smcibdev);
 out:
 	mutex_unlock(&smcibdev->mutex);
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index a748b74e56e6..5b34274ecf47 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -32,15 +32,21 @@ struct smc_ib_devices {			/* list of smc ib devices definition */
 extern struct smc_ib_devices	smc_ib_devices; /* list of smc ib devices */
 extern struct smc_lgr_list smc_lgr_list; /* list of linkgroups */
 
+struct smc_ib_cq {		/* ib_cq wrapper for smc */
+	struct list_head	list;
+	struct smc_ib_device	*smcibdev;	/* parent ib device */
+	struct ib_cq		*roce_cq;	/* real ib_cq for link */
+	struct tasklet_struct	tasklet;	/* tasklet for wr */
+	bool			is_send;	/* send for recv cq */
+};
+
 struct smc_ib_device {				/* ib-device infos for smc */
 	struct list_head	list;
 	struct ib_device	*ibdev;
 	struct ib_port_attr	pattr[SMC_MAX_PORTS];	/* ib dev. port attrs */
 	struct ib_event_handler	event_handler;	/* global ib_event handler */
-	struct ib_cq		*roce_cq_send;	/* send completion queue */
-	struct ib_cq		*roce_cq_recv;	/* recv completion queue */
-	struct tasklet_struct	send_tasklet;	/* called by send cq handler */
-	struct tasklet_struct	recv_tasklet;	/* called by recv cq handler */
+	struct smc_ib_cq	*roce_cq_send;	/* send completion queue */
+	struct smc_ib_cq	*roce_cq_recv;	/* recv completion queue */
 	char			mac[SMC_MAX_PORTS][ETH_ALEN];
 						/* mac address per port*/
 	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 24be1d03fef9..011435efb65b 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -135,7 +135,7 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
 
 static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
 {
-	struct smc_ib_device *dev = from_tasklet(dev, t, send_tasklet);
+	struct smc_ib_cq *smcibcq = from_tasklet(smcibcq, t, tasklet);
 	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
 	int i = 0, rc;
 	int polled = 0;
@@ -144,9 +144,9 @@ static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
 	polled++;
 	do {
 		memset(&wc, 0, sizeof(wc));
-		rc = ib_poll_cq(dev->roce_cq_send, SMC_WR_MAX_POLL_CQE, wc);
+		rc = ib_poll_cq(smcibcq->roce_cq, SMC_WR_MAX_POLL_CQE, wc);
 		if (polled == 1) {
-			ib_req_notify_cq(dev->roce_cq_send,
+			ib_req_notify_cq(smcibcq->roce_cq,
 					 IB_CQ_NEXT_COMP |
 					 IB_CQ_REPORT_MISSED_EVENTS);
 		}
@@ -161,9 +161,9 @@ static void smc_wr_tx_tasklet_fn(struct tasklet_struct *t)
 
 void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
 {
-	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
+	struct smc_ib_cq *smcibcq = (struct smc_ib_cq *)cq_context;
 
-	tasklet_schedule(&dev->send_tasklet);
+	tasklet_schedule(&smcibcq->tasklet);
 }
 
 /*---------------------------- request submission ---------------------------*/
@@ -306,7 +306,7 @@ int smc_wr_tx_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv)
 	struct smc_wr_tx_pend *pend;
 	int rc;
 
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
+	ib_req_notify_cq(link->smcibcq_send->roce_cq,
 			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	pend = container_of(priv, struct smc_wr_tx_pend, priv);
 	rc = ib_post_send(link->roce_qp, &link->wr_tx_ibs[pend->idx], NULL);
@@ -323,7 +323,7 @@ int smc_wr_tx_v2_send(struct smc_link *link, struct smc_wr_tx_pend_priv *priv,
 	int rc;
 
 	link->wr_tx_v2_ib->sg_list[0].length = len;
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
+	ib_req_notify_cq(link->smcibcq_send->roce_cq,
 			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	rc = ib_post_send(link->roce_qp, link->wr_tx_v2_ib, NULL);
 	if (rc) {
@@ -367,7 +367,7 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr)
 {
 	int rc;
 
-	ib_req_notify_cq(link->smcibdev->roce_cq_send,
+	ib_req_notify_cq(link->smcibcq_send->roce_cq,
 			 IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
 	link->wr_reg_state = POSTED;
 	link->wr_reg.wr.wr_id = (u64)(uintptr_t)mr;
@@ -476,7 +476,7 @@ static inline void smc_wr_rx_process_cqes(struct ib_wc wc[], int num)
 
 static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
 {
-	struct smc_ib_device *dev = from_tasklet(dev, t, recv_tasklet);
+	struct smc_ib_cq *smcibcq = from_tasklet(smcibcq, t, tasklet);
 	struct ib_wc wc[SMC_WR_MAX_POLL_CQE];
 	int polled = 0;
 	int rc;
@@ -485,9 +485,9 @@ static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
 	polled++;
 	do {
 		memset(&wc, 0, sizeof(wc));
-		rc = ib_poll_cq(dev->roce_cq_recv, SMC_WR_MAX_POLL_CQE, wc);
+		rc = ib_poll_cq(smcibcq->roce_cq, SMC_WR_MAX_POLL_CQE, wc);
 		if (polled == 1) {
-			ib_req_notify_cq(dev->roce_cq_recv,
+			ib_req_notify_cq(smcibcq->roce_cq,
 					 IB_CQ_SOLICITED_MASK
 					 | IB_CQ_REPORT_MISSED_EVENTS);
 		}
@@ -501,9 +501,9 @@ static void smc_wr_rx_tasklet_fn(struct tasklet_struct *t)
 
 void smc_wr_rx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
 {
-	struct smc_ib_device *dev = (struct smc_ib_device *)cq_context;
+	struct smc_ib_cq *smcibcq = (struct smc_ib_cq *)cq_context;
 
-	tasklet_schedule(&dev->recv_tasklet);
+	tasklet_schedule(&smcibcq->tasklet);
 }
 
 int smc_wr_rx_post_init(struct smc_link *link)
@@ -830,14 +830,14 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 
 void smc_wr_remove_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_kill(&smcibdev->recv_tasklet);
-	tasklet_kill(&smcibdev->send_tasklet);
+	tasklet_kill(&smcibdev->roce_cq_recv->tasklet);
+	tasklet_kill(&smcibdev->roce_cq_send->tasklet);
 }
 
 void smc_wr_add_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_setup(&smcibdev->recv_tasklet, smc_wr_rx_tasklet_fn);
-	tasklet_setup(&smcibdev->send_tasklet, smc_wr_tx_tasklet_fn);
+	tasklet_setup(&smcibdev->roce_cq_recv->tasklet, smc_wr_rx_tasklet_fn);
+	tasklet_setup(&smcibdev->roce_cq_send->tasklet, smc_wr_tx_tasklet_fn);
 }
 
 int smc_wr_create_link(struct smc_link *lnk)
-- 
2.32.0.3.g01195cf9f



* [RFC PATCH net-next 4/6] net/smc: Multiple CQs per IB devices
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
                   ` (2 preceding siblings ...)
  2022-01-14  5:48 ` [RFC PATCH net-next 3/6] net/smc: Introduce smc_ib_cq to bind link and cq Tony Lu
@ 2022-01-14  5:48 ` Tony Lu
  2022-01-14  5:48 ` [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable Tony Lu
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

This patch allows multiple CQs per IB device, compared to the single CQ
used now.

During IB device setup, it initializes ibdev->num_comp_vectors send/recv
CQs and the corresponding tasklets, similar to the queues of net devices.

Every smc_link has its own send and recv CQs, which are always assigned
from the least used CQs of the current IB device.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_ib.c | 165 +++++++++++++++++++++++++++++++++--------------
 net/smc/smc_ib.h |   6 +-
 net/smc/smc_wr.c |  16 +++--
 3 files changed, 132 insertions(+), 55 deletions(-)

diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index b08b9af4c156..19c49184cd03 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -647,6 +647,38 @@ static void smc_ib_put_vector(struct smc_ib_device *smcibdev, int index)
 	smcibdev->vector_load[index]--;
 }
 
+static struct smc_ib_cq *smc_ib_get_least_used_cq(struct smc_ib_device *smcibdev,
+						  bool is_send)
+{
+	struct smc_ib_cq *smcibcq, *cq;
+	struct list_head *head;
+	int min;
+
+	if (is_send)
+		head = &smcibdev->smcibcq_send;
+	else
+		head = &smcibdev->smcibcq_recv;
+	cq = list_first_entry(head, struct smc_ib_cq, list);
+	min = cq->load;
+
+	list_for_each_entry(smcibcq, head, list) {
+		if (smcibcq->load < min) {
+			cq = smcibcq;
+			min = cq->load;
+		}
+	}
+	if (!cq)
+		cq = smcibcq;
+
+	cq->load++;
+	return cq;
+}
+
+static void smc_ib_put_cq(struct smc_ib_cq *smcibcq)
+{
+	smcibcq->load--;
+}
+
 static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
 {
 	struct smc_link *lnk = (struct smc_link *)priv;
@@ -670,8 +702,11 @@ static void smc_ib_qp_event_handler(struct ib_event *ibevent, void *priv)
 
 void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 {
-	if (lnk->roce_qp)
+	if (lnk->roce_qp) {
 		ib_destroy_qp(lnk->roce_qp);
+		smc_ib_put_cq(lnk->smcibcq_send);
+		smc_ib_put_cq(lnk->smcibcq_recv);
+	}
 	lnk->smcibcq_send = NULL;
 	lnk->smcibcq_recv = NULL;
 	lnk->roce_qp = NULL;
@@ -680,12 +715,16 @@ void smc_ib_destroy_queue_pair(struct smc_link *lnk)
 /* create a queue pair within the protection domain for a link */
 int smc_ib_create_queue_pair(struct smc_link *lnk)
 {
+	struct smc_ib_cq *smcibcq_send = smc_ib_get_least_used_cq(lnk->smcibdev,
+								  true);
+	struct smc_ib_cq *smcibcq_recv = smc_ib_get_least_used_cq(lnk->smcibdev,
+								  false);
 	int sges_per_buf = (lnk->lgr->smc_version == SMC_V2) ? 2 : 1;
 	struct ib_qp_init_attr qp_attr = {
 		.event_handler = smc_ib_qp_event_handler,
 		.qp_context = lnk,
-		.send_cq = lnk->smcibdev->roce_cq_send->roce_cq,
-		.recv_cq = lnk->smcibdev->roce_cq_recv->roce_cq,
+		.send_cq = smcibcq_send->roce_cq,
+		.recv_cq = smcibcq_recv->roce_cq,
 		.srq = NULL,
 		.cap = {
 				/* include unsolicited rdma_writes as well,
@@ -706,8 +745,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	if (IS_ERR(lnk->roce_qp)) {
 		lnk->roce_qp = NULL;
 	} else {
-		lnk->smcibcq_send = lnk->smcibdev->roce_cq_send;
-		lnk->smcibcq_recv = lnk->smcibdev->roce_cq_recv;
+		lnk->smcibcq_send = smcibcq_send;
+		lnk->smcibcq_recv = smcibcq_recv;
 		smc_wr_remember_qp_attr(lnk);
 	}
 	return rc;
@@ -826,6 +865,24 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
 	buf_slot->sgt[lnk->link_idx].sgl->dma_address = 0;
 }
 
+static void smc_ib_cleanup_cq(struct smc_ib_device *smcibdev)
+{
+	struct smc_ib_cq *smcibcq, *cq;
+
+	list_for_each_entry_safe(smcibcq, cq, &smcibdev->smcibcq_send, list) {
+		list_del(&smcibcq->list);
+		ib_destroy_cq(smcibcq->roce_cq);
+		smc_ib_put_vector(smcibdev, smcibcq->comp_vector);
+		kfree(smcibcq);
+	}
+	list_for_each_entry_safe(smcibcq, cq, &smcibdev->smcibcq_recv, list) {
+		list_del(&smcibcq->list);
+		ib_destroy_cq(smcibcq->roce_cq);
+		smc_ib_put_vector(smcibdev, smcibcq->comp_vector);
+		kfree(smcibcq);
+	}
+}
+
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 {
 	struct ib_cq_init_attr cqattr =	{ .cqe = SMC_MAX_CQE };
@@ -833,6 +890,7 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 	int cq_send_vector, cq_recv_vector;
 	int cqe_size_order, smc_order;
 	long rc;
+	int i;
 
 	mutex_lock(&smcibdev->mutex);
 	rc = 0;
@@ -843,53 +901,61 @@ long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev)
 	smc_order = MAX_ORDER - cqe_size_order - 1;
 	if (SMC_MAX_CQE + 2 > (0x00000001 << smc_order) * PAGE_SIZE)
 		cqattr.cqe = (0x00000001 << smc_order) * PAGE_SIZE - 2;
-	smcibcq_send = kmalloc(sizeof(*smcibcq_send), GFP_KERNEL);
-	if (!smcibcq_send) {
-		rc = -ENOMEM;
-		goto out;
-	}
-	cq_send_vector = smc_ib_get_least_used_vector(smcibdev);
-	smcibcq_send->smcibdev = smcibdev;
-	smcibcq_send->is_send = 1;
-	cqattr.comp_vector = cq_send_vector;
-	smcibcq_send->roce_cq = ib_create_cq(smcibdev->ibdev,
-					     smc_wr_tx_cq_handler, NULL,
-					     smcibcq_send, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_send);
-	if (IS_ERR(smcibdev->roce_cq_send)) {
-		smcibdev->roce_cq_send = NULL;
-		goto err_send;
-	}
-	smcibdev->roce_cq_send = smcibcq_send;
-	smcibcq_recv = kmalloc(sizeof(*smcibcq_recv), GFP_KERNEL);
-	if (!smcibcq_recv) {
-		rc = -ENOMEM;
-		goto err_send;
-	}
-	cq_recv_vector = smc_ib_get_least_used_vector(smcibdev);
-	smcibcq_recv->smcibdev = smcibdev;
-	smcibcq_recv->is_send = 0;
-	cqattr.comp_vector = cq_recv_vector;
-	smcibcq_recv->roce_cq = ib_create_cq(smcibdev->ibdev,
-					     smc_wr_rx_cq_handler, NULL,
-					     smcibcq_recv, &cqattr);
-	rc = PTR_ERR_OR_ZERO(smcibdev->roce_cq_recv);
-	if (IS_ERR(smcibdev->roce_cq_recv)) {
-		smcibdev->roce_cq_recv = NULL;
-		goto err_recv;
+
+	/* initialize send/recv CQs */
+	for (i = 0; i < smcibdev->ibdev->num_comp_vectors; i++) {
+		/* initialize send CQ */
+		smcibcq_send = kmalloc(sizeof(*smcibcq_send), GFP_KERNEL);
+		if (!smcibcq_send) {
+			rc = -ENOMEM;
+			goto err;
+		}
+		cq_send_vector = smc_ib_get_least_used_vector(smcibdev);
+		smcibcq_send->smcibdev = smcibdev;
+		smcibcq_send->load = 0;
+		smcibcq_send->is_send = 1;
+		smcibcq_send->comp_vector = cq_send_vector;
+		INIT_LIST_HEAD(&smcibcq_send->list);
+		cqattr.comp_vector = cq_send_vector;
+		smcibcq_send->roce_cq = ib_create_cq(smcibdev->ibdev,
+						     smc_wr_tx_cq_handler, NULL,
+						     smcibcq_send, &cqattr);
+		rc = PTR_ERR_OR_ZERO(smcibcq_send->roce_cq);
+		if (IS_ERR(smcibcq_send->roce_cq)) {
+			smcibcq_send->roce_cq = NULL;
+			goto err;
+		}
+		list_add_tail(&smcibcq_send->list, &smcibdev->smcibcq_send);
+
+		/* initialize recv CQ */
+		smcibcq_recv = kmalloc(sizeof(*smcibcq_recv), GFP_KERNEL);
+		if (!smcibcq_recv) {
+			rc = -ENOMEM;
+			goto err;
+		}
+		cq_recv_vector = smc_ib_get_least_used_vector(smcibdev);
+		smcibcq_recv->smcibdev = smcibdev;
+		smcibcq_recv->load = 0;
+		smcibcq_recv->is_send = 0;
+		smcibcq_recv->comp_vector = cq_recv_vector;
+		INIT_LIST_HEAD(&smcibcq_recv->list);
+		cqattr.comp_vector = cq_recv_vector;
+		smcibcq_recv->roce_cq = ib_create_cq(smcibdev->ibdev,
+						     smc_wr_rx_cq_handler, NULL,
+						     smcibcq_recv, &cqattr);
+		rc = PTR_ERR_OR_ZERO(smcibcq_recv->roce_cq);
+		if (IS_ERR(smcibcq_recv->roce_cq)) {
+			smcibcq_recv->roce_cq = NULL;
+			goto err;
+		}
+		list_add_tail(&smcibcq_recv->list, &smcibdev->smcibcq_recv);
 	}
-	smcibdev->roce_cq_recv = smcibcq_recv;
 	smc_wr_add_dev(smcibdev);
 	smcibdev->initialized = 1;
 	goto out;
 
-err_recv:
-	kfree(smcibcq_recv);
-	smc_ib_put_vector(smcibdev, cq_recv_vector);
-	ib_destroy_cq(smcibcq_send->roce_cq);
-err_send:
-	kfree(smcibcq_send);
-	smc_ib_put_vector(smcibdev, cq_send_vector);
+err:
+	smc_ib_cleanup_cq(smcibdev);
 out:
 	mutex_unlock(&smcibdev->mutex);
 	return rc;
@@ -901,8 +967,7 @@ static void smc_ib_cleanup_per_ibdev(struct smc_ib_device *smcibdev)
 	if (!smcibdev->initialized)
 		goto out;
 	smcibdev->initialized = 0;
-	ib_destroy_cq(smcibdev->roce_cq_recv->roce_cq);
-	ib_destroy_cq(smcibdev->roce_cq_send->roce_cq);
+	smc_ib_cleanup_cq(smcibdev);
 	smc_wr_remove_dev(smcibdev);
 out:
 	mutex_unlock(&smcibdev->mutex);
@@ -978,6 +1043,8 @@ static int smc_ib_add_dev(struct ib_device *ibdev)
 	INIT_IB_EVENT_HANDLER(&smcibdev->event_handler, smcibdev->ibdev,
 			      smc_ib_global_event_handler);
 	ib_register_event_handler(&smcibdev->event_handler);
+	INIT_LIST_HEAD(&smcibdev->smcibcq_send);
+	INIT_LIST_HEAD(&smcibdev->smcibcq_recv);
 	/* vector's load per ib device */
 	smcibdev->vector_load = kcalloc(ibdev->num_comp_vectors,
 					sizeof(int), GFP_KERNEL);
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 5b34274ecf47..1776627f113d 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -38,6 +38,8 @@ struct smc_ib_cq {		/* ib_cq wrapper for smc */
 	struct ib_cq		*roce_cq;	/* real ib_cq for link */
 	struct tasklet_struct	tasklet;	/* tasklet for wr */
 	bool			is_send;	/* send for recv cq */
+	int			comp_vector;	/* index of completion vector */
+	int			load;		/* load of current cq */
 };
 
 struct smc_ib_device {				/* ib-device infos for smc */
@@ -45,8 +47,8 @@ struct smc_ib_device {				/* ib-device infos for smc */
 	struct ib_device	*ibdev;
 	struct ib_port_attr	pattr[SMC_MAX_PORTS];	/* ib dev. port attrs */
 	struct ib_event_handler	event_handler;	/* global ib_event handler */
-	struct smc_ib_cq	*roce_cq_send;	/* send completion queue */
-	struct smc_ib_cq	*roce_cq_recv;	/* recv completion queue */
+	struct list_head	smcibcq_send;	/* all send cqs */
+	struct list_head	smcibcq_recv;	/* all recv cqs */
 	char			mac[SMC_MAX_PORTS][ETH_ALEN];
 						/* mac address per port*/
 	u8			pnetid[SMC_MAX_PORTS][SMC_MAX_PNETID_LEN];
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 011435efb65b..169253e53786 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -830,14 +830,22 @@ int smc_wr_alloc_link_mem(struct smc_link *link)
 
 void smc_wr_remove_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_kill(&smcibdev->roce_cq_recv->tasklet);
-	tasklet_kill(&smcibdev->roce_cq_send->tasklet);
+	struct smc_ib_cq *smcibcq;
+
+	list_for_each_entry(smcibcq, &smcibdev->smcibcq_send, list)
+		tasklet_kill(&smcibcq->tasklet);
+	list_for_each_entry(smcibcq, &smcibdev->smcibcq_recv, list)
+		tasklet_kill(&smcibcq->tasklet);
 }
 
 void smc_wr_add_dev(struct smc_ib_device *smcibdev)
 {
-	tasklet_setup(&smcibdev->roce_cq_recv->tasklet, smc_wr_rx_tasklet_fn);
-	tasklet_setup(&smcibdev->roce_cq_send->tasklet, smc_wr_tx_tasklet_fn);
+	struct smc_ib_cq *smcibcq;
+
+	list_for_each_entry(smcibcq, &smcibdev->smcibcq_send, list)
+		tasklet_setup(&smcibcq->tasklet, smc_wr_tx_tasklet_fn);
+	list_for_each_entry(smcibcq, &smcibdev->smcibcq_recv, list)
+		tasklet_setup(&smcibcq->tasklet, smc_wr_rx_tasklet_fn);
 }
 
 int smc_wr_create_link(struct smc_link *lnk)
-- 
2.32.0.3.g01195cf9f



* [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
                   ` (3 preceding siblings ...)
  2022-01-14  5:48 ` [RFC PATCH net-next 4/6] net/smc: Multiple CQs per IB devices Tony Lu
@ 2022-01-14  5:48 ` Tony Lu
  2022-01-14  9:13   ` kernel test robot
  2022-01-14  9:43   ` kernel test robot
  2022-01-14  5:48 ` [RFC PATCH net-next 6/6] net/smc: Introduce tunable linkgroup max connections Tony Lu
  2022-01-16  9:00 ` [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Leon Romanovsky
  6 siblings, 2 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

SMC uses smc->sk.sk_{rcv|snd}buf to size the send buffer and the RMB, and
these buffer sizes are inherited from the clcsock. The clcsock is a TCP
socket which is created during SMC connection startup.

The inherited buffer sizes don't fit SMC well. TCP provides two sysctl
knobs to tune its r/w buffers, net.ipv4.tcp_{r|w}mem, and SMC uses the
default values from TCP. These sizes are tuned for TCP and don't fit SMC
well in some scenarios. For example, we need larger SMC buffers for
high-throughput applications, and smaller SMC buffers to save contiguous
memory. We need to adjust the buffer sizes separately from TCP, without
disturbing TCP.

This unbinds the buffer sizes from the clcsock and provides sysctl knobs
to adjust them independently. These knobs can be set to different values
in different net namespaces for performance and flexibility.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
---
 Documentation/networking/smc-sysctl.rst | 20 ++++++
 include/net/netns/smc.h                 |  5 ++
 net/smc/Makefile                        |  2 +-
 net/smc/af_smc.c                        | 17 +++++-
 net/smc/smc_sysctl.c                    | 81 +++++++++++++++++++++++++
 net/smc/smc_sysctl.h                    | 22 +++++++
 6 files changed, 144 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/networking/smc-sysctl.rst
 create mode 100644 net/smc/smc_sysctl.c
 create mode 100644 net/smc/smc_sysctl.h

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 000000000000..ba2be59a57dd
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+SMC Sysctl
+==========
+
+/proc/sys/net/smc/* Variables
+==============================
+
+wmem_default - INTEGER
+    Initial size of send buffer used by SMC sockets.
+    The default value inherits from net.ipv4.tcp_wmem[1].
+
+    Default: 16K
+
+rmem_default - INTEGER
+    Initial size of receive buffer (RMB) used by SMC sockets.
+    The default value inherits from net.ipv4.tcp_rmem[1].
+
+    Default: 131072 bytes.
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index ea8a9cf2619b..f948235e3156 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -12,5 +12,10 @@ struct netns_smc {
 	/* protect fback_rsn */
 	struct mutex			mutex_fback_rsn;
 	struct smc_stats_rsn		*fback_rsn;
+#ifdef CONFIG_SYSCTL
+	struct ctl_table_header		*smc_hdr;
+#endif
+	int				sysctl_wmem_default;
+	int				sysctl_rmem_default;
 };
 #endif
diff --git a/net/smc/Makefile b/net/smc/Makefile
index 196fb6f01b14..640af9a39f9c 100644
--- a/net/smc/Makefile
+++ b/net/smc/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_SMC)	+= smc.o
 obj-$(CONFIG_SMC_DIAG)	+= smc_diag.o
 smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o
 smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o
-smc-y += smc_tracepoint.o
+smc-y += smc_tracepoint.o smc_sysctl.o
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index ffab9cee747d..0650b5971e0a 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -51,6 +51,7 @@
 #include "smc_close.h"
 #include "smc_stats.h"
 #include "smc_tracepoint.h"
+#include "smc_sysctl.h"
 
 static DEFINE_MUTEX(smc_server_lgr_pending);	/* serialize link group
 						 * creation on server
@@ -2851,8 +2852,8 @@ static int __smc_create(struct net *net, struct socket *sock, int protocol,
 		smc->clcsock = clcsock;
 	}
 
-	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
-	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
+	smc->sk.sk_sndbuf = sock_net(sk)->smc.sysctl_wmem_default;
+	smc->sk.sk_rcvbuf = sock_net(sk)->smc.sysctl_rmem_default;
 
 out:
 	return rc;
@@ -2932,6 +2933,11 @@ unsigned int smc_net_id;
 
 static __net_init int smc_net_init(struct net *net)
 {
+	net->smc.sysctl_wmem_default = max(net->ipv4.sysctl_tcp_wmem[1],
+					   SMC_BUF_MIN_SIZE);
+	net->smc.sysctl_rmem_default = max(net->ipv4.sysctl_tcp_rmem[1],
+					   SMC_BUF_MIN_SIZE);
+
 	return smc_pnet_net_init(net);
 }
 
@@ -3044,6 +3050,12 @@ static int __init smc_init(void)
 		goto out_sock;
 	}
 
+	rc = smc_sysctl_init();
+	if (rc) {
+		pr_err("%s: sysctl fails with %d\n", __func__, rc);
+		goto out_sock;
+	}
+
 	static_branch_enable(&tcp_have_smc);
 	return 0;
 
@@ -3085,6 +3097,7 @@ static void __exit smc_exit(void)
 	smc_clc_exit();
 	unregister_pernet_subsys(&smc_net_stat_ops);
 	unregister_pernet_subsys(&smc_net_ops);
+	smc_sysctl_exit();
 	rcu_barrier();
 }
 
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
new file mode 100644
index 000000000000..6706fe1bd888
--- /dev/null
+++ b/net/smc/smc_sysctl.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/sysctl.h>
+#include <net/sock.h>
+#include <net/net_namespace.h>
+
+#include "smc_core.h"
+
+static int min_sndbuf = SMC_BUF_MIN_SIZE;
+static int min_rcvbuf = SMC_BUF_MIN_SIZE;
+
+static struct ctl_table smc_table[] = {
+	{
+		.procname	= "wmem_default",
+		.data		= &init_net.smc.sysctl_wmem_default,
+		.maxlen		= sizeof(init_net.smc.sysctl_wmem_default),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_sndbuf,
+	},
+	{
+		.procname	= "rmem_default",
+		.data		= &init_net.smc.sysctl_rmem_default,
+		.maxlen		= sizeof(init_net.smc.sysctl_rmem_default),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_rcvbuf,
+	},
+	{  }
+};
+
+static __net_init int smc_sysctl_init_net(struct net *net)
+{
+	struct ctl_table *table;
+
+	table = smc_table;
+	if (!net_eq(net, &init_net)) {
+		int i;
+
+		table = kmemdup(table, sizeof(smc_table), GFP_KERNEL);
+		if (!table)
+			goto err_alloc;
+
+		for (i = 0; i < ARRAY_SIZE(smc_table) - 1; i++)
+			table[i].data += (void *)net - (void *)&init_net;
+	}
+
+	net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table);
+	if (!net->smc.smc_hdr)
+		goto err_reg;
+
+	return 0;
+
+err_reg:
+	if (!net_eq(net, &init_net))
+		kfree(table);
+err_alloc:
+	return -ENOMEM;
+}
+
+static __net_exit void smc_sysctl_exit_net(struct net *net)
+{
+	unregister_net_sysctl_table(net->smc.smc_hdr);
+}
+
+static struct pernet_operations smc_sysctl_ops __net_initdata = {
+	.init = smc_sysctl_init_net,
+	.exit = smc_sysctl_exit_net,
+};
+
+int __init smc_sysctl_init(void)
+{
+	return register_pernet_subsys(&smc_sysctl_ops);
+}
+
+void smc_sysctl_exit(void)
+{
+	unregister_pernet_subsys(&smc_sysctl_ops);
+}
diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
new file mode 100644
index 000000000000..c01c5de3a3ea
--- /dev/null
+++ b/net/smc/smc_sysctl.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _SMC_SYSCTL_H
+#define _SMC_SYSCTL_H
+
+#ifdef CONFIG_SYSCTL
+
+int smc_sysctl_init(void);
+void smc_sysctl_exit(void);
+
+#else
+
+int smc_sysctl_init(void)
+{
+	return 0;
+}
+
+void smc_sysctl_exit(void) { }
+
+#endif /* CONFIG_SYSCTL */
+
+#endif /* _SMC_SYSCTL_H */
-- 
2.32.0.3.g01195cf9f



* [RFC PATCH net-next 6/6] net/smc: Introduce tunable linkgroup max connections
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
                   ` (4 preceding siblings ...)
  2022-01-14  5:48 ` [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable Tony Lu
@ 2022-01-14  5:48 ` Tony Lu
  2022-01-16  9:00 ` [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Leon Romanovsky
  6 siblings, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-14  5:48 UTC (permalink / raw)
  To: kgraul; +Cc: kuba, davem, netdev, linux-s390

This introduces the tunable sysctl knob max_lgr_conns to tune the maximum
number of connections in one link group. The knob is per net namespace.

Currently, a link group is shared by at most SMC_RMBS_PER_LGR_MAX (255)
connections. They share one QP, which introduces more contention as the
number of connections increases, for example in smc_cdc_get_free_slot(),
which competes for link-level slots. Environments and scenarios differ,
so this makes the limit tunable by users, either to save link group
resources or to reduce contention and increase performance.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 include/net/netns/smc.h |  1 +
 net/smc/af_smc.c        |  1 +
 net/smc/smc_core.c      |  2 +-
 net/smc/smc_sysctl.c    | 11 +++++++++++
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index f948235e3156..4f55d2876d19 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -17,5 +17,6 @@ struct netns_smc {
 #endif
 	int				sysctl_wmem_default;
 	int				sysctl_rmem_default;
+	int				sysctl_max_lgr_conns;
 };
 #endif
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 0650b5971e0a..f38e24cbb4a7 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2937,6 +2937,7 @@ static __net_init int smc_net_init(struct net *net)
 					   SMC_BUF_MIN_SIZE);
 	net->smc.sysctl_rmem_default = max(net->ipv4.sysctl_tcp_rmem[1],
 					   SMC_BUF_MIN_SIZE);
+	net->smc.sysctl_max_lgr_conns = SMC_RMBS_PER_LGR_MAX;
 
 	return smc_pnet_net_init(net);
 }
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 8935ef4811b0..b6e70dd0688d 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1817,7 +1817,7 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		    (ini->smcd_version == SMC_V2 ||
 		     lgr->vlan_id == ini->vlan_id) &&
 		    (role == SMC_CLNT || ini->is_smcd ||
-		     lgr->conns_num < SMC_RMBS_PER_LGR_MAX)) {
+		     lgr->conns_num < net->smc.sysctl_max_lgr_conns)) {
 			/* link group found */
 			ini->first_contact_local = 0;
 			conn->lgr = lgr;
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 6706fe1bd888..5ffcf6008c20 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -10,6 +10,8 @@
 
 static int min_sndbuf = SMC_BUF_MIN_SIZE;
 static int min_rcvbuf = SMC_BUF_MIN_SIZE;
+static int min_lgr_conns = 1;
+static int max_lgr_conns = SMC_RMBS_PER_LGR_MAX;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -28,6 +30,15 @@ static struct ctl_table smc_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &min_rcvbuf,
 	},
+	{
+		.procname	= "max_lgr_conns",
+		.data		= &init_net.smc.sysctl_max_lgr_conns,
+		.maxlen		= sizeof(init_net.smc.sysctl_max_lgr_conns),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_lgr_conns,
+		.extra2		= &max_lgr_conns,
+	},
 	{  }
 };
 
-- 
2.32.0.3.g01195cf9f



* Re: [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable
  2022-01-14  5:48 ` [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable Tony Lu
@ 2022-01-14  9:13   ` kernel test robot
  2022-01-14  9:43   ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2022-01-14  9:13 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2101 bytes --]

Hi Tony,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Tony-Lu/net-smc-Spread-workload-over-multiple-cores/20220114-134957
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 8aaaf2f3af2ae212428f4db1af34214225f5cec3
config: powerpc-allyesconfig (https://download.01.org/0day-ci/archive/20220114/202201141728.f7G5VC95-lkp(a)intel.com/config)
compiler: powerpc-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/5c1f27f3fde941bdde57fcae8606932f245b6759
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Tony-Lu/net-smc-Spread-workload-over-multiple-cores/20220114-134957
        git checkout 5c1f27f3fde941bdde57fcae8606932f245b6759
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=powerpc SHELL=/bin/bash net/smc/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/smc/smc_sysctl.c:73:12: warning: no previous prototype for 'smc_sysctl_init' [-Wmissing-prototypes]
      73 | int __init smc_sysctl_init(void)
         |            ^~~~~~~~~~~~~~~
>> net/smc/smc_sysctl.c:78:6: warning: no previous prototype for 'smc_sysctl_exit' [-Wmissing-prototypes]
      78 | void smc_sysctl_exit(void)
         |      ^~~~~~~~~~~~~~~


vim +/smc_sysctl_init +73 net/smc/smc_sysctl.c

    72	
  > 73	int __init smc_sysctl_init(void)
    74	{
    75		return register_pernet_subsys(&smc_sysctl_ops);
    76	}
    77	
  > 78	void smc_sysctl_exit(void)

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org


* Re: [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable
  2022-01-14  5:48 ` [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable Tony Lu
  2022-01-14  9:13   ` kernel test robot
@ 2022-01-14  9:43   ` kernel test robot
  1 sibling, 0 replies; 24+ messages in thread
From: kernel test robot @ 2022-01-14  9:43 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

Hi Tony,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Tony-Lu/net-smc-Spread-workload-over-multiple-cores/20220114-134957
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 8aaaf2f3af2ae212428f4db1af34214225f5cec3
config: x86_64-randconfig-a015 (https://download.01.org/0day-ci/archive/20220114/202201141735.QH2hx4ss-lkp(a)intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/5c1f27f3fde941bdde57fcae8606932f245b6759
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Tony-Lu/net-smc-Spread-workload-over-multiple-cores/20220114-134957
        git checkout 5c1f27f3fde941bdde57fcae8606932f245b6759
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> WARNING: modpost: vmlinux.o(.text+0x12b0b96): Section mismatch in reference from the function smc_sysctl_exit() to the variable .init.data:smc_sysctl_ops
The function smc_sysctl_exit() references
the variable __initdata smc_sysctl_ops.
This is often because smc_sysctl_exit lacks a __initdata
annotation or the annotation of smc_sysctl_ops is wrong.
--
>> net/smc/smc_sysctl.c:73:12: warning: no previous prototype for 'smc_sysctl_init' [-Wmissing-prototypes]
73 | int __init smc_sysctl_init(void)
|            ^~~~~~~~~~~~~~~
>> net/smc/smc_sysctl.c:78:6: warning: no previous prototype for 'smc_sysctl_exit' [-Wmissing-prototypes]
78 | void smc_sysctl_exit(void)
|      ^~~~~~~~~~~~~~~

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
                   ` (5 preceding siblings ...)
  2022-01-14  5:48 ` [RFC PATCH net-next 6/6] net/smc: Introduce tunable linkgroup max connections Tony Lu
@ 2022-01-16  9:00 ` Leon Romanovsky
  2022-01-16 17:47   ` Tony Lu
  2022-01-26  7:23   ` Tony Lu
  6 siblings, 2 replies; 24+ messages in thread
From: Leon Romanovsky @ 2022-01-16  9:00 UTC (permalink / raw)
  To: Tony Lu; +Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Fri, Jan 14, 2022 at 01:48:46PM +0800, Tony Lu wrote:
> Currently, SMC creates one CQ per IB device, and shares this cq among
> all the QPs of links. Meanwhile, this CQ is always binded to the first
> completion vector, the IRQ affinity of this vector binds to some CPU
> core. 
> 
> ┌────────┐    ┌──────────────┐   ┌──────────────┐
> │ SMC IB │    ├────┐         │   │              │
> │ DEVICE │ ┌─▶│ QP │ SMC LINK├──▶│SMC Link Group│
> │   ┌────┤ │  ├────┘         │   │              │
> │   │ CQ ├─┘  └──────────────┘   └──────────────┘
> │   │    ├─┐  ┌──────────────┐   ┌──────────────┐
> │   └────┤ │  ├────┐         │   │              │
> │        │ └─▶│ QP │ SMC LINK├──▶│SMC Link Group│
> │        │    ├────┘         │   │              │
> └────────┘    └──────────────┘   └──────────────┘
> 
> In this model, when connections execeeds SMC_RMBS_PER_LGR_MAX, it will
> create multiple link groups and corresponding QPs. All the connections
> share limited QPs and one CQ (both recv and send sides). Generally, one
> completion vector binds to a fixed CPU core, it will limit the
> performance by single core, and large-scale scenes, such as multiple
> threads and lots of connections.
> 
> Running nginx and wrk test with 8 threads and 800 connections on 8 cores
> host, the softirq of CPU 0 is limited the scalability:
> 
> 04:18:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 04:18:55 PM  all    5.81    0.00   19.42    0.00    2.94   10.21    0.00    0.00    0.00   61.63
> 04:18:55 PM    0    0.00    0.00    0.00    0.00   16.80   82.78    0.00    0.00    0.00    0.41
> <snip>
> 
> Nowadays, RDMA devices have more than one completion vectors, such as
> mlx5 has 8, eRDMA has 4 completion vector by default. This unlocks the
> limitation of single vector and single CPU core.
> 
> To enhance scalability and take advantage of multi-core resources, we
> can spread CQs to different CPU cores, and introduce more flexible
> mapping. Here comes up a new model, the main different is that creating
> multiple CQs per IB device, which the max number of CQs is limited by
> ibdev's ability (num_comp_vectors). In the scenen of multiple linkgroups,
> the link group's QP can bind to the least used CQ, and CQs are binded
> to different completion vector and CPU cores. So that we can spread
> the softirq (tasklet of wr tx/rx) handler to different cores.
> 
>                         ┌──────────────┐   ┌──────────────┐
> ┌────────┐  ┌───────┐   ├────┐         │   │              │
> │        ├─▶│ CQ 0  ├──▶│ QP │ SMC LINK├──▶│SMC Link Group│
> │        │  └───────┘   ├────┘         │   │              │
> │ SMC IB │  ┌───────┐   └──────────────┘   └──────────────┘
> │ DEVICE ├─▶│ CQ 1  │─┐                                    
> │        │  └───────┘ │ ┌──────────────┐   ┌──────────────┐
> │        │  ┌───────┐ │ ├────┐         │   │              │
> │        ├─▶│ CQ n  │ └▶│ QP │ SMC LINK├──▶│SMC Link Group│
> └────────┘  └───────┘   ├────┘         │   │              │
>                         └──────────────┘   └──────────────┘
> 
> After sperad one CQ (4 linkgroups) to four CPU cores, the softirq load
> spreads to different cores:
> 
> 04:26:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 04:26:26 PM  all   10.70    0.00   35.80    0.00    7.64   26.62    0.00    0.00    0.00   19.24
> 04:26:26 PM    0    0.00    0.00    0.00    0.00   16.33   50.00    0.00    0.00    0.00   33.67
> 04:26:26 PM    1    0.00    0.00    0.00    0.00   15.46   69.07    0.00    0.00    0.00   15.46
> 04:26:26 PM    2    0.00    0.00    0.00    0.00   13.13   39.39    0.00    0.00    0.00   47.47
> 04:26:26 PM    3    0.00    0.00    0.00    0.00   13.27   55.10    0.00    0.00    0.00   31.63
> <snip>
> 
> Here is the benchmark with this patch (prototype of new model):
> 
> Test environment:
> - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4.
> - nginx + wrk HTTP benchmark.
> - nginx: disable access_log, increase keepalive_timeout and
>   keepalive_requests, long-live connection.
> - wrk: 8 threads and 100, 200, 400 connections.
> 
> Benchmark result:
> 
> Conns/QPS         100        200        400
> w/o patch   338502.49  359216.66  398167.16
> w/  patch   677247.40  694193.70  812502.69
> Ratio        +100.07%    +93.25%   +104.06%
> 
> This prototype patches show nealy 1x increasement of QPS.
> 
> The benchmarkes of 100, 200, 400 connections use 1, 1, 2 link groups.
> When link group is one, it spreads send/recv to two cores. Once more
> than one link groups, it would spread to more cores.
> 
> If the application's connections is no more than link group's limitation
> (SMC_RMBS_PER_LGR_MAX, 255), and CPU resources is restricted. This patch
> introduces a tunable way to reduce the hard limitation of link group
> connections number. It reduces the restriction of less CQs (cores) and
> less competition, such as link-level CDC slots. It depends on the scenes
> of applications, so this patch provides a userspace knob, users can
> choose to share link groups for saving resources, or create more link
> groups for less limitation.
> 
> Patch 1-4 introduce multiple CQs support.
> - Patch 1 spreads CQ to two cores, it works for less connections.
> - Patch 2, 3, 4 introduce multiple CQs support, involves a new medium
>   to tie link and ibdev, and load balancing between different completion
>   vectors and CQs.
> - the policy of spreading CQs is still thinking and testing to get
>   highest performance, such as splitting recv/send CQs, or joining them
>   together, or bind recv/recv (send/send) CQ to same vector and so on.
>   Glad to hear your advice.
> 
> Patch 5 is a medium for userspace control knob.
> - mainly provide two knobs to adjust the buffer size of smc socket. We
>   found that too little buffers would let smc wait for buffer for a long
>   time, and limit the performance.
> - introduce a sysctl framework, just for tuning, netlink also does work.
>   Because sysctl is easy to compose as patch and no need userspace example.
>   I am glad to wait for your advice about the control panel for
>   userspace.
> 
> Patch 6 introduces a tunable knob to decrease the per link group
> connections' number, which would increase parallel performance as
> mentioned previous.
> 
> These patches are still improving, I am very glad to hear your advice.

Please CC RDMA mailing list next time.

Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
logic.
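
For reference, that pool interface has roughly the following shape; the
example_* helpers below are only a minimal sketch of a caller, not code
from this series or from drivers/infiniband:

#include <linux/smp.h>
#include <rdma/ib_verbs.h>

/* Take a CQ from the per-device pool; returns an ERR_PTR() on failure. */
static struct ib_cq *example_get_cq(struct ib_device *ibdev,
				    unsigned int nr_cqe)
{
	/*
	 * The pool creates (or reuses) CQs bound to completion vectors
	 * derived from the hint, so callers get spread over the
	 * available vectors without any open-coded least-used tracking.
	 * IB_POLL_SOFTIRQ keeps completion handling in softirq context.
	 */
	return ib_cq_pool_get(ibdev, nr_cqe, raw_smp_processor_id(),
			      IB_POLL_SOFTIRQ);
}

/* Return the CQEs to the pool when the QP that used them goes away. */
static void example_put_cq(struct ib_cq *cq, unsigned int nr_cqe)
{
	ib_cq_pool_put(cq, nr_cqe);
}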

Thanks


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-16  9:00 ` [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Leon Romanovsky
@ 2022-01-16 17:47   ` Tony Lu
  2022-01-26  7:23   ` Tony Lu
  1 sibling, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-01-16 17:47 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> On Fri, Jan 14, 2022 at 01:48:46PM +0800, Tony Lu wrote:
> > <snip>
> > 
> > These patches are still improving, I am very glad to hear your advice.
> 
> Please CC RDMA mailing list next time.

I will do it in the next patch.
 
> Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> logic.

Thanks for your advice. I have looked into this shared CQ pool API.
After a brief test, it should suit this scenario. I will replace the
least-used-CQ logic with the CQ pool API in the next revision.
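
For context, the open-coded spreading being replaced looks roughly like
the sketch below, based on the cover letter's description; the
smc_ib_cq_example wrapper, its load counter and the helper names are
illustrative assumptions, not the actual patch code:

#include <linux/err.h>
#include <rdma/ib_verbs.h>
#include "smc_ib.h"	/* assumed to provide struct smc_ib_device, SMC_MAX_CQE */
#include "smc_wr.h"	/* assumed to provide smc_wr_tx_cq_handler() */

struct smc_ib_cq_example {		/* hypothetical wrapper around one CQ */
	struct ib_cq	*ib_cq;
	u32		load;		/* links currently bound to this CQ */
};

/* Pick the CQ with the fewest users, so new links spread over vectors. */
static struct smc_ib_cq_example *
smc_ib_least_used_cq(struct smc_ib_cq_example *cqs, int num_cqs)
{
	struct smc_ib_cq_example *least = &cqs[0];
	int i;

	for (i = 1; i < num_cqs; i++)
		if (cqs[i].load < least->load)
			least = &cqs[i];
	least->load++;
	return least;
}

/* One CQ per completion vector; error unwinding and the recv side omitted. */
static int smc_ib_create_vector_cqs(struct smc_ib_device *smcibdev,
				    struct smc_ib_cq_example *cqs)
{
	struct ib_cq_init_attr cqattr = { .cqe = SMC_MAX_CQE };
	int i, num = smcibdev->ibdev->num_comp_vectors;

	for (i = 0; i < num; i++) {
		cqattr.comp_vector = i;
		cqs[i].ib_cq = ib_create_cq(smcibdev->ibdev,
					    smc_wr_tx_cq_handler, NULL,
					    smcibdev, &cqattr);
		if (IS_ERR(cqs[i].ib_cq))
			return PTR_ERR(cqs[i].ib_cq);
	}
	return 0;
}

The pool API makes this bookkeeping unnecessary, since it already
distributes its CQs over the device's completion vectors.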

Thank you.
Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-16  9:00 ` [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Leon Romanovsky
  2022-01-16 17:47   ` Tony Lu
@ 2022-01-26  7:23   ` Tony Lu
  2022-01-26 15:28     ` Jason Gunthorpe
  1 sibling, 1 reply; 24+ messages in thread
From: Tony Lu @ 2022-01-26  7:23 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> 
> Please CC RDMA mailing list next time.
> 
> Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> logic.

I am working on replacing the CQ allocation with ib_cq_pool_get(). It
requires an ib_poll_context to select the poller provided by
ib_poll_handler(). That part is fine, but the completion callback is the
problem: when the poller gets an ib_wc, it calls wc->wr_cqe->done(cq, wc),
and wr_cqe is in a union with wr_id, which is heavily used in SMC.
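
To make the conflict concrete, the wr_cqe style that ib_poll_handler()
expects looks roughly like the sketch below; the smc_*_example names are
hypothetical, not code from the series or from today's smc:

#include <rdma/ib_verbs.h>

/* Hypothetical per-WR context; smc today keys completions off wr_id. */
struct smc_wr_tx_pend_example {
	struct ib_cqe	cqe;	/* wr_cqe shares a union with wr_id */
	/* ... per-request private state ... */
};

/* The shared poller calls wc->wr_cqe->done(cq, wc) for each completion. */
static void smc_wr_tx_done_example(struct ib_cq *cq, struct ib_wc *wc)
{
	struct smc_wr_tx_pend_example *pend =
		container_of(wc->wr_cqe, struct smc_wr_tx_pend_example, cqe);

	(void)pend;	/* handle exactly this request, no wr_id lookup */
}

static int smc_wr_tx_post_example(struct ib_qp *qp, struct ib_send_wr *wr,
				  struct smc_wr_tx_pend_example *pend)
{
	pend->cqe.done = smc_wr_tx_done_example;
	wr->wr_cqe = &pend->cqe;	/* instead of setting wr->wr_id */
	return ib_post_send(qp, wr, NULL);
}

Converting smc would mean embedding an ib_cqe in each pending-WR
structure like this, instead of encoding an index into wr_id.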

In this patch set, I am not going to change any logic outside of the CQ
allocation, so I have to use the original interface to allocate the CQs
this time.

I am glad to hear your advice in case I have missed something or
misunderstood.

Thanks,
Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-26  7:23   ` Tony Lu
@ 2022-01-26 15:28     ` Jason Gunthorpe
  2022-01-27  3:14       ` Tony Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-01-26 15:28 UTC (permalink / raw)
  To: Tony Lu
  Cc: Leon Romanovsky, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Wed, Jan 26, 2022 at 03:23:22PM +0800, Tony Lu wrote:
> On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> > 
> > Please CC RDMA mailing list next time.
> > 
> > Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> > ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> > logic.
> 
> I am working on replacing with ib_cq_pool_get(), this need ib_poll_context
> to indicate the poller which provides by ib_poll_handler(). It's okay
> for now, but for the callback function. When it polled a ib_wc, it
> would call wc->wr_cqe->done(cq, wc), which is the union with wr_id. The
> wr_id is heavily used in SMC.

Part of using the new interface is converting to use wr_cqe; you
should just do that work instead of trying to duplicate a core API in
a driver.

Jason


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-26 15:28     ` Jason Gunthorpe
@ 2022-01-27  3:14       ` Tony Lu
  2022-01-27  6:21         ` Leon Romanovsky
  0 siblings, 1 reply; 24+ messages in thread
From: Tony Lu @ 2022-01-27  3:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Wed, Jan 26, 2022 at 11:28:06AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 26, 2022 at 03:23:22PM +0800, Tony Lu wrote:
> > On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> > > 
> > > Please CC RDMA mailing list next time.
> > > 
> > > Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> > > ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> > > logic.
> > 
> > I am working on replacing with ib_cq_pool_get(), this need ib_poll_context
> > to indicate the poller which provides by ib_poll_handler(). It's okay
> > for now, but for the callback function. When it polled a ib_wc, it
> > would call wc->wr_cqe->done(cq, wc), which is the union with wr_id. The
> > wr_id is heavily used in SMC.
> 
> Part of using the new interface is converting to use wr_cqe, you
> should just do that work instead of trying to duplicate a core API in
> a driver.

Thanks for your advice. This patch set aims to improve performance with
the current API in the SMC protocol, which is more urgent. I will do the
conversion to the new API in a separate follow-up patch set. That work
may require a lot of revisions, so I will send those patches after a
full discussion with Karsten and after the solution is finalized.

Thanks,
Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  3:14       ` Tony Lu
@ 2022-01-27  6:21         ` Leon Romanovsky
  2022-01-27  7:59           ` Tony Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Romanovsky @ 2022-01-27  6:21 UTC (permalink / raw)
  To: Tony Lu
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Thu, Jan 27, 2022 at 11:14:37AM +0800, Tony Lu wrote:
> On Wed, Jan 26, 2022 at 11:28:06AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 26, 2022 at 03:23:22PM +0800, Tony Lu wrote:
> > > On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> > > > 
> > > > Please CC RDMA mailing list next time.
> > > > 
> > > > Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> > > > ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> > > > logic.
> > > 
> > > I am working on replacing with ib_cq_pool_get(), this need ib_poll_context
> > > to indicate the poller which provides by ib_poll_handler(). It's okay
> > > for now, but for the callback function. When it polled a ib_wc, it
> > > would call wc->wr_cqe->done(cq, wc), which is the union with wr_id. The
> > > wr_id is heavily used in SMC.
> > 
> > Part of using the new interface is converting to use wr_cqe, you
> > should just do that work instead of trying to duplicate a core API in
> > a driver.
> 
> Thanks for your advice. This patch set aims to improve performance with
> current API in SMC protocol, which is more urgent. 

This code has existed since 2017, so it is hard to agree with the
"urgent" claim.

Thanks


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  6:21         ` Leon Romanovsky
@ 2022-01-27  7:59           ` Tony Lu
  2022-01-27  8:47             ` Leon Romanovsky
  0 siblings, 1 reply; 24+ messages in thread
From: Tony Lu @ 2022-01-27  7:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Thu, Jan 27, 2022 at 08:21:07AM +0200, Leon Romanovsky wrote:
> On Thu, Jan 27, 2022 at 11:14:37AM +0800, Tony Lu wrote:
> > On Wed, Jan 26, 2022 at 11:28:06AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 26, 2022 at 03:23:22PM +0800, Tony Lu wrote:
> > > > On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> > > > > 
> > > > > Please CC RDMA mailing list next time.
> > > > > 
> > > > > Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> > > > > ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> > > > > logic.
> > > > 
> > > > I am working on replacing with ib_cq_pool_get(), this need ib_poll_context
> > > > to indicate the poller which provides by ib_poll_handler(). It's okay
> > > > for now, but for the callback function. When it polled a ib_wc, it
> > > > would call wc->wr_cqe->done(cq, wc), which is the union with wr_id. The
> > > > wr_id is heavily used in SMC.
> > > 
> > > Part of using the new interface is converting to use wr_cqe, you
> > > should just do that work instead of trying to duplicate a core API in
> > > a driver.
> > 
> > Thanks for your advice. This patch set aims to improve performance with
> > current API in SMC protocol, which is more urgent. 
> 
> This code existed from 2017, it is hard to agree with "urgent" claim.

Yes, I agree with you that the code is old. I think there are two
separate problems here: the performance issue and the API refactor.

We are running into the performance issues mentioned in these patches in
our cloud environment, so I think a real-world issue is the more urgent
one.

The current modification is less intrusive to the code, which keeps the
changes simpler, and the current implementation works for now. This is
why I put the refactor behind the fix.

Thank you,
Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  7:59           ` Tony Lu
@ 2022-01-27  8:47             ` Leon Romanovsky
  2022-01-27  9:14               ` Tony Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Romanovsky @ 2022-01-27  8:47 UTC (permalink / raw)
  To: Tony Lu
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
> On Thu, Jan 27, 2022 at 08:21:07AM +0200, Leon Romanovsky wrote:
> > On Thu, Jan 27, 2022 at 11:14:37AM +0800, Tony Lu wrote:
> > > On Wed, Jan 26, 2022 at 11:28:06AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Jan 26, 2022 at 03:23:22PM +0800, Tony Lu wrote:
> > > > > On Sun, Jan 16, 2022 at 11:00:33AM +0200, Leon Romanovsky wrote:
> > > > > > 
> > > > > > Please CC RDMA mailing list next time.
> > > > > > 
> > > > > > Why didn't you use already existed APIs in drivers/infiniband/core/cq.c?
> > > > > > ib_cq_pool_get() will do most if not all of your open-coded CQ spreading
> > > > > > logic.
> > > > > 
> > > > > I am working on replacing with ib_cq_pool_get(), this need ib_poll_context
> > > > > to indicate the poller which provides by ib_poll_handler(). It's okay
> > > > > for now, but for the callback function. When it polled a ib_wc, it
> > > > > would call wc->wr_cqe->done(cq, wc), which is the union with wr_id. The
> > > > > wr_id is heavily used in SMC.
> > > > 
> > > > Part of using the new interface is converting to use wr_cqe, you
> > > > should just do that work instead of trying to duplicate a core API in
> > > > a driver.
> > > 
> > > Thanks for your advice. This patch set aims to improve performance with
> > > current API in SMC protocol, which is more urgent. 
> > 
> > This code existed from 2017, it is hard to agree with "urgent" claim.
> 
> Yes, I agree with you that the code is old. I think there are two
> problems, one for performance issue, the other one for API refactor.
> 
> We are running into the performance issues mentioned in patches in our
> cloud environment. So I think it is more urgent for a real world issue.
> 
> The current modification is less intrusive to the code. This makes
> changes simpler. And current implement works for now, this is why I put
> refactor behind.

We are not requesting a refactor of the code, but proper use of the
existing in-kernel API while implementing a new feature ("Spread
workload over multiple cores").

Thanks

> 
> Thank you,
> Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  8:47             ` Leon Romanovsky
@ 2022-01-27  9:14               ` Tony Lu
  2022-01-27  9:25                 ` Leon Romanovsky
  0 siblings, 1 reply; 24+ messages in thread
From: Tony Lu @ 2022-01-27  9:14 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
> On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
> > 
> > Yes, I agree with you that the code is old. I think there are two
> > problems, one for performance issue, the other one for API refactor.
> > 
> > We are running into the performance issues mentioned in patches in our
> > cloud environment. So I think it is more urgent for a real world issue.
> > 
> > The current modification is less intrusive to the code. This makes
> > changes simpler. And current implement works for now, this is why I put
> > refactor behind.
> 
> We are not requesting to refactor the code, but properly use existing
> in-kernel API, while implementing new feature ("Spread workload over
> multiple cores").

Sorry if I missed something about properly using the existing in-kernel
API. Is the proper API here ib_cq_pool_get() and ib_cq_pool_put()?

If so, these APIs don't suit smc's current usage; I would have to
refactor the tasklet and wr_id logic in smc. I think that is a huge
amount of work and should only be done after a full discussion.

Thanks,
Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  9:14               ` Tony Lu
@ 2022-01-27  9:25                 ` Leon Romanovsky
  2022-01-27  9:50                   ` Tony Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Leon Romanovsky @ 2022-01-27  9:25 UTC (permalink / raw)
  To: Tony Lu
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Thu, Jan 27, 2022 at 05:14:35PM +0800, Tony Lu wrote:
> On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
> > On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
> > > 
> > > Yes, I agree with you that the code is old. I think there are two
> > > problems, one for performance issue, the other one for API refactor.
> > > 
> > > We are running into the performance issues mentioned in patches in our
> > > cloud environment. So I think it is more urgent for a real world issue.
> > > 
> > > The current modification is less intrusive to the code. This makes
> > > changes simpler. And current implement works for now, this is why I put
> > > refactor behind.
> > 
> > We are not requesting to refactor the code, but properly use existing
> > in-kernel API, while implementing new feature ("Spread workload over
> > multiple cores").
> 
> Sorry for that if I missed something about properly using existing
> in-kernel API. I am not sure the proper API is to use ib_cq_pool_get()
> and ib_cq_pool_put()?
> 
> If so, these APIs doesn't suit for current smc's usage, I have to
> refactor logic (tasklet and wr_id) in smc. I think it is a huge work
> and should do it with full discussion.

This discussion is not going anywhere. Just to summarize: we (Jason and
I) have been asking, from the beginning, that you use the existing API.

You can try to convince the netdev maintainers to merge the code despite
our request.

Thanks

> 
> Thanks,
> Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  9:25                 ` Leon Romanovsky
@ 2022-01-27  9:50                   ` Tony Lu
  2022-01-27 14:52                     ` Karsten Graul
  0 siblings, 1 reply; 24+ messages in thread
From: Tony Lu @ 2022-01-27  9:50 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, kgraul, kuba, davem, netdev, linux-s390,
	RDMA mailing list

On Thu, Jan 27, 2022 at 11:25:41AM +0200, Leon Romanovsky wrote:
> On Thu, Jan 27, 2022 at 05:14:35PM +0800, Tony Lu wrote:
> > On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
> > > On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
> > 
> > Sorry for that if I missed something about properly using existing
> > in-kernel API. I am not sure the proper API is to use ib_cq_pool_get()
> > and ib_cq_pool_put()?
> > 
> > If so, these APIs doesn't suit for current smc's usage, I have to
> > refactor logic (tasklet and wr_id) in smc. I think it is a huge work
> > and should do it with full discussion.
> 
> This discussion is not going anywhere. Just to summarize, we (Jason and I)
> are asking to use existing API, from the beginning.

Yes, I can't agree more with you about using the existing API, and I
tried it earlier. The existing APIs would be easy to use if I were
writing new logic, and I don't want to duplicate code either.

The main obstacle is that smc's packet and wr processing is tightly
bound to the old API and is not easy to convert to the existing API.

To solve a real issue, I have to fix it based on the old API. Using the
existing API in this patch set would mean refactoring smc's logic, which
needs more time. Our production tree is synced with smc next, so I chose
to fix this issue first and then refactor the logic to fit the existing
API once and for all.
 
> You can try and convince netdev maintainers to merge the code despite
> our request.

My purpose is not to recklessly merge this patch. I appreciate you
pointing out that the existing APIs are available. So I listened to
everyone and decided to go with a compromise: fix it first, then
refactor.

Thanks for your advice.

Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27  9:50                   ` Tony Lu
@ 2022-01-27 14:52                     ` Karsten Graul
  2022-01-28  6:55                       ` Tony Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Karsten Graul @ 2022-01-27 14:52 UTC (permalink / raw)
  To: Tony Lu, Leon Romanovsky
  Cc: Jason Gunthorpe, kuba, davem, netdev, linux-s390, RDMA mailing list

On 27/01/2022 10:50, Tony Lu wrote:
> On Thu, Jan 27, 2022 at 11:25:41AM +0200, Leon Romanovsky wrote:
>> On Thu, Jan 27, 2022 at 05:14:35PM +0800, Tony Lu wrote:
>>> On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
>>>> On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
>>>
>>> Sorry for that if I missed something about properly using existing
>>> in-kernel API. I am not sure the proper API is to use ib_cq_pool_get()
>>> and ib_cq_pool_put()?
>>>
>>> If so, these APIs doesn't suit for current smc's usage, I have to
>>> refactor logic (tasklet and wr_id) in smc. I think it is a huge work
>>> and should do it with full discussion.
>>
>> This discussion is not going anywhere. Just to summarize, we (Jason and I)
>> are asking to use existing API, from the beginning.
> 
> Yes, I can't agree more with you about using existing API and I have
> tried them earlier. The existing APIs are easy to use if I wrote a new
> logic. I also don't want to repeat the codes.
> 
> The main obstacle is that the packet and wr processing of smc is
> tightly bound to the old API and not easy to replace with existing API.
> 
> To solve a real issue, I have to fix it based on the old API. If using
> existing API in this patch, I have to refactor smc logics which needs
> more time. Our production tree is synced with smc next. So I choose to
> fix this issue first, then refactor these logic to fit existing API once
> and for all.

While I understand your approach of fixing the issue first, I need to
say that such interim fixes create a significant amount of review and
test effort for others. And there is an increased risk of introducing
new bugs with such an only-for-now fix.

Given that right now you are the only one affected by this problem, I
recommend keeping your fix in your environment for now and coming back
with the final version. In the meantime I can use the saved time to
review the bunch of other patches that we have received.

Thank you!


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-27 14:52                     ` Karsten Graul
@ 2022-01-28  6:55                       ` Tony Lu
  2022-02-01 16:50                         ` Karsten Graul
  0 siblings, 1 reply; 24+ messages in thread
From: Tony Lu @ 2022-01-28  6:55 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Leon Romanovsky, Jason Gunthorpe, kuba, davem, netdev,
	linux-s390, RDMA mailing list

On Thu, Jan 27, 2022 at 03:52:36PM +0100, Karsten Graul wrote:
> On 27/01/2022 10:50, Tony Lu wrote:
> > On Thu, Jan 27, 2022 at 11:25:41AM +0200, Leon Romanovsky wrote:
> >> On Thu, Jan 27, 2022 at 05:14:35PM +0800, Tony Lu wrote:
> >>> On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
> >>>> On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
> >>>
> >>> Sorry for that if I missed something about properly using existing
> >>> in-kernel API. I am not sure the proper API is to use ib_cq_pool_get()
> >>> and ib_cq_pool_put()?
> >>>
> >>> If so, these APIs doesn't suit for current smc's usage, I have to
> >>> refactor logic (tasklet and wr_id) in smc. I think it is a huge work
> >>> and should do it with full discussion.
> >>
> >> This discussion is not going anywhere. Just to summarize, we (Jason and I)
> >> are asking to use existing API, from the beginning.
> > 
> > Yes, I can't agree more with you about using existing API and I have
> > tried them earlier. The existing APIs are easy to use if I wrote a new
> > logic. I also don't want to repeat the codes.
> > 
> > The main obstacle is that the packet and wr processing of smc is
> > tightly bound to the old API and not easy to replace with existing API.
> > 
> > To solve a real issue, I have to fix it based on the old API. If using
> > existing API in this patch, I have to refactor smc logics which needs
> > more time. Our production tree is synced with smc next. So I choose to
> > fix this issue first, then refactor these logic to fit existing API once
> > and for all.
> 
> While I understand your approach to fix the issue first I need to say
> that such interim fixes create an significant amount of effort that has to
> be spent for review and test for others. And there is the increased risk 
> to introduce new bugs by just this only-for-now fix.

Let's get back to this patch set itself. This approach spreads the CQs
over different completion vectors; it tries to solve the issue within
the current design and avoids introducing more changes, so that it stays
easier to review and test. The single-CQ design severely limits the
performance of SMC when it replaces TCP, and this patch set tries to
reduce that gap between SMC and TCP.

Using the newer API would require a lot of work on the wr processing
logic, for example removing the tasklet handler and refactoring the
wr_id logic. I am not sure whether we should do this. If it is okay and
has your agreement, I will do it in the next patch set.

> Given the fact that right now you are the only one who is affected by this problem
> I recommend to keep your fix in your environment for now, and come back with the
> final version. In the meantime I can use the saved time to review the bunch 
> of other patches that we received.

I really appreciate the time you spent reviewing our patches. Recently,
our team has submitted a lot of patches and received your detailed
suggestions, covering panics (link group, CDC), performance and so on.
We are using SMC in our public cloud environment, so we maintain an
internal tree and try to contribute these changes upstream, and we will
continue to invest in improving stability, performance and
compatibility, focusing on SMC for the long term.

We are willing to commit time and resources to help out with reviewing
and testing patches on the mailing list and in -next, as reviewers or
testers.

We have built up CI/CD and nightly tests for SMC, and we intend to send
test reports for each patch on the mailing list, to help with review and
to catch panics and performance regressions.

Would this proposal help save your time for reviewing other patches?
I am glad to hear your advice.

Thank you,
Tony Lu


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-01-28  6:55                       ` Tony Lu
@ 2022-02-01 16:50                         ` Karsten Graul
  2022-02-09  9:49                           ` Tony Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Karsten Graul @ 2022-02-01 16:50 UTC (permalink / raw)
  To: Tony Lu
  Cc: Leon Romanovsky, Jason Gunthorpe, kuba, davem, netdev,
	linux-s390, RDMA mailing list

On 28/01/2022 07:55, Tony Lu wrote:
> On Thu, Jan 27, 2022 at 03:52:36PM +0100, Karsten Graul wrote:
>> On 27/01/2022 10:50, Tony Lu wrote:
>>> On Thu, Jan 27, 2022 at 11:25:41AM +0200, Leon Romanovsky wrote:
>>>> On Thu, Jan 27, 2022 at 05:14:35PM +0800, Tony Lu wrote:
>>>>> On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
>>>>>> On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
>>>>>
>>>>> Sorry for that if I missed something about properly using existing
>>>>> in-kernel API. I am not sure the proper API is to use ib_cq_pool_get()
>>>>> and ib_cq_pool_put()?
>>>>>
>>>>> If so, these APIs doesn't suit for current smc's usage, I have to
>>>>> refactor logic (tasklet and wr_id) in smc. I think it is a huge work
>>>>> and should do it with full discussion.
>>>>
>>>> This discussion is not going anywhere. Just to summarize, we (Jason and I)
>>>> are asking to use existing API, from the beginning.
>>>
>>> Yes, I can't agree more with you about using existing API and I have
>>> tried them earlier. The existing APIs are easy to use if I wrote a new
>>> logic. I also don't want to repeat the codes.
>>>
>>> The main obstacle is that the packet and wr processing of smc is
>>> tightly bound to the old API and not easy to replace with existing API.
>>>
>>> To solve a real issue, I have to fix it based on the old API. If using
>>> existing API in this patch, I have to refactor smc logics which needs
>>> more time. Our production tree is synced with smc next. So I choose to
>>> fix this issue first, then refactor these logic to fit existing API once
>>> and for all.
>>
>> While I understand your approach to fix the issue first I need to say
>> that such interim fixes create an significant amount of effort that has to
>> be spent for review and test for others. And there is the increased risk 
>> to introduce new bugs by just this only-for-now fix.
> 
> Let's back to this patch itself. This approach spreads CQs to different
> vectors, it tries to solve this issue under current design and not to
> introduce more changes to make it easier to review and test. It severely
> limits the performance of SMC when replacing TCP. This patch tries to
> reduce the gap between SMC and TCP.
> 
> To use newer API, it should have a lots of work to do with wr process
> logic, for example remove tasklet handler, refactor wr_id logic. I have
> no idea if we should do this? If it's okay and got your permission, I
> will do this in the next patch.

Hi Tony,

I think there has been quite a discussion about this patch series now,
and the conclusion from the RDMA list and from my side is that if this
code is changed, it should be done using the new API. The current
version re-implements code that is already available there.

I agree that using the new API is the way to go, and I am open to any
early discussions about the changes that are needed.

>> Given the fact that right now you are the only one who is affected by this problem
>> I recommend to keep your fix in your environment for now, and come back with the
>> final version. In the meantime I can use the saved time to review the bunch 
>> of other patches that we received.
> 
> I really appreciate the time you spent reviewing our patch. Recently,
> our team has submitted a lot of patches and got your detailed
> suggestions, including panic (linkgroup, CDC), performance and so on.
> We are using SMC in our public cloud environment. Therefore, we maintain
> a internal tree and try to contribute these changes to upstream, and we
> will continue to invest to improve the stability, performance and
> compatibility, and focus on SMC for a long time.
> 
> We are willing to commit time and resource to help out in reviewing and
> testing the patch in mail list and -next, as reviewer or tester.
> 
> We have built up CI/CD and nightly test for SMC. And we intend to send
> test reports for each patch in the mail list, help to review, find out
> panic and performance regression.
> 
> Not sure if this proposal will help save your time to review other
> patches? Glad to hear your advice.

Thanks for all the time and work you are spending to bring SMC into a
cloud environment! We really appreciate it a lot.

Karsten


* Re: [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores
  2022-02-01 16:50                         ` Karsten Graul
@ 2022-02-09  9:49                           ` Tony Lu
  0 siblings, 0 replies; 24+ messages in thread
From: Tony Lu @ 2022-02-09  9:49 UTC (permalink / raw)
  To: Karsten Graul
  Cc: Leon Romanovsky, Jason Gunthorpe, kuba, davem, netdev,
	linux-s390, RDMA mailing list

On Tue, Feb 01, 2022 at 05:50:48PM +0100, Karsten Graul wrote:
> On 28/01/2022 07:55, Tony Lu wrote:
> > On Thu, Jan 27, 2022 at 03:52:36PM +0100, Karsten Graul wrote:
> >> On 27/01/2022 10:50, Tony Lu wrote:
> >>> On Thu, Jan 27, 2022 at 11:25:41AM +0200, Leon Romanovsky wrote:
> >>>> On Thu, Jan 27, 2022 at 05:14:35PM +0800, Tony Lu wrote:
> >>>>> On Thu, Jan 27, 2022 at 10:47:09AM +0200, Leon Romanovsky wrote:
> >>>>>> On Thu, Jan 27, 2022 at 03:59:36PM +0800, Tony Lu wrote:
> >>>>>
> >>>>> Sorry for that if I missed something about properly using existing
> >>>>> in-kernel API. I am not sure the proper API is to use ib_cq_pool_get()
> >>>>> and ib_cq_pool_put()?
> >>>>>
> >>>>> If so, these APIs doesn't suit for current smc's usage, I have to
> >>>>> refactor logic (tasklet and wr_id) in smc. I think it is a huge work
> >>>>> and should do it with full discussion.
> >>>>
> >>>> This discussion is not going anywhere. Just to summarize, we (Jason and I)
> >>>> are asking to use existing API, from the beginning.
> >>>
> >>> Yes, I can't agree more with you about using existing API and I have
> >>> tried them earlier. The existing APIs are easy to use if I wrote a new
> >>> logic. I also don't want to repeat the codes.
> >>>
> >>> The main obstacle is that the packet and wr processing of smc is
> >>> tightly bound to the old API and not easy to replace with existing API.
> >>>
> >>> To solve a real issue, I have to fix it based on the old API. If using
> >>> existing API in this patch, I have to refactor smc logics which needs
> >>> more time. Our production tree is synced with smc next. So I choose to
> >>> fix this issue first, then refactor these logic to fit existing API once
> >>> and for all.
> >>
> >> While I understand your approach to fix the issue first I need to say
> >> that such interim fixes create an significant amount of effort that has to
> >> be spent for review and test for others. And there is the increased risk 
> >> to introduce new bugs by just this only-for-now fix.
> > 
> > Let's back to this patch itself. This approach spreads CQs to different
> > vectors, it tries to solve this issue under current design and not to
> > introduce more changes to make it easier to review and test. It severely
> > limits the performance of SMC when replacing TCP. This patch tries to
> > reduce the gap between SMC and TCP.
> > 
> > To use newer API, it should have a lots of work to do with wr process
> > logic, for example remove tasklet handler, refactor wr_id logic. I have
> > no idea if we should do this? If it's okay and got your permission, I
> > will do this in the next patch.
> 
> Hi Tony,
> 
> I think there was quite a discussion now about this patch series and the conclusion from 
> the RDMA list and from my side was that if this code is changed it should be done using
> the new API. The current version re-implements code that is already available there.
> 
> I agree that using the new API is the way to go, and I am in for any early discussions
> about the changes that are needed.
> 

Thank you for pointing me in the right direction.

I am working on this. I will send the completely refactored version
based on the new API later.

Best regards,
Tony Lu



Thread overview: 24+ messages
2022-01-14  5:48 [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Tony Lu
2022-01-14  5:48 ` [RFC PATCH net-next 1/6] net/smc: Spread CQs to differents completion vectors Tony Lu
2022-01-14  5:48 ` [RFC PATCH net-next 2/6] net/smc: Prepare for multiple CQs per IB devices Tony Lu
2022-01-14  5:48 ` [RFC PATCH net-next 3/6] net/smc: Introduce smc_ib_cq to bind link and cq Tony Lu
2022-01-14  5:48 ` [RFC PATCH net-next 4/6] net/smc: Multiple CQs per IB devices Tony Lu
2022-01-14  5:48 ` [RFC PATCH net-next 5/6] net/smc: Unbind buffer size from clcsock and make it tunable Tony Lu
2022-01-14  9:13   ` kernel test robot
2022-01-14  9:43   ` kernel test robot
2022-01-14  5:48 ` [RFC PATCH net-next 6/6] net/smc: Introduce tunable linkgroup max connections Tony Lu
2022-01-16  9:00 ` [RFC PATCH net-next 0/6] net/smc: Spread workload over multiple cores Leon Romanovsky
2022-01-16 17:47   ` Tony Lu
2022-01-26  7:23   ` Tony Lu
2022-01-26 15:28     ` Jason Gunthorpe
2022-01-27  3:14       ` Tony Lu
2022-01-27  6:21         ` Leon Romanovsky
2022-01-27  7:59           ` Tony Lu
2022-01-27  8:47             ` Leon Romanovsky
2022-01-27  9:14               ` Tony Lu
2022-01-27  9:25                 ` Leon Romanovsky
2022-01-27  9:50                   ` Tony Lu
2022-01-27 14:52                     ` Karsten Graul
2022-01-28  6:55                       ` Tony Lu
2022-02-01 16:50                         ` Karsten Graul
2022-02-09  9:49                           ` Tony Lu
