* [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
@ 2022-08-10 17:47 D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

This patch set attempts to optimize the parallelism of SMC-R connections,
mainly to reduce unnecessary blocking on locks, and to fix exceptions that
occur after those optimizations.

According to the off-CPU graph, the SMC workers' off-CPU time breaks down as follows:

smc_close_passive_work			(1.09%)
	smcr_buf_unuse			(1.08%)
		smc_llc_flow_initiate	(1.02%)
	
smc_listen_work 			(48.17%)
	__mutex_lock.isra.11 		(47.96%)


An ideal SMC-R connection process should block only on the IO events
of the network, but it is quite clear that SMC-R connections are
currently queued on the locks most of the time.

The goal of this patch set is to reach that ideal situation, where
connections block on network IO events for the majority of their lifetime.

There are three big locks here:

1. smc_client_lgr_pending & smc_server_lgr_pending

2. llc_conf_mutex

3. rmbs_lock & sndbufs_lock

And an implementation issue:

1. confirm/delete rkey messages cannot be sent concurrently, although
the protocol does allow it.

Unfortunately, the above problems together limit the parallelism of
SMC-R connections. If any one of them is left unsolved, our goal cannot
be achieved.

After this patch set, we get a fairly ideal off-CPU graph, as
follows:

smc_close_passive_work					(41.58%)
	smcr_buf_unuse					(41.57%)
		smc_llc_do_delete_rkey			(41.57%)

smc_listen_work						(39.10%)
	smc_clc_wait_msg				(13.18%)
		tcp_recvmsg_locked			(13.18%)
	smc_listen_find_device				(25.87%)
		smcr_lgr_reg_rmbs			(25.87%)
			smc_llc_do_confirm_rkey		(25.87%)

We can see that most of the waiting time is now spent waiting for
network IO events. This also brings a measurable performance improvement
in our short-lived connection wrk/nginx benchmark:

+--------------+------+------+-------+--------+------+--------+
|conns/qps     |c4    | c8   |  c16  |  c32   | c64  |  c200  |
+--------------+------+------+-------+--------+------+--------+
|SMC-R before  |9.7k  | 10k  |  10k  |  9.9k  | 9.1k |  8.9k  | 
+--------------+------+------+-------+--------+------+--------+
|SMC-R now     |13k   | 19k  |  18k  |  16k   | 15k  |  12k   |
+--------------+------+------+-------+--------+------+--------+
|TCP           |15k   | 35k  |  51k  |  80k   | 100k |  162k  |
+--------------+------+------+-------+--------+------+--------+

The reason the benefit is less obvious once the number of connections
grows is the workqueue. If we change the workqueue to WQ_UNBOUND, we can
obtain at least a 4-5x performance improvement and reach up to half of
TCP. However, this is not an elegant solution, and optimizing it properly
will be much more complicated. In any case, we will submit the relevant
optimization patches as soon as possible (a rough sketch of the change
follows).

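For reference, a minimal sketch of what the WQ_UNBOUND change could look
like, assuming the handshake workqueue created in smc_init() is the one
that matters here (a hypothetical sketch, not part of this series):

    /* sketch: let handshake works run on any CPU instead of being
     * bound to the submitting one
     */
    smc_hs_wq = alloc_workqueue("smc_hs_wq", WQ_UNBOUND, 0);
    if (!smc_hs_wq)
        return -ENOMEM;
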
Please note that the premise here is that the lock-related problems
must be solved first; otherwise, no matter how we optimize the workqueue,
there won't be much improvement.

Since there are a lot of related changes to the code, please let me
know if you have any questions or suggestions.

Thanks
D. Wythe

D. Wythe (10):
  net/smc: remove locks smc_client_lgr_pending and
    smc_server_lgr_pending
  net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
  net/smc: allow confirm/delete rkey response deliver multiplex
  net/smc: make SMC_LLC_FLOW_RKEY run concurrently
  net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
  net/smc: use read semaphores to reduce unnecessary blocking in
    smc_buf_create() & smcr_buf_unuse()
  net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
  net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
  net/smc: fix potential panic due to unprotected
    smc_llc_srv_add_link()
  net/smc: fix application data exception

 net/smc/af_smc.c   |  40 +++--
 net/smc/smc_core.c | 447 +++++++++++++++++++++++++++++++++++++++++++++++------
 net/smc/smc_core.h |  76 ++++++++-
 net/smc/smc_llc.c  | 286 +++++++++++++++++++++++++---------
 net/smc/smc_llc.h  |   6 +
 net/smc/smc_wr.c   |  10 --
 net/smc/smc_wr.h   |  10 ++
 7 files changed, 728 insertions(+), 147 deletions(-)

-- 
1.8.3.1



* [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-11  3:41   ` kernel test robot
                     ` (3 more replies)
  2022-08-10 17:47 ` [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
                   ` (11 subsequent siblings)
  12 siblings, 4 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

This patch attempts to remove the locks named smc_client_lgr_pending
and smc_server_lgr_pending, which aim to serialize the creation of link
groups. However, once a link group already exists, those locks are
meaningless; worse still, they force incoming connections to be queued
one after another.

Now, link group creation is no longer decided by lock contention, but
through the following strategy:

1. Try to find a suitable link group; if successful, the current
connection is considered a non-first-contact connection. Done.

2. Check the number of connections currently waiting for a suitable
link group to be created; if it is not less than the number of link
groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
increase the number of link groups to be created, and consider the
current connection a first-contact connection. Done.

3. Increase the number of connections currently waiting, and wait to
be woken up.

4. Decrease the number of connections currently waiting, then go back
to step 1.

We wake up the connections that were put to sleep in step 3 through
the SMC link state change event. Once the link moves out of the
SMC_LNK_ACTIVATING state, we decrease the number of link groups to
be created and wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
connections.

In the implementation, we introduce the concept of a lnk cluster, which
is a collection of links with the same characteristics (see
smcr_lnk_cluster_cmpfn() for details); this makes efficient wakeup
possible in the N-versus-1 scenario. A simplified sketch of the whole
strategy follows.

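The sketch below is pseudo-C, abridged from the smc_conn_create()
changes in this patch; find_usable_lgr() and wait_on() are illustrative
placeholders only:

    spin_lock(&lnkc->lock);
again:
    lgr = find_usable_lgr(lnkc);           /* step 1: try to reuse */
    if (lgr)
        goto reuse;                        /* non-first contact, done */
    if (lnkc->conns_pending >= lnkc->pending_capability) {
        /* step 2: enough waiters already, create another link group */
        lnkc->pending_capability += SMC_RMBS_PER_LGR_MAX - 1;
        goto create;                       /* first contact, done */
    }
    lnkc->conns_pending++;                 /* step 3: sleep until woken */
    wait_on(&lnkc->first_contact_waitqueue); /* drops and retakes lock */
    lnkc->conns_pending--;                 /* step 4: retry from step 1 */
    goto again;
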
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c   |  11 +-
 net/smc/smc_core.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/smc/smc_core.h |  48 ++++++++
 net/smc/smc_llc.c  |   9 +-
 4 files changed, 411 insertions(+), 13 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 79c1318..af4b0aa 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
 	if (reason_code)
 		return reason_code;
 
-	mutex_lock(&smc_client_lgr_pending);
 	reason_code = smc_conn_create(smc, ini);
 	if (reason_code) {
-		mutex_unlock(&smc_client_lgr_pending);
 		return reason_code;
 	}
 
@@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
 		if (reason_code)
 			goto connect_abort;
 	}
-	mutex_unlock(&smc_client_lgr_pending);
 
 	smc_copy_sock_settings_to_clc(smc);
 	smc->connect_nonblock = 0;
@@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
 	return 0;
 connect_abort:
 	smc_conn_abort(smc, ini->first_contact_local);
-	mutex_unlock(&smc_client_lgr_pending);
 	smc->connect_nonblock = 0;
 
 	return reason_code;
@@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
 	if (rc)
 		goto out_decl;
 
-	mutex_lock(&smc_server_lgr_pending);
+	if (ini->is_smcd)
+		mutex_lock(&smc_server_lgr_pending);
 	smc_close_init(new_smc);
 	smc_rx_init(new_smc);
 	smc_tx_init(new_smc);
@@ -2415,7 +2412,6 @@ static void smc_listen_work(struct work_struct *work)
 					    ini->first_contact_local, ini);
 		if (rc)
 			goto out_unlock;
-		mutex_unlock(&smc_server_lgr_pending);
 	}
 	smc_conn_save_peer_info(new_smc, cclc);
 	smc_listen_out_connected(new_smc);
@@ -2423,7 +2419,8 @@ static void smc_listen_work(struct work_struct *work)
 	goto out_free;
 
 out_unlock:
-	mutex_unlock(&smc_server_lgr_pending);
+	if (ini->is_smcd)
+		mutex_unlock(&smc_server_lgr_pending);
 out_decl:
 	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
 			   proposal_version);
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index ff49a11..a3338cc 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = {	/* established link groups */
 	.num = 0,
 };
 
+struct smc_lgr_manager smc_lgr_manager = {
+	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
+};
+
 static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
 static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
 
@@ -55,6 +59,282 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
 
 static void smc_link_down_work(struct work_struct *work);
 
+/* SMC-R lnk cluster compare func
+ * All lnks that meet the conditions compared by this function
+ * are logically aggregated into a lnk cluster.
+ * For the server side, the lnk cluster is used to determine whether
+ * a new group needs to be created when processing new incoming connections.
+ * For the client side, lnk cluster is used to determine whether
+ * to wait for link ready (in other words, first contact ready).
+ */
+static int smcr_lnk_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
+{
+	const struct smc_lnk_cluster_compare_arg *key = arg->key;
+	const struct smc_lnk_cluster *lnkc = obj;
+
+	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
+		return 1;
+
+	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
+		return 1;
+
+	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
+	    (key->smcr_version == SMC_V2 ||
+	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
+		return 0;
+
+	return 1;
+}
+
+/* SMC-R lnk cluster hash func */
+static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct smc_lnk_cluster *lnkc = data;
+
+	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
+		+ ((lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn);
+}
+
+/* SMC-R lnk cluster compare arg hash func */
+static u32 smcr_lnk_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct smc_lnk_cluster_compare_arg *key = data;
+
+	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
+		+ ((key->role == SMC_SERV) ? 0 : key->clcqpn);
+}
+
+static const struct rhashtable_params smcr_lnk_cluster_rhl_params = {
+	.head_offset = offsetof(struct smc_lnk_cluster, rnode),
+	.key_len = sizeof(struct smc_lnk_cluster_compare_arg),
+	.obj_cmpfn = smcr_lnk_cluster_cmpfn,
+	.obj_hashfn = smcr_lnk_cluster_hashfn,
+	.hashfn = smcr_lnk_cluster_compare_arg_hashfn,
+	.automatic_shrinking = true,
+};
+
+/* hold a reference for smc_lnk_cluster */
+static inline void smc_lnk_cluster_hold(struct smc_lnk_cluster *lnkc)
+{
+	if (likely(lnkc))
+		refcount_inc(&lnkc->ref);
+}
+
+/* release a reference for smc_lnk_cluster */
+static inline void smc_lnk_cluster_put(struct smc_lnk_cluster *lnkc)
+{
+	bool do_free = false;
+
+	if (!lnkc)
+		return;
+
+	if (refcount_dec_not_one(&lnkc->ref))
+		return;
+
+	spin_lock_bh(&smc_lgr_manager.lock);
+	/* last ref */
+	if (refcount_dec_and_test(&lnkc->ref)) {
+		do_free = true;
+		rhashtable_remove_fast(&smc_lgr_manager.lnk_cluster_maps, &lnkc->rnode,
+				       smcr_lnk_cluster_rhl_params);
+	}
+	spin_unlock_bh(&smc_lgr_manager.lock);
+	if (do_free)
+		kfree(lnkc);
+}
+
+/* Get or create smc_lnk_cluster by key
+ * This function will hold a reference of returned smc_lnk_cluster
+ * or create a new smc_lnk_cluster with the reference initialized to 1.
+ * caller MUST call smc_lnk_cluster_put after this.
+ */
+static inline struct smc_lnk_cluster *
+smcr_lnk_get_or_create_cluster(struct smc_lnk_cluster_compare_arg *key)
+{
+	struct smc_lnk_cluster *lnkc, *tmp_lnkc;
+	bool busy_retry;
+	int err;
+
+	/* retrying (which may sleep) is only allowed outside interrupt context */
+	busy_retry = !in_interrupt();
+
+	spin_lock_bh(&smc_lgr_manager.lock);
+	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
+				      smcr_lnk_cluster_rhl_params);
+	if (!lnkc) {
+		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
+		if (unlikely(!lnkc))
+			goto fail;
+
+		/* init cluster */
+		spin_lock_init(&lnkc->lock);
+		lnkc->role = key->role;
+		if (key->role == SMC_CLNT)
+			lnkc->clcqpn = key->clcqpn;
+		init_waitqueue_head(&lnkc->first_contact_waitqueue);
+		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
+		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
+		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
+		refcount_set(&lnkc->ref, 1);
+
+		do {
+			err = rhashtable_insert_fast(&smc_lgr_manager.lnk_cluster_maps,
+						     &lnkc->rnode, smcr_lnk_cluster_rhl_params);
+
+			/* success or fatal error */
+			if (err != -EBUSY)
+				break;
+
+			/* should not happen in practice right now */
+			if (unlikely(!busy_retry)) {
+				pr_warn_ratelimited("smc: create lnk cluster in softirq\n");
+				break;
+			}
+
+			spin_unlock_bh(&smc_lgr_manager.lock);
+			/* yield */
+			cond_resched();
+			spin_lock_bh(&smc_lgr_manager.lock);
+
+			/* after spin_unlock_bh(), lnk_cluster_maps may be changed */
+			tmp_lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
+							  smcr_lnk_cluster_rhl_params);
+
+			if (unlikely(tmp_lnkc)) {
+				pr_warn_ratelimited("smc: create cluster failed due to duplicate key\n");
+				kfree(lnkc);
+				lnkc = NULL;
+				goto fail;
+			}
+		} while (1);
+
+		if (unlikely(err)) {
+			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)\n", err);
+			kfree(lnkc);
+			lnkc = NULL;
+		}
+	} else {
+		smc_lnk_cluster_hold(lnkc);
+	}
+fail:
+	spin_unlock_bh(&smc_lgr_manager.lock);
+	return lnkc;
+}
+
+/* Get or create a smc_lnk_cluster by lnk
+ * caller MUST call smc_lnk_cluster_put after this.
+ */
+static inline struct smc_lnk_cluster *smcr_lnk_get_cluster(struct smc_link *lnk)
+{
+	struct smc_lnk_cluster_compare_arg key;
+	struct smc_link_group *lgr;
+
+	lgr = lnk->lgr;
+	if (!lgr || lgr->is_smcd)
+		return NULL;
+
+	key.smcr_version = lgr->smc_version;
+	key.peer_systemid = lgr->peer_systemid;
+	key.peer_gid = lnk->peer_gid;
+	key.peer_mac = lnk->peer_mac;
+	key.role	 = lgr->role;
+	if (key.role == SMC_CLNT)
+		key.clcqpn = lnk->peer_qpn;
+
+	return smcr_lnk_get_or_create_cluster(&key);
+}
+
+/* Get or create a smc_lnk_cluster by ini
+ * caller MUST call smc_lnk_cluster_put after this.
+ */
+static inline struct smc_lnk_cluster *
+smcr_lnk_get_cluster_by_ini(struct smc_init_info *ini, int role)
+{
+	struct smc_lnk_cluster_compare_arg key;
+
+	if (ini->is_smcd)
+		return NULL;
+
+	key.smcr_version = ini->smcr_version;
+	key.peer_systemid = ini->peer_systemid;
+	key.peer_gid = ini->peer_gid;
+	key.peer_mac = ini->peer_mac;
+	key.role	= role;
+	if (role == SMC_CLNT)
+		key.clcqpn	= ini->ib_clcqpn;
+
+	return smcr_lnk_get_or_create_cluster(&key);
+}
+
+/* callback when smc link state changes */
+void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk)
+{
+	struct smc_lnk_cluster *lnkc;
+	int nr = 0;
+
+	/* barrier for lnk->state */
+	smp_mb();
+
+	/* only the first link can make connections block on
+	 * first_contact_waitqueue
+	 */
+	if (lnk->link_idx != SMC_SINGLE_LINK)
+		return;
+
+	/* state already seen  */
+	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
+		return;
+
+	lnkc = smcr_lnk_get_cluster(lnk);
+
+	if (unlikely(!lnkc))
+		return;
+
+	spin_lock_bh(&lnkc->lock);
+
+	/* all lnk state changes should follow one of these paths:
+	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DWON (link init failed)
+	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DWON
+	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE
+	 *    -> SMC_LNK_TEAR_DWON
+	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_ACTIVE
+	 *    -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
+	 */
+	switch (lnk->state) {
+	case SMC_LNK_ACTIVATING:
+		/* It's safe to hold a reference without the lock,
+		 * since smcr_lnk_get_cluster already holds one
+		 */
+		smc_lnk_cluster_hold(lnkc);
+		break;
+	case SMC_LNK_TEAR_DWON:
+		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
+			/* smc_lnk_cluster_hold in SMC_LNK_ACTIVATING */
+			smc_lnk_cluster_put(lnkc);
+		fallthrough;
+	case SMC_LNK_ACTIVE:
+	case SMC_LNK_INACTIVE:
+		if (!(lnk->state_record &
+			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
+			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
+			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
+			/* TODO: wake up just one to perform first contact
+			 * if the recorded states have no SMC_LNK_ACTIVE
+			 */
+			nr = SMC_RMBS_PER_LGR_MAX - 1;
+		}
+		break;
+	case SMC_LNK_UNUSED:
+		pr_warn_ratelimited("net/smc: invalid lnk state\n");
+		break;
+	}
+	SMC_LNK_STATE_RECORD(lnk, lnk->state);
+	spin_unlock_bh(&lnkc->lock);
+	if (nr)
+		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
+	smc_lnk_cluster_put(lnkc);	/* smc_lnk_cluster_hold in smcr_lnk_get_cluster */
+}
+
 /* return head of link group list and its lock for a given link group */
 static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
 						  spinlock_t **lgr_lock)
@@ -651,8 +931,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		struct smc_link *lnk = &lgr->lnk[i];
 
-		if (smc_link_sendable(lnk))
+		if (smc_link_sendable(lnk)) {
 			lnk->state = SMC_LNK_INACTIVE;
+			smcr_lnk_cluster_on_lnk_state(lnk);
+		}
 	}
 	wake_up_all(&lgr->llc_msg_waiter);
 	wake_up_all(&lgr->llc_flow_waiter);
@@ -762,6 +1044,9 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	atomic_set(&lnk->conn_cnt, 0);
 	smc_llc_link_set_uid(lnk);
 	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
+	lnk->peer_qpn = ini->ib_clcqpn;
+	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
+	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
 	if (!lnk->smcibdev->initialized) {
 		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
 		if (rc)
@@ -792,6 +1077,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	if (rc)
 		goto destroy_qp;
 	lnk->state = SMC_LNK_ACTIVATING;
+	smcr_lnk_cluster_on_lnk_state(lnk);
 	return 0;
 
 destroy_qp:
@@ -806,6 +1092,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	smc_ibdev_cnt_dec(lnk);
 	put_device(&lnk->smcibdev->ibdev->dev);
 	smcibdev = lnk->smcibdev;
+	lnk->state = SMC_LNK_TEAR_DWON;
+	smcr_lnk_cluster_on_lnk_state(lnk);
 	memset(lnk, 0, sizeof(struct smc_link));
 	lnk->state = SMC_LNK_UNUSED;
 	if (!atomic_dec_return(&smcibdev->lnk_cnt))
@@ -1263,6 +1551,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
 	if (!lnk->lgr || lnk->clearing ||
 	    lnk->state == SMC_LNK_UNUSED)
 		return;
+	lnk->state = SMC_LNK_TEAR_DWON;
+	smcr_lnk_cluster_on_lnk_state(lnk);
 	lnk->clearing = 1;
 	lnk->peer_qpn = 0;
 	smc_llc_link_clear(lnk, log);
@@ -1712,6 +2002,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
 {
 	if (smc_link_downing(&lnk->state)) {
 		trace_smcr_link_down(lnk, __builtin_return_address(0));
+		smcr_lnk_cluster_on_lnk_state(lnk);
 		smcr_link_down(lnk);
 	}
 }
@@ -1721,6 +2012,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
 {
 	if (smc_link_downing(&lnk->state)) {
 		trace_smcr_link_down(lnk, __builtin_return_address(0));
+		smcr_lnk_cluster_on_lnk_state(lnk);
 		schedule_work(&lnk->link_down_wrk);
 	}
 }
@@ -1850,11 +2142,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 {
 	struct smc_connection *conn = &smc->conn;
 	struct net *net = sock_net(&smc->sk);
+	DECLARE_WAITQUEUE(wait, current);
+	struct smc_lnk_cluster *lnkc = NULL;
 	struct list_head *lgr_list;
 	struct smc_link_group *lgr;
 	enum smc_lgr_role role;
 	spinlock_t *lgr_lock;
-	int rc = 0;
+	int rc = 0, timeo = CLC_WAIT_TIME;
 
 	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
 				  &smc_lgr_list.list;
@@ -1862,12 +2156,26 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 				  &smc_lgr_list.lock;
 	ini->first_contact_local = 1;
 	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
-	if (role == SMC_CLNT && ini->first_contact_peer)
+
+	if (!ini->is_smcd) {
+		lnkc = smcr_lnk_get_cluster_by_ini(ini, role);
+		if (unlikely(!lnkc))
+			return SMC_CLC_DECL_INTERR;
+	}
+
+	if (role == SMC_CLNT && ini->first_contact_peer) {
+		/* first_contact */
+		if (lnkc) {
+			spin_lock_bh(&lnkc->lock);
+			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
+			spin_unlock_bh(&lnkc->lock);
+		}
 		/* create new link group as well */
 		goto create;
+	}
 
 	/* determine if an existing link group can be reused */
 	spin_lock_bh(lgr_lock);
+	if (lnkc)
+		spin_lock(&lnkc->lock);
+again:
 	list_for_each_entry(lgr, lgr_list, list) {
 		write_lock_bh(&lgr->conns_lock);
 		if ((ini->is_smcd ?
@@ -1894,9 +2202,33 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		}
 		write_unlock_bh(&lgr->conns_lock);
 	}
+	if (lnkc && ini->first_contact_local) {
+		if (lnkc->pending_capability > lnkc->conns_pending) {
+			lnkc->conns_pending++;
+			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
+			spin_unlock(&lnkc->lock);
+			spin_unlock_bh(lgr_lock);
+			set_current_state(TASK_INTERRUPTIBLE);
+			/* need to wait at least once first contact done */
+			timeo = schedule_timeout(timeo);
+			set_current_state(TASK_RUNNING);
+			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
+			spin_lock_bh(lgr_lock);
+			spin_lock(&lnkc->lock);
+
+			lnkc->conns_pending--;
+			if (timeo)
+				goto again;
+		}
+		if (role == SMC_SERV) {
+			/* first_contact */
+			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
+		}
+	}
+	if (lnkc)
+		spin_unlock(&lnkc->lock);
 	spin_unlock_bh(lgr_lock);
 	if (rc)
-		return rc;
+		goto out;
 
 	if (role == SMC_CLNT && !ini->first_contact_peer &&
 	    ini->first_contact_local) {
@@ -1904,7 +2236,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		 * a new one
 		 * send out_of_sync decline, reason synchr. error
 		 */
-		return SMC_CLC_DECL_SYNCERR;
+		rc = SMC_CLC_DECL_SYNCERR;
+		goto out;
 	}
 
 create:
@@ -1941,6 +2274,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 #endif
 
 out:
+	/* smc_lnk_cluster_hold in smcr_lnk_get_or_create_cluster */
+	smc_lnk_cluster_put(lnkc);
 	return rc;
 }
 
@@ -2599,12 +2934,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
 
 int __init smc_core_init(void)
 {
+	/* init smc lnk cluster maps */
+	rhashtable_init(&smc_lgr_manager.lnk_cluster_maps, &smcr_lnk_cluster_rhl_params);
 	return register_reboot_notifier(&smc_reboot_notifier);
 }
 
+static void smc_lnk_cluster_free_cb(void *ptr, void *arg)
+{
+	pr_warn("smc: smc lnk cluster refcnt leak.\n");
+	kfree(ptr);
+}
+
 /* Called (from smc_exit) when module is removed */
 void smc_core_exit(void)
 {
 	unregister_reboot_notifier(&smc_reboot_notifier);
 	smc_lgrs_shutdown();
+	/* destroy smc lnk cluster maps */
+	rhashtable_free_and_destroy(&smc_lgr_manager.lnk_cluster_maps, smc_lnk_cluster_free_cb,
+				    NULL);
 }
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index fe8b524..199f533 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -15,6 +15,7 @@
 #include <linux/atomic.h>
 #include <linux/smc.h>
 #include <linux/pci.h>
+#include <linux/rhashtable.h>
 #include <rdma/ib_verbs.h>
 #include <net/genetlink.h>
 
@@ -29,18 +30,62 @@ struct smc_lgr_list {			/* list of link group definition */
 	u32			num;	/* unique link group number */
 };
 
+struct smc_lgr_manager {		/* manager for link group */
+	struct rhashtable	lnk_cluster_maps;	/* maps of smc_lnk_cluster */
+	spinlock_t		lock;	/* lock for lnk_cluster_maps */
+};
+
+struct smc_lnk_cluster {
+	struct rhash_head	rnode;	/* node for rhashtable */
+	struct wait_queue_head	first_contact_waitqueue;
+					/* queue for non-first-contact connections
+					 * to wait for first contact completion.
+					 */
+	spinlock_t		lock;	/* protection for link group */
+	refcount_t		ref;	/* refcount for cluster */
+	unsigned long		pending_capability;
+	unsigned long		pending_capability;
+					/* maximum number of pending connections
+					 * that need to wait for first contact
+					 * completion.
+					 */
+	unsigned long		conns_pending;
+					/* connections that are waiting for first
+					 * contact completion
+					 */
+	u8		peer_systemid[SMC_SYSTEMID_LEN];
+	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
+	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer */
+	int		clcqpn;
+	int		role;
+};
+
 enum smc_lgr_role {		/* possible roles of a link group */
 	SMC_CLNT,	/* client */
 	SMC_SERV	/* server */
 };
 
+struct smc_lnk_cluster_compare_arg {	/* key for smc_lnk_cluster */
+	int	smcr_version;
+	enum smc_lgr_role role;
+	u8	*peer_systemid;
+	u8	*peer_gid;
+	u8	*peer_mac;
+	int clcqpn;
+};
+
 enum smc_link_state {			/* possible states of a link */
 	SMC_LNK_UNUSED,		/* link is unused */
 	SMC_LNK_INACTIVE,	/* link is inactive */
 	SMC_LNK_ACTIVATING,	/* link is being activated */
 	SMC_LNK_ACTIVE,		/* link is active */
+	SMC_LNK_TEAR_DWON,	/* link is torn down */
 };
 
+#define SMC_LNK_STATE_BIT(state)	(1 << (state))
+
+#define	SMC_LNK_STATE_RECORD(lnk, state)	\
+	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
+
 #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
 #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
 
@@ -145,6 +190,7 @@ struct smc_link {
 	int			ndev_ifidx; /* network device ifindex */
 
 	enum smc_link_state	state;		/* state of link */
+	int			state_record;		/* bitmask of states seen so far */
 	struct delayed_work	llc_testlink_wrk; /* testlink worker */
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
@@ -557,6 +603,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
 int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
 int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
 
+void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
+
 static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
 {
 	return link->lgr;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 175026a..8134c15 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
 		goto out;
 out_clear_lnk:
 	lnk_new->state = SMC_LNK_INACTIVE;
+	smcr_lnk_cluster_on_lnk_state(lnk_new);
 	smcr_link_clear(lnk_new, false);
 out_reject:
 	smc_llc_cli_add_link_reject(qentry);
@@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
 		return; /* no asymmetric link */
 	if (!smc_link_downing(&lnk_asym->state))
 		return;
+	smcr_lnk_cluster_on_lnk_state(lnk_asym);
 	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
 	smc_wr_tx_wait_no_pending_sends(lnk_asym);
 	if (!lnk_new)
@@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
 out_err:
 	if (link_new) {
 		link_new->state = SMC_LNK_INACTIVE;
+		smcr_lnk_cluster_on_lnk_state(link_new);
 		smcr_link_clear(link_new, false);
 	}
 out:
@@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
 	del_llc->reason = 0;
 	smc_llc_send_message(lnk, &qentry->msg); /* response */
 
-	if (smc_link_downing(&lnk_del->state))
+	if (smc_link_downing(&lnk_del->state)) {
+		smcr_lnk_cluster_on_lnk_state(lnk_del);
 		smc_switch_conns(lgr, lnk_del, false);
+	}
 	smcr_link_clear(lnk_del, true);
 
 	active_links = smc_llc_active_link_count(lgr);
@@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
 		goto out; /* asymmetric link already deleted */
 
 	if (smc_link_downing(&lnk_del->state)) {
+		smcr_lnk_cluster_on_lnk_state(lnk_del);
 		if (smc_switch_conns(lgr, lnk_del, false))
 			smc_wr_tx_wait_no_pending_sends(lnk_del);
 	}
@@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
 		schedule_delayed_work(&link->llc_testlink_wrk,
 				      link->llc_testlink_time);
 	}
+	smcr_lnk_cluster_on_lnk_state(link);
 }
 
 /* called in worker context */
-- 
1.8.3.1



* [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-16  7:58   ` Tony Lu
  2022-08-10 17:47 ` [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

As the commit "net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error
cause by server" mentioned, that fix works only when all connection
creations are completely protected by the smc_server_lgr_pending lock.
Since we have now removed that lock, we need to fix the issue again.

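In effect, a link group can only be reused while its free rtoken slots
still outnumber the connections that were assigned the lgr but have not
registered their rtoken yet. A sketch of that predicate, with names
taken from the diff below:

    /* sketch: reuse predicate, evaluated under lgr->conns_lock */
    free_slots = SMC_RMBS_PER_LGR_MAX -
                 bitmap_weight(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX);
    reusable = free_slots > atomic_read(&lgr->rtoken_pendings);
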
Fixes: 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c   |  2 ++
 net/smc/smc_core.c | 11 ++++++++---
 net/smc/smc_core.h | 21 +++++++++++++++++++++
 3 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index af4b0aa..c0842a9 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2413,6 +2413,7 @@ static void smc_listen_work(struct work_struct *work)
 		if (rc)
 			goto out_unlock;
 	}
+	smc_conn_leave_rtoken_pending(new_smc, ini);
 	smc_conn_save_peer_info(new_smc, cclc);
 	smc_listen_out_connected(new_smc);
 	SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
@@ -2422,6 +2423,7 @@ static void smc_listen_work(struct work_struct *work)
 	if (ini->is_smcd)
 		mutex_unlock(&smc_server_lgr_pending);
 out_decl:
+	smc_conn_leave_rtoken_pending(new_smc, ini);
 	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
 			   proposal_version);
 out_free:
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index a3338cc..61a3854 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2190,14 +2190,19 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		     lgr->vlan_id == ini->vlan_id) &&
 		    (role == SMC_CLNT || ini->is_smcd ||
 		    (lgr->conns_num < SMC_RMBS_PER_LGR_MAX &&
-		      !bitmap_full(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)))) {
+		    (SMC_RMBS_PER_LGR_MAX -
+			bitmap_weight(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)
+				> atomic_read(&lgr->rtoken_pendings))))) {
 			/* link group found */
 			ini->first_contact_local = 0;
 			conn->lgr = lgr;
 			rc = smc_lgr_register_conn(conn, false);
 			write_unlock_bh(&lgr->conns_lock);
-			if (!rc && delayed_work_pending(&lgr->free_work))
-				cancel_delayed_work(&lgr->free_work);
+			if (!rc) {
+				smc_conn_enter_rtoken_pending(smc, ini);
+				if (delayed_work_pending(&lgr->free_work))
+					cancel_delayed_work(&lgr->free_work);
+			}
 			break;
 		}
 		write_unlock_bh(&lgr->conns_lock);
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 199f533..acc2869 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -293,6 +293,9 @@ struct smc_link_group {
 	struct rb_root		conns_all;	/* connection tree */
 	rwlock_t		conns_lock;	/* protects conns_all */
 	unsigned int		conns_num;	/* current # of connections */
+	atomic_t		rtoken_pendings;/* number of connections that were
+						 * assigned this lgr but have no
+						 * rtoken registered yet
+						 */
 	unsigned short		vlan_id;	/* vlan id of link group */
 
 	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
@@ -603,6 +606,24 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
 int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
 int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
 
+static inline void smc_conn_enter_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct smc_link_group *lgr;
+
+	lgr = smc->conn.lgr;
+	if (lgr && !ini->first_contact_local)
+		atomic_inc(&lgr->rtoken_pendings);
+}
+
+static inline void smc_conn_leave_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct smc_link_group *lgr;
+
+	lgr = smc->conn.lgr;
+	if (lgr && !ini->first_contact_local)
+		atomic_dec(&lgr->rtoken_pendings);
+}
+
 void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
 
 static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
-- 
1.8.3.1



* [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-16  8:17   ` Tony Lu
  2022-08-10 17:47 ` [PATCH net-next 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D. Wythe
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

We know that all flows except confirm_rkey and delete_rkey are
exclusive; confirm/delete rkey flows can run concurrently (local
and remote).

Although the protocol allows it, all flows are in fact mutually
exclusive in the implementation, because waiting for LLC messages
is serialized.

This prolongs the time to establish or destroy an SMC-R connection,
as connections have to be queued in smc_llc_wait.

We use rtokens or rkeys to correlate a confirm/delete rkey message
with its response.

Before sending a message, we put a context holding the rtokens or
rkey into the wait queue. When a response message is received, we
wake up the context that carries the same rtokens or rkey as the
response message.

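The pattern amounts to a keyed wakeup. A minimal sketch, assuming each
waiter embeds a copy of its request and a custom wake function compares
it against the incoming response; rkeys_match() is an illustrative
placeholder, the real matching lives in
smc_llc_rkey_response_wake_function() below:

    static int rkey_wake(struct wait_queue_entry *wqe, unsigned int mode,
                         int sync, void *key)
    {
        /* key carries the incoming response qentry, NULL otherwise */
        struct smc_llc_qentry *waiting = wqe->private;

        if (!key || !rkeys_match(waiting, key))
            return 0;       /* not our response, keep waiting */
        return woken_wake_function(wqe, mode, sync, NULL);
    }
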
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/smc_llc.c | 174 +++++++++++++++++++++++++++++++++++++++++-------------
 net/smc/smc_wr.c  |  10 ----
 net/smc/smc_wr.h  |  10 ++++
 3 files changed, 143 insertions(+), 51 deletions(-)

diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 8134c15..b026df2 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -200,6 +200,7 @@ struct smc_llc_msg_delete_rkey_v2 {	/* type 0x29 */
 struct smc_llc_qentry {
 	struct list_head list;
 	struct smc_link *link;
+	void *private;
 	union smc_llc_msg msg;
 };
 
@@ -479,19 +480,17 @@ int smc_llc_send_confirm_link(struct smc_link *link,
 	return rc;
 }
 
-/* send LLC confirm rkey request */
-static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
-				     struct smc_buf_desc *rmb_desc)
+/* build LLC confirm rkey request */
+static int smc_llc_build_confirm_rkey_request(struct smc_link *send_link,
+					      struct smc_buf_desc *rmb_desc,
+					      struct smc_wr_tx_pend_priv **priv)
 {
 	struct smc_llc_msg_confirm_rkey *rkeyllc;
-	struct smc_wr_tx_pend_priv *pend;
 	struct smc_wr_buf *wr_buf;
 	struct smc_link *link;
 	int i, rc, rtok_ix;
 
-	if (!smc_wr_tx_link_hold(send_link))
-		return -ENOLINK;
-	rc = smc_llc_add_pending_send(send_link, &wr_buf, &pend);
+	rc = smc_llc_add_pending_send(send_link, &wr_buf, priv);
 	if (rc)
 		goto put_out;
 	rkeyllc = (struct smc_llc_msg_confirm_rkey *)wr_buf;
@@ -521,25 +520,20 @@ static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
 		cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
 		cpu_to_be64((u64)sg_dma_address
 			    (rmb_desc->sgt[send_link->link_idx].sgl));
-	/* send llc message */
-	rc = smc_wr_tx_send(send_link, pend);
 put_out:
-	smc_wr_tx_link_put(send_link);
 	return rc;
 }
 
-/* send LLC delete rkey request */
-static int smc_llc_send_delete_rkey(struct smc_link *link,
-				    struct smc_buf_desc *rmb_desc)
+/* build LLC delete rkey request */
+static int smc_llc_build_delete_rkey_request(struct smc_link *link,
+					     struct smc_buf_desc *rmb_desc,
+					     struct smc_wr_tx_pend_priv **priv)
 {
 	struct smc_llc_msg_delete_rkey *rkeyllc;
-	struct smc_wr_tx_pend_priv *pend;
 	struct smc_wr_buf *wr_buf;
 	int rc;
 
-	if (!smc_wr_tx_link_hold(link))
-		return -ENOLINK;
-	rc = smc_llc_add_pending_send(link, &wr_buf, &pend);
+	rc = smc_llc_add_pending_send(link, &wr_buf, priv);
 	if (rc)
 		goto put_out;
 	rkeyllc = (struct smc_llc_msg_delete_rkey *)wr_buf;
@@ -548,10 +542,7 @@ static int smc_llc_send_delete_rkey(struct smc_link *link,
 	smc_llc_init_msg_hdr(&rkeyllc->hd, link->lgr, sizeof(*rkeyllc));
 	rkeyllc->num_rkeys = 1;
 	rkeyllc->rkey[0] = htonl(rmb_desc->mr[link->link_idx]->rkey);
-	/* send llc message */
-	rc = smc_wr_tx_send(link, pend);
 put_out:
-	smc_wr_tx_link_put(link);
 	return rc;
 }
 
@@ -2023,7 +2014,8 @@ static void smc_llc_rx_response(struct smc_link *link,
 	case SMC_LLC_DELETE_RKEY:
 		if (flowtype != SMC_LLC_FLOW_RKEY || flow->qentry)
 			break;	/* drop out-of-flow response */
-		goto assign;
+		__wake_up(&link->lgr->llc_msg_waiter, TASK_NORMAL, 1, qentry);
+		goto free;
 	case SMC_LLC_CONFIRM_RKEY_CONT:
 		/* not used because max links is 3 */
 		break;
@@ -2032,6 +2024,7 @@ static void smc_llc_rx_response(struct smc_link *link,
 					   qentry->msg.raw.hdr.common.type);
 		break;
 	}
+free:
 	kfree(qentry);
 	return;
 assign:
@@ -2191,25 +2184,98 @@ void smc_llc_link_clear(struct smc_link *link, bool log)
 	cancel_delayed_work_sync(&link->llc_testlink_wrk);
 }
 
+static int smc_llc_rkey_response_wake_function(struct wait_queue_entry *wq_entry,
+					       unsigned int mode, int sync, void *key)
+{
+	struct smc_llc_qentry *except, *incoming;
+	u8 except_llc_type;
+
+	/* not a rkey response */
+	if (!key)
+		return 0;
+
+	except = wq_entry->private;
+	incoming = key;
+
+	except_llc_type = except->msg.raw.hdr.common.llc_type;
+
+	/* except LLC MSG TYPE mismatch */
+	if (except_llc_type != incoming->msg.raw.hdr.common.llc_type)
+		return 0;
+
+	switch (except_llc_type) {
+	case SMC_LLC_CONFIRM_RKEY:
+		if (memcmp(except->msg.confirm_rkey.rtoken, incoming->msg.confirm_rkey.rtoken,
+			   sizeof(struct smc_rmb_rtoken) *
+			   except->msg.confirm_rkey.rtoken[0].num_rkeys))
+			return 0;
+		break;
+	case SMC_LLC_DELETE_RKEY:
+		if (memcmp(except->msg.delete_rkey.rkey, incoming->msg.delete_rkey.rkey,
+			   sizeof(__be32) * except->msg.delete_rkey.num_rkeys))
+			return 0;
+		break;
+	default:
+		panic("invalid except llc msg %d", except_llc_type);
+		return 0;
+	}
+
+	/* match, save hdr */
+	memcpy(&except->msg.raw.hdr, &incoming->msg.raw.hdr, sizeof(except->msg.raw.hdr));
+
+	wq_entry->private = except->private;
+	return woken_wake_function(wq_entry, mode, sync, NULL);
+}
+
 /* register a new rtoken at the remote peer (for all links) */
 int smc_llc_do_confirm_rkey(struct smc_link *send_link,
 			    struct smc_buf_desc *rmb_desc)
 {
+	long timeout = SMC_LLC_WAIT_TIME;
 	struct smc_link_group *lgr = send_link->lgr;
-	struct smc_llc_qentry *qentry = NULL;
-	int rc = 0;
+	struct smc_llc_qentry qentry;
+	struct smc_wr_tx_pend *pend;
+	struct smc_wr_tx_pend_priv *priv;
+	DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
+	int rc = 0, flags;
 
-	rc = smc_llc_send_confirm_rkey(send_link, rmb_desc);
+	if (!smc_wr_tx_link_hold(send_link))
+		return -ENOLINK;
+
+	rc = smc_llc_build_confirm_rkey_request(send_link, rmb_desc, &priv);
 	if (rc)
 		goto out;
-	/* receive CONFIRM RKEY response from server over RoCE fabric */
-	qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
-			      SMC_LLC_CONFIRM_RKEY);
-	if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
+
+	pend = container_of(priv, struct smc_wr_tx_pend, priv);
+	/* make a copy of send msg */
+	memcpy(&qentry.msg.confirm_rkey, send_link->wr_tx_bufs[pend->idx].raw,
+	       sizeof(qentry.msg.confirm_rkey));
+
+	qentry.private = wait.private;
+	wait.private = &qentry;
+
+	add_wait_queue(&lgr->llc_msg_waiter, &wait);
+
+	/* send llc message */
+	rc = smc_wr_tx_send(send_link, priv);
+	smc_wr_tx_link_put(send_link);
+	if (rc) {
+		remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+		goto out;
+	}
+
+	while (!signal_pending(current) && timeout) {
+		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
+		if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
+			break;
+	}
+
+	remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+	flags = qentry.msg.raw.hdr.flags;
+
+	if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
 		rc = -EFAULT;
 out:
-	if (qentry)
-		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
 	return rc;
 }
 
@@ -2217,26 +2283,52 @@ int smc_llc_do_confirm_rkey(struct smc_link *send_link,
 int smc_llc_do_delete_rkey(struct smc_link_group *lgr,
 			   struct smc_buf_desc *rmb_desc)
 {
-	struct smc_llc_qentry *qentry = NULL;
+	long timeout = SMC_LLC_WAIT_TIME;
+	struct smc_llc_qentry qentry;
+	struct smc_wr_tx_pend *pend;
 	struct smc_link *send_link;
-	int rc = 0;
+	struct smc_wr_tx_pend_priv *priv;
+	DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
+	int rc = 0, flags;
 
 	send_link = smc_llc_usable_link(lgr);
-	if (!send_link)
+	if (!send_link || !smc_wr_tx_link_hold(send_link))
 		return -ENOLINK;
 
-	/* protected by llc_flow control */
-	rc = smc_llc_send_delete_rkey(send_link, rmb_desc);
+	rc = smc_llc_build_delete_rkey_request(send_link, rmb_desc, &priv);
 	if (rc)
 		goto out;
-	/* receive DELETE RKEY response from server over RoCE fabric */
-	qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
-			      SMC_LLC_DELETE_RKEY);
-	if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
+
+	pend = container_of(priv, struct smc_wr_tx_pend, priv);
+	/* make a copy of send msg */
+	memcpy(&qentry.msg.delete_rkey, send_link->wr_tx_bufs[pend->idx].raw,
+	       sizeof(qentry.msg.delete_rkey));
+
+	qentry.private = wait.private;
+	wait.private = &qentry;
+
+	add_wait_queue(&lgr->llc_msg_waiter, &wait);
+
+	/* send llc message */
+	rc = smc_wr_tx_send(send_link, priv);
+	smc_wr_tx_link_put(send_link);
+	if (rc) {
+		remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+		goto out;
+	}
+
+	while (!signal_pending(current) && timeout) {
+		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
+		if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
+			break;
+	}
+
+	remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+	flags = qentry.msg.raw.hdr.flags;
+
+	if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
 		rc = -EFAULT;
 out:
-	if (qentry)
-		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
 	return rc;
 }
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 26f8f24..52af94f 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -37,16 +37,6 @@
 static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
 static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
 
-struct smc_wr_tx_pend {	/* control data for a pending send request */
-	u64			wr_id;		/* work request id sent */
-	smc_wr_tx_handler	handler;
-	enum ib_wc_status	wc_status;	/* CQE status */
-	struct smc_link		*link;
-	u32			idx;
-	struct smc_wr_tx_pend_priv priv;
-	u8			compl_requested;
-};
-
 /******************************** send queue *********************************/
 
 /*------------------------------- completion --------------------------------*/
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index a54e90a..9946ed5 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -46,6 +46,16 @@ struct smc_wr_rx_handler {
 	u8			type;
 };
 
+struct smc_wr_tx_pend {	/* control data for a pending send request */
+	u64			wr_id;		/* work request id sent */
+	smc_wr_tx_handler	handler;
+	enum ib_wc_status	wc_status;	/* CQE status */
+	struct smc_link		*link;
+	u32			idx;
+	struct smc_wr_tx_pend_priv priv;
+	u8			compl_requested;
+};
+
 /* Only used by RDMA write WRs.
  * All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly
  */
-- 
1.8.3.1



* [PATCH net-next 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (2 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D. Wythe
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

Once confirm/delete rkey responses can be delivered multiplexed, we
can allow an SMC_LLC_FLOW_RKEY flow to be started (remote) or
initiated (local) in parallel.

This patch counts the flows executing in parallel; only when the
count reaches zero is the current flow type removed.

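Conceptually, the flow slot now behaves like a refcounted shared lock
for rkey flows. A rough sketch of the idea, with parallel_refcnt as in
the diff below and incoming_is_rkey_msg as an illustrative placeholder:

    /* flow start: join an already-active rkey flow instead of queueing */
    if (flow->type == SMC_LLC_FLOW_RKEY && incoming_is_rkey_msg)
        refcount_inc(&flow->parallel_refcnt);

    /* flow stop: only the last participant clears the flow type */
    if (refcount_dec_and_test(&flow->parallel_refcnt))
        flow->type = SMC_LLC_FLOW_NONE;
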
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/smc_core.h |  1 +
 net/smc/smc_llc.c  | 69 +++++++++++++++++++++++++++++++++++++++++-------------
 net/smc/smc_llc.h  |  6 +++++
 3 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index acc2869..8490676 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -286,6 +286,7 @@ enum smc_llc_flowtype {
 struct smc_llc_flow {
 	enum smc_llc_flowtype type;
 	struct smc_llc_qentry *qentry;
+	refcount_t	parallel_refcnt;
 };
 
 struct smc_link_group {
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index b026df2..965a3cc 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -231,10 +231,18 @@ static inline void smc_llc_flow_qentry_set(struct smc_llc_flow *flow,
 	flow->qentry = qentry;
 }
 
-static void smc_llc_flow_parallel(struct smc_link_group *lgr, u8 flow_type,
+static void smc_llc_flow_parallel(struct smc_link_group *lgr, struct smc_llc_flow *flow,
 				  struct smc_llc_qentry *qentry)
 {
 	u8 msg_type = qentry->msg.raw.hdr.common.llc_type;
+	u8 flow_type = flow->type;
+
+	/* SMC_LLC_FLOW_RKEY can be parallel */
+	if (flow_type == SMC_LLC_FLOW_RKEY &&
+	    (msg_type == SMC_LLC_CONFIRM_RKEY || msg_type == SMC_LLC_DELETE_RKEY)) {
+		refcount_inc(&flow->parallel_refcnt);
+		return;
+	}
 
 	if ((msg_type == SMC_LLC_ADD_LINK || msg_type == SMC_LLC_DELETE_LINK) &&
 	    flow_type != msg_type && !lgr->delayed_event) {
@@ -261,7 +269,7 @@ static bool smc_llc_flow_start(struct smc_llc_flow *flow,
 	spin_lock_bh(&lgr->llc_flow_lock);
 	if (flow->type) {
 		/* a flow is already active */
-		smc_llc_flow_parallel(lgr, flow->type, qentry);
+		smc_llc_flow_parallel(lgr, flow, qentry);
 		spin_unlock_bh(&lgr->llc_flow_lock);
 		return false;
 	}
@@ -280,6 +288,7 @@ static bool smc_llc_flow_start(struct smc_llc_flow *flow,
 		flow->type = SMC_LLC_FLOW_NONE;
 	}
 	smc_llc_flow_qentry_set(flow, qentry);
+	refcount_set(&flow->parallel_refcnt, 1);
 	spin_unlock_bh(&lgr->llc_flow_lock);
 	return true;
 }
@@ -289,6 +298,7 @@ int smc_llc_flow_initiate(struct smc_link_group *lgr,
 			  enum smc_llc_flowtype type)
 {
 	enum smc_llc_flowtype allowed_remote = SMC_LLC_FLOW_NONE;
+	bool accept = false;
 	int rc;
 
 	/* all flows except confirm_rkey and delete_rkey are exclusive,
@@ -300,10 +310,39 @@ int smc_llc_flow_initiate(struct smc_link_group *lgr,
 	if (list_empty(&lgr->list))
 		return -ENODEV;
 	spin_lock_bh(&lgr->llc_flow_lock);
-	if (lgr->llc_flow_lcl.type == SMC_LLC_FLOW_NONE &&
-	    (lgr->llc_flow_rmt.type == SMC_LLC_FLOW_NONE ||
-	     lgr->llc_flow_rmt.type == allowed_remote)) {
-		lgr->llc_flow_lcl.type = type;
+
+	/* Flow is initialized only if the following conditions are met:
+	 * incoming flow	local flow		remote flow
+	 * exclusive		NONE			NONE
+	 * SMC_LLC_FLOW_RKEY	SMC_LLC_FLOW_RKEY	SMC_LLC_FLOW_RKEY
+	 * SMC_LLC_FLOW_RKEY	NONE			SMC_LLC_FLOW_RKEY
+	 * SMC_LLC_FLOW_RKEY	SMC_LLC_FLOW_RKEY	NONE
+	 */
+	switch (type) {
+	case SMC_LLC_FLOW_RKEY:
+		if (!SMC_IS_PARALLEL_FLOW(lgr->llc_flow_lcl.type))
+			break;
+		if (!SMC_IS_PARALLEL_FLOW(lgr->llc_flow_rmt.type))
+			break;
+		/* accepted */
+		accept = true;
+		break;
+	default:
+		if (!SMC_IS_NONE_FLOW(lgr->llc_flow_lcl.type))
+			break;
+		if (!SMC_IS_NONE_FLOW(lgr->llc_flow_rmt.type))
+			break;
+		/* accepted */
+		accept = true;
+		break;
+	}
+	if (accept) {
+		if (SMC_IS_NONE_FLOW(lgr->llc_flow_lcl.type)) {
+			lgr->llc_flow_lcl.type = type;
+			refcount_set(&lgr->llc_flow_lcl.parallel_refcnt, 1);
+		} else {
+			refcount_inc(&lgr->llc_flow_lcl.parallel_refcnt);
+		}
 		spin_unlock_bh(&lgr->llc_flow_lock);
 		return 0;
 	}
@@ -322,6 +361,10 @@ int smc_llc_flow_initiate(struct smc_link_group *lgr,
 void smc_llc_flow_stop(struct smc_link_group *lgr, struct smc_llc_flow *flow)
 {
 	spin_lock_bh(&lgr->llc_flow_lock);
+	if (!refcount_dec_and_test(&flow->parallel_refcnt)) {
+		spin_unlock_bh(&lgr->llc_flow_lock);
+		return;
+	}
 	memset(flow, 0, sizeof(*flow));
 	flow->type = SMC_LLC_FLOW_NONE;
 	spin_unlock_bh(&lgr->llc_flow_lock);
@@ -1729,16 +1772,14 @@ static void smc_llc_delete_link_work(struct work_struct *work)
 }
 
 /* process a confirm_rkey request from peer, remote flow */
-static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr)
+static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr, struct smc_llc_qentry *qentry)
 {
 	struct smc_llc_msg_confirm_rkey *llc;
-	struct smc_llc_qentry *qentry;
 	struct smc_link *link;
 	int num_entries;
 	int rk_idx;
 	int i;
 
-	qentry = lgr->llc_flow_rmt.qentry;
 	llc = &qentry->msg.confirm_rkey;
 	link = qentry->link;
 
@@ -1765,19 +1806,16 @@ static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr)
 	llc->hd.flags |= SMC_LLC_FLAG_RESP;
 	smc_llc_init_msg_hdr(&llc->hd, link->lgr, sizeof(*llc));
 	smc_llc_send_message(link, &qentry->msg);
-	smc_llc_flow_qentry_del(&lgr->llc_flow_rmt);
 }
 
 /* process a delete_rkey request from peer, remote flow */
-static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr)
+static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr, struct smc_llc_qentry *qentry)
 {
 	struct smc_llc_msg_delete_rkey *llc;
-	struct smc_llc_qentry *qentry;
 	struct smc_link *link;
 	u8 err_mask = 0;
 	int i, max;
 
-	qentry = lgr->llc_flow_rmt.qentry;
 	llc = &qentry->msg.delete_rkey;
 	link = qentry->link;
 
@@ -1815,7 +1853,6 @@ static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr)
 finish:
 	llc->hd.flags |= SMC_LLC_FLAG_RESP;
 	smc_llc_send_message(link, &qentry->msg);
-	smc_llc_flow_qentry_del(&lgr->llc_flow_rmt);
 }
 
 static void smc_llc_protocol_violation(struct smc_link_group *lgr, u8 type)
@@ -1916,7 +1953,7 @@ static void smc_llc_event_handler(struct smc_llc_qentry *qentry)
 		/* new request from remote, assign to remote flow */
 		if (smc_llc_flow_start(&lgr->llc_flow_rmt, qentry)) {
 			/* process here, does not wait for more llc msgs */
-			smc_llc_rmt_conf_rkey(lgr);
+			smc_llc_rmt_conf_rkey(lgr, qentry);
 			smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
 		}
 		return;
@@ -1929,7 +1966,7 @@ static void smc_llc_event_handler(struct smc_llc_qentry *qentry)
 		/* new request from remote, assign to remote flow */
 		if (smc_llc_flow_start(&lgr->llc_flow_rmt, qentry)) {
 			/* process here, does not wait for more llc msgs */
-			smc_llc_rmt_delete_rkey(lgr);
+			smc_llc_rmt_delete_rkey(lgr, qentry);
 			smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
 		}
 		return;
diff --git a/net/smc/smc_llc.h b/net/smc/smc_llc.h
index 4404e52..005a81e 100644
--- a/net/smc/smc_llc.h
+++ b/net/smc/smc_llc.h
@@ -48,6 +48,12 @@ enum smc_llc_msg_type {
 #define smc_link_downing(state) \
 	(cmpxchg(state, SMC_LNK_ACTIVE, SMC_LNK_INACTIVE) == SMC_LNK_ACTIVE)
 
+#define SMC_IS_NONE_FLOW(type)		\
+	((type) == SMC_LLC_FLOW_NONE)
+
+#define SMC_IS_PARALLEL_FLOW(type)	\
+	(((type) == SMC_LLC_FLOW_RKEY) || SMC_IS_NONE_FLOW(type))
+
 /* LLC DELETE LINK Request Reason Codes */
 #define SMC_LLC_DEL_LOST_PATH		0x00010000
 #define SMC_LLC_DEL_OP_INIT_TERM	0x00020000
-- 
1.8.3.1



* [PATCH net-next 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (3 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D. Wythe
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

llc_conf_mutex is used to protect the links and link-related
configurations within a link group, for example adding or deleting
links. However, in most cases the protected critical section has only
read semantics and no write semantics at all, such as when obtaining
a usable link or an available rmb_desc.

This patch does a simple refactoring: it replaces the mutex with an
rw_semaphore, mutex_lock with down_write, and mutex_unlock with
up_write.

Theoretically, this replacement is equivalent, but after this patch
we can choose the lock granularity according to the semantics of each
critical section.

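A sketch of the intended end state; the down_read() side is an
assumption here and only arrives with the later patches of this series:

    /* write side: link add/delete still excludes all other users */
    down_write(&lgr->llc_conf_mutex);
    /* ... reconfigure links ... */
    up_write(&lgr->llc_conf_mutex);

    /* read side (later patches): read-mostly paths run concurrently */
    down_read(&lgr->llc_conf_mutex);
    link = smc_llc_usable_link(lgr);
    up_read(&lgr->llc_conf_mutex);
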
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c   |  8 ++++----
 net/smc/smc_core.c | 20 ++++++++++----------
 net/smc/smc_core.h |  2 +-
 net/smc/smc_llc.c  | 18 +++++++++---------
 4 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index c0842a9..51b90e2 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -498,7 +498,7 @@ static int smcr_lgr_reg_sndbufs(struct smc_link *link,
 		return -EINVAL;
 
 	/* protect against parallel smcr_link_reg_buf() */
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		if (!smc_link_active(&lgr->lnk[i]))
 			continue;
@@ -506,7 +506,7 @@ static int smcr_lgr_reg_sndbufs(struct smc_link *link,
 		if (rc)
 			break;
 	}
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 	return rc;
 }
 
@@ -523,7 +523,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 	/* protect against parallel smc_llc_cli_rkey_exchange() and
 	 * parallel smcr_link_reg_buf()
 	 */
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		if (!smc_link_active(&lgr->lnk[i]))
 			continue;
@@ -540,7 +540,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 	}
 	rmb_desc->is_conf_rkey = true;
 out:
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 	smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
 	return rc;
 }
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 61a3854..fc8deec 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1388,10 +1388,10 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 		rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
 		if (!rc) {
 			/* protect against smc_llc_cli_rkey_exchange() */
-			mutex_lock(&lgr->llc_conf_mutex);
+			down_write(&lgr->llc_conf_mutex);
 			smc_llc_do_delete_rkey(lgr, buf_desc);
 			buf_desc->is_conf_rkey = false;
-			mutex_unlock(&lgr->llc_conf_mutex);
+			up_write(&lgr->llc_conf_mutex);
 			smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
 		}
 	}
@@ -1661,12 +1661,12 @@ static void smc_lgr_free(struct smc_link_group *lgr)
 	int i;
 
 	if (!lgr->is_smcd) {
-		mutex_lock(&lgr->llc_conf_mutex);
+		down_write(&lgr->llc_conf_mutex);
 		for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 			if (lgr->lnk[i].state != SMC_LNK_UNUSED)
 				smcr_link_clear(&lgr->lnk[i], false);
 		}
-		mutex_unlock(&lgr->llc_conf_mutex);
+		up_write(&lgr->llc_conf_mutex);
 		smc_llc_lgr_clear(lgr);
 	}
 
@@ -1980,12 +1980,12 @@ static void smcr_link_down(struct smc_link *lnk)
 	} else {
 		if (lgr->llc_flow_lcl.type != SMC_LLC_FLOW_NONE) {
 			/* another llc task is ongoing */
-			mutex_unlock(&lgr->llc_conf_mutex);
+			up_write(&lgr->llc_conf_mutex);
 			wait_event_timeout(lgr->llc_flow_waiter,
 				(list_empty(&lgr->list) ||
 				 lgr->llc_flow_lcl.type == SMC_LLC_FLOW_NONE),
 				SMC_LLC_WAIT_TIME);
-			mutex_lock(&lgr->llc_conf_mutex);
+			down_write(&lgr->llc_conf_mutex);
 		}
 		if (!list_empty(&lgr->list)) {
 			smc_llc_send_delete_link(to_lnk, del_link_id,
@@ -2047,9 +2047,9 @@ static void smc_link_down_work(struct work_struct *work)
 	if (list_empty(&lgr->list))
 		return;
 	wake_up_all(&lgr->llc_msg_waiter);
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	smcr_link_down(link);
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 }
 
 static int smc_vlan_by_tcpsk_walk(struct net_device *lower_dev,
@@ -2581,7 +2581,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
 	int i, rc = 0;
 
 	/* protect against parallel link reconfiguration */
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		struct smc_link *lnk = &lgr->lnk[i];
 
@@ -2593,7 +2593,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
 		}
 	}
 out:
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 	return rc;
 }
 
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 8490676..559d330 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -346,7 +346,7 @@ struct smc_link_group {
 						/* queue for llc events */
 			spinlock_t		llc_event_q_lock;
 						/* protects llc_event_q */
-			struct mutex		llc_conf_mutex;
+			struct rw_semaphore	llc_conf_mutex;
 						/* protects lgr reconfig. */
 			struct work_struct	llc_add_link_work;
 			struct work_struct	llc_del_link_work;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 965a3cc..d744937 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1237,12 +1237,12 @@ static void smc_llc_process_cli_add_link(struct smc_link_group *lgr)
 
 	qentry = smc_llc_flow_qentry_clr(&lgr->llc_flow_lcl);
 
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	if (smc_llc_is_local_add_link(&qentry->msg))
 		smc_llc_cli_add_link_invite(qentry->link, qentry);
 	else
 		smc_llc_cli_add_link(qentry->link, qentry);
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 }
 
 static int smc_llc_active_link_count(struct smc_link_group *lgr)
@@ -1546,13 +1546,13 @@ static void smc_llc_process_srv_add_link(struct smc_link_group *lgr)
 
 	qentry = smc_llc_flow_qentry_clr(&lgr->llc_flow_lcl);
 
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	rc = smc_llc_srv_add_link(link, qentry);
 	if (!rc && lgr->type == SMC_LGR_SYMMETRIC) {
 		/* delete any asymmetric link */
 		smc_llc_delete_asym_link(lgr);
 	}
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 	kfree(qentry);
 }
 
@@ -1619,7 +1619,7 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
 		smc_lgr_terminate_sched(lgr);
 		goto out;
 	}
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	/* delete single link */
 	for (lnk_idx = 0; lnk_idx < SMC_LINKS_PER_LGR_MAX; lnk_idx++) {
 		if (lgr->lnk[lnk_idx].link_id != del_llc->link_num)
@@ -1655,7 +1655,7 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
 		smc_lgr_terminate_sched(lgr);
 	}
 out_unlock:
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 out:
 	kfree(qentry);
 }
@@ -1691,7 +1691,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
 	int active_links;
 	int i;
 
-	mutex_lock(&lgr->llc_conf_mutex);
+	down_write(&lgr->llc_conf_mutex);
 	qentry = smc_llc_flow_qentry_clr(&lgr->llc_flow_lcl);
 	lnk = qentry->link;
 	del_llc = &qentry->msg.delete_link;
@@ -1748,7 +1748,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
 		smc_llc_add_link_local(lnk);
 	}
 out:
-	mutex_unlock(&lgr->llc_conf_mutex);
+	up_write(&lgr->llc_conf_mutex);
 	kfree(qentry);
 }
 
@@ -2162,7 +2162,7 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	spin_lock_init(&lgr->llc_flow_lock);
 	init_waitqueue_head(&lgr->llc_flow_waiter);
 	init_waitqueue_head(&lgr->llc_msg_waiter);
-	mutex_init(&lgr->llc_conf_mutex);
+	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->ipv4.sysctl_tcp_keepalive_time);
 }
 
-- 
1.8.3.1



* [PATCH net-next 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse()
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (4 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-10 17:47 ` [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

The following is part of the off-CPU graph captured during frequent
short-lived SMC-R connection processing:

process_one_work				(51.19%)
smc_close_passive_work			(28.36%)
	smcr_buf_unuse				(28.34%)
	rwsem_down_write_slowpath		(28.22%)

smc_listen_work				(22.83%)
	smc_clc_wait_msg			(1.84%)
	smc_buf_create				(20.45%)
		smcr_buf_map_usable_links
		rwsem_down_write_slowpath	(20.43%)
	smcr_lgr_reg_rmbs			(0.53%)
		rwsem_down_write_slowpath	(0.43%)
		smc_llc_do_confirm_rkey		(0.08%)

We can clearly see that during connection establishment, connections
spend their waiting time not on IO but on llc_conf_mutex.

More importantly, the core critical sections (smcr_buf_unuse() &
smc_buf_create()) only perform read semantics on the links, so we can
easily protect them with the read side of the semaphore.
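
A minimal sketch of the intended effect, again using pthread_rwlock_t
as a user-space stand-in for the rw_semaphore (function names are
illustrative):

#include <pthread.h>
#include <unistd.h>

static pthread_rwlock_t llc_conf_lock = PTHREAD_RWLOCK_INITIALIZER;

/* analogue of smcr_buf_map_usable_links(): read-only walk of links */
static void *map_bufs(void *arg)
{
    pthread_rwlock_rdlock(&llc_conf_lock);
    usleep(1000);    /* many connections hold the read side at once */
    pthread_rwlock_unlock(&llc_conf_lock);
    return NULL;
}

/* analogue of link add/delete: still takes the exclusive write side */
static void *reconfig_links(void *arg)
{
    pthread_rwlock_wrlock(&llc_conf_lock);
    usleep(1000);    /* readers and other writers are all blocked */
    pthread_rwlock_unlock(&llc_conf_lock);
    return NULL;
}

int main(void)
{
    pthread_t r[8], w;
    int i;

    for (i = 0; i < 8; i++)
        pthread_create(&r[i], NULL, map_bufs, NULL);
    pthread_create(&w, NULL, reconfig_links, NULL);
    for (i = 0; i < 8; i++)
        pthread_join(r[i], NULL);
    pthread_join(w, NULL);
    return 0;
}

The readers overlap with each other instead of queueing, which is
exactly the time the rwsem_down_write_slowpath() entries above were
costing us.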

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/smc_core.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index fc8deec..113804d 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1388,10 +1388,10 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 		rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
 		if (!rc) {
 			/* protect against smc_llc_cli_rkey_exchange() */
-			down_write(&lgr->llc_conf_mutex);
+			down_read(&lgr->llc_conf_mutex);
 			smc_llc_do_delete_rkey(lgr, buf_desc);
 			buf_desc->is_conf_rkey = false;
-			up_write(&lgr->llc_conf_mutex);
+			up_read(&lgr->llc_conf_mutex);
 			smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
 		}
 	}
@@ -2581,7 +2581,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
 	int i, rc = 0;
 
 	/* protect against parallel link reconfiguration */
-	down_write(&lgr->llc_conf_mutex);
+	down_read(&lgr->llc_conf_mutex);
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		struct smc_link *lnk = &lgr->lnk[i];
 
@@ -2593,7 +2593,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
 		}
 	}
 out:
-	up_write(&lgr->llc_conf_mutex);
+	up_read(&lgr->llc_conf_mutex);
 	return rc;
 }
 
-- 
1.8.3.1



* [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (5 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-16  8:24   ` Tony Lu
  2022-08-10 17:47 ` [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when the assigned rmb_desc is not yet registered, although it
can run in parallel when the assigned rmb_desc is already registered,
since it then only performs read semantics on it. Hence, we cannot
simply replace the lock with a read semaphore.

The idea here is: if the assigned rmb_desc is already registered, use
the read semaphore to protect the critical section; if it is not yet
registered, keep using the write semaphore to preserve exclusivity.

Thanks to the reusability of rmb_desc, this allows us to run in
parallel in most cases.
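
The resulting pattern, sketched as a user-space analogue
(pthread_rwlock_t stands in for the rw_semaphore; the state and the
registration step are hypothetical simplifications):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t llc_conf_lock = PTHREAD_RWLOCK_INITIALIZER;
static bool registered;    /* stand-in for rmb_desc->is_reg_mr[] */

static void reg_rmbs(void)
{
    bool slow = false;

    pthread_rwlock_rdlock(&llc_conf_lock);
    if (!registered) {
        /* drop the read side and redo the work exclusively; the
         * state may change in the unlocked window, so the slow
         * path must recheck everything under the write side */
        pthread_rwlock_unlock(&llc_conf_lock);
        slow = true;
        pthread_rwlock_wrlock(&llc_conf_lock);
        registered = true;    /* the exclusive registration loop */
    }
    /* rkey exchange with the peer happens here, under either side */
    if (slow)
        pthread_rwlock_unlock(&llc_conf_lock);    /* kernel: up_write() */
    else
        pthread_rwlock_unlock(&llc_conf_lock);    /* kernel: up_read() */
}

int main(void)
{
    reg_rmbs();    /* first call takes the slow path */
    reg_rmbs();    /* later calls stay on the read side */
    printf("registered: %d\n", registered);
    return 0;
}

pthreads exposes a single unlock call, but the kernel must choose
between up_read() and up_write(), which is why the patch carries the
'slow' flag all the way to the unlock site.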

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 51b90e2..39dbf39 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -516,10 +516,25 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 {
 	struct smc_link_group *lgr = link->lgr;
 	int i, rc = 0;
+	bool slow = false;
 
 	rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
 	if (rc)
 		return rc;
+
+	down_read(&lgr->llc_conf_mutex);
+	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+		if (!smc_link_active(&lgr->lnk[i]))
+			continue;
+		if (!rmb_desc->is_reg_mr[link->link_idx]) {
+			up_read(&lgr->llc_conf_mutex);
+			goto slow_path;
+		}
+	}
+	/* mr register already */
+	goto fast_path;
+slow_path:
+	slow = true;
 	/* protect against parallel smc_llc_cli_rkey_exchange() and
 	 * parallel smcr_link_reg_buf()
 	 */
@@ -531,7 +546,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 		if (rc)
 			goto out;
 	}
-
+fast_path:
 	/* exchange confirm_rkey msg with peer */
 	rc = smc_llc_do_confirm_rkey(link, rmb_desc);
 	if (rc) {
@@ -540,7 +555,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
 	}
 	rmb_desc->is_conf_rkey = true;
 out:
-	up_write(&lgr->llc_conf_mutex);
+	slow ? up_write(&lgr->llc_conf_mutex) : up_read(&lgr->llc_conf_mutex);
 	smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
 	return rc;
 }
-- 
1.8.3.1



* [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (6 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-16  8:37   ` Tony Lu
  2022-08-10 17:47 ` [PATCH net-next 09/10] net/smc: fix potential panic due to unprotected smc_llc_srv_add_link() D. Wythe
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

It's clear that rmbs_lock and sndbufs_lock aim to protect the rmbs
list and the sndbufs list.

During conenction establieshment, smc_buf_get_slot() will always
be invoke, and it only performs read semantics in rmbs list and
sndbufs list.

Based on the above considerations, we replace the mutex with an
rw_semaphore. Only smc_buf_get_slot() uses down_read(), allowing it
to run concurrently; all other parts use down_write() to keep
exclusive semantics.
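
A sketch of why the read side is sufficient for smc_buf_get_slot(): the
list is only traversed, and a slot is claimed with an atomic
compare-and-swap, so two concurrent readers can never grab the same
buffer (user-space C11 analogue with illustrative names):

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_BUFS 16

struct buf_slot {
    atomic_int used;    /* stand-in for buf_desc->used */
};

static pthread_rwlock_t bufs_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct buf_slot bufs[NR_BUFS];

/* analogue of smc_buf_get_slot(): many threads may run this at once */
static struct buf_slot *get_slot(void)
{
    struct buf_slot *ret = NULL;
    int i, expected;

    pthread_rwlock_rdlock(&bufs_lock);
    for (i = 0; i < NR_BUFS; i++) {
        expected = 0;
        /* the cmpxchg lets exactly one reader flip used 0 -> 1 */
        if (atomic_compare_exchange_strong(&bufs[i].used,
                                           &expected, 1)) {
            ret = &bufs[i];
            break;
        }
    }
    pthread_rwlock_unlock(&bufs_lock);
    return ret;
}

int main(void)
{
    return get_slot() ? 0 : 1;
}

List mutation (the kernel's list_add()/list_del() call sites) still
takes the write side, so a reader can never observe a half-linked
entry.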

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/smc_core.c | 55 +++++++++++++++++++++++++++---------------------------
 net/smc/smc_core.h |  4 ++--
 net/smc/smc_llc.c  | 16 ++++++++--------
 3 files changed, 38 insertions(+), 37 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 113804d..b90970a 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1138,8 +1138,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
 	lgr->freeing = 0;
 	lgr->vlan_id = ini->vlan_id;
 	refcount_set(&lgr->refcnt, 1); /* set lgr refcnt to 1 */
-	mutex_init(&lgr->sndbufs_lock);
-	mutex_init(&lgr->rmbs_lock);
+	init_rwsem(&lgr->sndbufs_lock);
+	init_rwsem(&lgr->rmbs_lock);
 	rwlock_init(&lgr->conns_lock);
 	for (i = 0; i < SMC_RMBE_SIZES; i++) {
 		INIT_LIST_HEAD(&lgr->sndbufs[i]);
@@ -1380,7 +1380,7 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
 static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 			   struct smc_link_group *lgr)
 {
-	struct mutex *lock;	/* lock buffer list */
+	struct rw_semaphore *lock;	/* lock buffer list */
 	int rc;
 
 	if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
@@ -1400,9 +1400,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 		/* buf registration failed, reuse not possible */
 		lock = is_rmb ? &lgr->rmbs_lock :
 				&lgr->sndbufs_lock;
-		mutex_lock(lock);
+		down_write(lock);
 		list_del(&buf_desc->list);
-		mutex_unlock(lock);
+		up_write(lock);
 
 		smc_buf_free(lgr, is_rmb, buf_desc);
 	} else {
@@ -1506,15 +1506,16 @@ static void smcr_buf_unmap_lgr(struct smc_link *lnk)
 	int i;
 
 	for (i = 0; i < SMC_RMBE_SIZES; i++) {
-		mutex_lock(&lgr->rmbs_lock);
+		down_write(&lgr->rmbs_lock);
 		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list)
 			smcr_buf_unmap_link(buf_desc, true, lnk);
-		mutex_unlock(&lgr->rmbs_lock);
-		mutex_lock(&lgr->sndbufs_lock);
+		up_write(&lgr->rmbs_lock);
+
+		down_write(&lgr->sndbufs_lock);
 		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i],
 					 list)
 			smcr_buf_unmap_link(buf_desc, false, lnk);
-		mutex_unlock(&lgr->sndbufs_lock);
+		up_write(&lgr->sndbufs_lock);
 	}
 }
 
@@ -2324,19 +2325,19 @@ int smc_uncompress_bufsize(u8 compressed)
  * buffer size; if not available, return NULL
  */
 static struct smc_buf_desc *smc_buf_get_slot(int compressed_bufsize,
-					     struct mutex *lock,
+					     struct rw_semaphore *lock,
 					     struct list_head *buf_list)
 {
 	struct smc_buf_desc *buf_slot;
 
-	mutex_lock(lock);
+	down_read(lock);
 	list_for_each_entry(buf_slot, buf_list, list) {
 		if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
-			mutex_unlock(lock);
+			up_read(lock);
 			return buf_slot;
 		}
 	}
-	mutex_unlock(lock);
+	up_read(lock);
 	return NULL;
 }
 
@@ -2445,13 +2446,13 @@ int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
 	return 0;
 }
 
-static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
+static int _smcr_buf_map_lgr(struct smc_link *lnk, struct rw_semaphore *lock,
 			     struct list_head *lst, bool is_rmb)
 {
 	struct smc_buf_desc *buf_desc, *bf;
 	int rc = 0;
 
-	mutex_lock(lock);
+	down_write(lock);
 	list_for_each_entry_safe(buf_desc, bf, lst, list) {
 		if (!buf_desc->used)
 			continue;
@@ -2460,7 +2461,7 @@ static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
 			goto out;
 	}
 out:
-	mutex_unlock(lock);
+	up_write(lock);
 	return rc;
 }
 
@@ -2493,37 +2494,37 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
 	int i, rc = 0;
 
 	/* reg all RMBs for a new link */
-	mutex_lock(&lgr->rmbs_lock);
+	down_write(&lgr->rmbs_lock);
 	for (i = 0; i < SMC_RMBE_SIZES; i++) {
 		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
 			if (!buf_desc->used)
 				continue;
 			rc = smcr_link_reg_buf(lnk, buf_desc);
 			if (rc) {
-				mutex_unlock(&lgr->rmbs_lock);
+				up_write(&lgr->rmbs_lock);
 				return rc;
 			}
 		}
 	}
-	mutex_unlock(&lgr->rmbs_lock);
+	up_write(&lgr->rmbs_lock);
 
 	if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
 		return rc;
 
 	/* reg all vzalloced sndbufs for a new link */
-	mutex_lock(&lgr->sndbufs_lock);
+	down_write(&lgr->sndbufs_lock);
 	for (i = 0; i < SMC_RMBE_SIZES; i++) {
 		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
 			if (!buf_desc->used || !buf_desc->is_vm)
 				continue;
 			rc = smcr_link_reg_buf(lnk, buf_desc);
 			if (rc) {
-				mutex_unlock(&lgr->sndbufs_lock);
+				up_write(&lgr->sndbufs_lock);
 				return rc;
 			}
 		}
 	}
-	mutex_unlock(&lgr->sndbufs_lock);
+	up_write(&lgr->sndbufs_lock);
 	return rc;
 }
 
@@ -2641,7 +2642,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 	struct list_head *buf_list;
 	int bufsize, bufsize_short;
 	bool is_dgraded = false;
-	struct mutex *lock;	/* lock buffer list */
+	struct rw_semaphore *lock;	/* lock buffer list */
 	int sk_buf_size;
 
 	if (is_rmb)
@@ -2689,9 +2690,9 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
 		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
 		buf_desc->used = 1;
-		mutex_lock(lock);
+		down_write(lock);
 		list_add(&buf_desc->list, buf_list);
-		mutex_unlock(lock);
+		up_write(lock);
 		break; /* found */
 	}
 
@@ -2765,9 +2766,9 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
 	/* create rmb */
 	rc = __smc_buf_create(smc, is_smcd, true);
 	if (rc) {
-		mutex_lock(&smc->conn.lgr->sndbufs_lock);
+		down_write(&smc->conn.lgr->sndbufs_lock);
 		list_del(&smc->conn.sndbuf_desc->list);
-		mutex_unlock(&smc->conn.lgr->sndbufs_lock);
+		up_write(&smc->conn.lgr->sndbufs_lock);
 		smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc);
 		smc->conn.sndbuf_desc = NULL;
 	}
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 559d330..008148c 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -300,9 +300,9 @@ struct smc_link_group {
 	unsigned short		vlan_id;	/* vlan id of link group */
 
 	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
-	struct mutex		sndbufs_lock;	/* protects tx buffers */
+	struct rw_semaphore	sndbufs_lock;	/* protects tx buffers */
 	struct list_head	rmbs[SMC_RMBE_SIZES];	/* rx buffers */
-	struct mutex		rmbs_lock;	/* protects rx buffers */
+	struct rw_semaphore	rmbs_lock;	/* protects rx buffers */
 
 	u8			id[SMC_LGR_ID_SIZE];	/* unique lgr id */
 	struct delayed_work	free_work;	/* delayed freeing of an lgr */
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index d744937..76f9906 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -642,7 +642,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
 
 	prim_lnk_idx = link->link_idx;
 	lnk_idx = link_new->link_idx;
-	mutex_lock(&lgr->rmbs_lock);
+	down_write(&lgr->rmbs_lock);
 	ext->num_rkeys = lgr->conns_num;
 	if (!ext->num_rkeys)
 		goto out;
@@ -662,7 +662,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
 	}
 	len += i * sizeof(ext->rt[0]);
 out:
-	mutex_unlock(&lgr->rmbs_lock);
+	up_write(&lgr->rmbs_lock);
 	return len;
 }
 
@@ -923,7 +923,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
 	int rc = 0;
 	int i;
 
-	mutex_lock(&lgr->rmbs_lock);
+	down_write(&lgr->rmbs_lock);
 	num_rkeys_send = lgr->conns_num;
 	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
 	do {
@@ -950,7 +950,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
 			break;
 	} while (num_rkeys_send || num_rkeys_recv);
 
-	mutex_unlock(&lgr->rmbs_lock);
+	up_write(&lgr->rmbs_lock);
 	return rc;
 }
 
@@ -1033,14 +1033,14 @@ static void smc_llc_save_add_link_rkeys(struct smc_link *link,
 	ext = (struct smc_llc_msg_add_link_v2_ext *)((u8 *)lgr->wr_rx_buf_v2 +
 						     SMC_WR_TX_SIZE);
 	max = min_t(u8, ext->num_rkeys, SMC_LLC_RKEYS_PER_MSG_V2);
-	mutex_lock(&lgr->rmbs_lock);
+	down_write(&lgr->rmbs_lock);
 	for (i = 0; i < max; i++) {
 		smc_rtoken_set(lgr, link->link_idx, link_new->link_idx,
 			       ext->rt[i].rmb_key,
 			       ext->rt[i].rmb_vaddr_new,
 			       ext->rt[i].rmb_key_new);
 	}
-	mutex_unlock(&lgr->rmbs_lock);
+	up_write(&lgr->rmbs_lock);
 }
 
 static void smc_llc_save_add_link_info(struct smc_link *link,
@@ -1349,7 +1349,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
 	int rc = 0;
 	int i;
 
-	mutex_lock(&lgr->rmbs_lock);
+	down_write(&lgr->rmbs_lock);
 	num_rkeys_send = lgr->conns_num;
 	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
 	do {
@@ -1374,7 +1374,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
 		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
 	} while (num_rkeys_send || num_rkeys_recv);
 out:
-	mutex_unlock(&lgr->rmbs_lock);
+	up_write(&lgr->rmbs_lock);
 	return rc;
 }
 
-- 
1.8.3.1



* [PATCH net-next 09/10] net/smc: fix potential panic due to unprotected smc_llc_srv_add_link()
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (7 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-16  8:28   ` Tony Lu
  2022-08-10 17:47 ` [PATCH net-next 10/10] net/smc: fix application data exception D. Wythe
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

After we optimize the parallel capability of SMC-R connection
establishment, there is a certain chance to trigger the
following panic:

PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
 #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
 #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
 #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
 #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
 #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
 #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
 #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
    [exception RIP: ib_alloc_mr+19]
    RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
    RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
    R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
 #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
 #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]

The reason here is that when the server tries to create a second link,
smc_llc_srv_add_link() runs without protection and may add a new link
to the link group. This breaks the exclusivity that llc_conf_mutex is
supposed to provide.
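
A sketch of the race the write lock closes (user-space analogue; the
struct and its fields are illustrative simplifications of the link
state that the buffer-mapping path consumes):

#include <pthread.h>
#include <stddef.h>

static pthread_rwlock_t llc_conf_lock = PTHREAD_RWLOCK_INITIALIZER;

struct link {
    int active;
    void *pd;    /* stand-in for the IB resources ib_alloc_mr() needs */
};

static struct link lnk;

/* reader: maps buffers against links it believes are fully set up */
static void *map_bufs(void *arg)
{
    pthread_rwlock_rdlock(&llc_conf_lock);
    if (lnk.active && lnk.pd) {
        /* would dereference lnk.pd here; crashes if a link can be
         * observed active before its resources are assigned */
    }
    pthread_rwlock_unlock(&llc_conf_lock);
    return NULL;
}

/* the fix: link setup takes the write side, so readers never see a
 * half-initialized link */
static void add_link(void)
{
    pthread_rwlock_wrlock(&llc_conf_lock);
    lnk.pd = &lnk;    /* fully initialize ... */
    lnk.active = 1;   /* ... before any reader can observe the link */
    pthread_rwlock_unlock(&llc_conf_lock);
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, map_bufs, NULL);
    add_link();
    pthread_join(t, NULL);
    return 0;
}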

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 39dbf39..0b0c53a 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1834,8 +1834,10 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc)
 	smc_llc_link_active(link);
 	smcr_lgr_set_type(link->lgr, SMC_LGR_SINGLE);
 
+	down_write(&link->lgr->llc_conf_mutex);
 	/* initial contact - try to establish second link */
 	smc_llc_srv_add_link(link, NULL);
+	up_write(&link->lgr->llc_conf_mutex);
 	return 0;
 }
 
-- 
1.8.3.1



* [PATCH net-next 10/10] net/smc: fix application data exception
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (8 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 09/10] net/smc: fix potential panic due to unprotected smc_llc_srv_add_link() D. Wythe
@ 2022-08-10 17:47 ` D. Wythe
  2022-08-11  3:28 ` [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections Jakub Kicinski
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-10 17:47 UTC (permalink / raw)
  To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

After we optimize the parallel capability of SMC-R connection
establishment, there is a certain probability that the following
exception occurs in the wrk benchmark test:

Running 10s test @ http://11.213.45.6:80
  8 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.72ms   13.94ms 245.33ms   94.17%
    Req/Sec     1.96k   713.67     5.41k    75.16%
  155262 requests in 10.10s, 23.10MB read
Non-2xx or 3xx responses: 3

The failures are HTTP 400 errors, a serious exception in our test:
it means the application data was corrupted.

Consider the following scenarios:

CPU0                            CPU1

buf_desc->used = 0;
                                cmpxchg(buf_desc->used, 0, 1)
                                deal_with(buf_desc)

memset(buf_desc->cpu_addr,0);

This will cause the data received by a victim connection to be cleared,
thus triggering an HTTP 400 error in the server.

This patch swaps the order of clearing 'used' and zeroing the buffer,
and adds a barrier to ensure memory ordering.
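
A user-space C11 sketch of both orderings (atomic_store_explicit() with
release ordering plays the role of the kernel's memzero_explicit() +
WRITE_ONCE() pair; the struct is illustrative):

#include <stdatomic.h>
#include <string.h>

struct buf {
    atomic_int used;    /* claimed by cmpxchg(used, 0, 1) on reuse */
    char data[64];
};

/* buggy order: once used is 0, another CPU can claim the buffer and
 * start filling it -- the memset below then wipes live data */
static void buf_unuse_buggy(struct buf *b)
{
    atomic_store(&b->used, 0);
    memset(b->data, 0, sizeof(b->data));
}

/* fixed order: clear first, then publish; release ordering makes sure
 * the claimer observes a fully zeroed buffer */
static void buf_unuse_fixed(struct buf *b)
{
    memset(b->data, 0, sizeof(b->data));
    atomic_store_explicit(&b->used, 0, memory_order_release);
}

int main(void)
{
    static struct buf b;

    buf_unuse_buggy(&b);    /* the pre-patch interleaving hazard */
    buf_unuse_fixed(&b);    /* the post-patch ordering */
    return 0;
}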

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/smc_core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index b90970a..7d42125 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1406,8 +1406,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
 
 		smc_buf_free(lgr, is_rmb, buf_desc);
 	} else {
-		buf_desc->used = 0;
-		memset(buf_desc->cpu_addr, 0, buf_desc->len);
+		/* memzero_explicit provides potential memory barrier semantics */
+		memzero_explicit(buf_desc->cpu_addr, buf_desc->len);
+		WRITE_ONCE(buf_desc->used, 0);
 	}
 }
 
-- 
1.8.3.1



* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (9 preceding siblings ...)
  2022-08-10 17:47 ` [PATCH net-next 10/10] net/smc: fix application data exception D. Wythe
@ 2022-08-11  3:28 ` Jakub Kicinski
  2022-08-11  5:13   ` Tony Lu
  2022-08-11 12:31 ` Karsten Graul
  2022-08-16  9:35 ` Jan Karcher
  12 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2022-08-11  3:28 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, davem, netdev, linux-s390, linux-rdma

On Thu, 11 Aug 2022 01:47:31 +0800 D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch set attempts to optimize the parallelism of SMC-R connections,
> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> occur after thoses optimization.

net-next is closed until Monday, please see the FAQ.

Also Al Viro complained about the SMC ULP:

https://lore.kernel.org/all/YutBc9aCQOvPPlWN@ZenIV/

I didn't see any responses, what's the situation there?


* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
@ 2022-08-11  3:41   ` kernel test robot
  2022-08-11 11:51   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2022-08-11  3:41 UTC (permalink / raw)
  To: D. Wythe, kgraul, wenjia
  Cc: llvm, kbuild-all, kuba, davem, netdev, linux-s390, linux-rdma

Hi "D. Wythe",

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/D-Wythe/net-smc-optimize-the-parallelism-of-SMC-R-connections/20220811-014942
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git f86d1fbbe7858884d6754534a0afbb74fc30bc26
config: i386-randconfig-a013 (https://download.01.org/0day-ci/archive/20220811/202208111145.LdpP76au-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 5f1c7e2cc5a3c07cbc2412e851a7283c1841f520)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/2c1c2e644fb8dbce9b8a004e604792340cbfccb8
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review D-Wythe/net-smc-optimize-the-parallelism-of-SMC-R-connections/20220811-014942
        git checkout 2c1c2e644fb8dbce9b8a004e604792340cbfccb8
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash net/smc/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> net/smc/smc_core.c:95:30: warning: operator '?:' has lower precedence than '+'; '+' will be evaluated first [-Wparentheses]
                   + (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
   net/smc/smc_core.c:95:30: note: place parentheses around the '+' expression to silence this warning
                   + (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
                                              ^
                                             )
   net/smc/smc_core.c:95:30: note: place parentheses around the '?:' expression to evaluate it first
                   + (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
                                              ^
                     (                                          )
   net/smc/smc_core.c:104:29: warning: operator '?:' has lower precedence than '+'; '+' will be evaluated first [-Wparentheses]
                   + (key->role == SMC_SERV) ? 0 : key->clcqpn;
                   ~~~~~~~~~~~~~~~~~~~~~~~~~ ^
   net/smc/smc_core.c:104:29: note: place parentheses around the '+' expression to silence this warning
                   + (key->role == SMC_SERV) ? 0 : key->clcqpn;
                                             ^
                                            )
   net/smc/smc_core.c:104:29: note: place parentheses around the '?:' expression to evaluate it first
                   + (key->role == SMC_SERV) ? 0 : key->clcqpn;
                                             ^
                     (                                        )
   2 warnings generated.


vim +95 net/smc/smc_core.c

    88	
    89	/* SMC-R lnk cluster hash func */
    90	static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
    91	{
    92		const struct smc_lnk_cluster *lnkc = data;
    93	
    94		return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
  > 95			+ (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
    96	}
    97	
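
For reference, a standalone demo of the trap being flagged: '?:' binds
more loosely than '+', so the whole sum becomes the condition. The
presumable fix is to parenthesize the conditional expression:

#include <stdio.h>

int main(void)
{
    int hash = 100, is_serv = 1, clcqpn = 42;

    /* parses as (hash + (is_serv == 1)) ? 0 : clcqpn  ==>  0
     * (compilers emit the same -Wparentheses warning as above) */
    int wrong = hash + (is_serv == 1) ? 0 : clcqpn;

    /* intended: hash plus either 0 or clcqpn           ==>  100 */
    int right = hash + ((is_serv == 1) ? 0 : clcqpn);

    printf("wrong=%d right=%d\n", wrong, right);
    return 0;
}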

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-11  3:28 ` [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections Jakub Kicinski
@ 2022-08-11  5:13   ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-11  5:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: D. Wythe, kgraul, wenjia, davem, netdev, linux-s390, linux-rdma

On Wed, Aug 10, 2022 at 08:28:45PM -0700, Jakub Kicinski wrote:
> On Thu, 11 Aug 2022 01:47:31 +0800 D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > 
> > This patch set attempts to optimize the parallelism of SMC-R connections,
> > mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> > occur after thoses optimization.
> 
> net-next is closed until Monday, please see the FAQ.
> 
> Also Al Viro complained about the SMC ULP:
> 
> https://lore.kernel.org/all/YutBc9aCQOvPPlWN@ZenIV/
> 
> I didn't see any responses, what the situation there?

Sorry for the late reply. I am working on it and will give out the
details as soon as possible.

Tony Lu


* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
  2022-08-11  3:41   ` kernel test robot
@ 2022-08-11 11:51   ` kernel test robot
  2022-08-16  9:43   ` Jan Karcher
  2022-08-16 12:52   ` Tony Lu
  3 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2022-08-11 11:51 UTC (permalink / raw)
  To: D. Wythe, kgraul, wenjia
  Cc: kbuild-all, kuba, davem, netdev, linux-s390, linux-rdma

Hi "D. Wythe",

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/D-Wythe/net-smc-optimize-the-parallelism-of-SMC-R-connections/20220811-014942
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git f86d1fbbe7858884d6754534a0afbb74fc30bc26
config: x86_64-randconfig-s021 (https://download.01.org/0day-ci/archive/20220811/202208111933.9PvuHltH-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/2c1c2e644fb8dbce9b8a004e604792340cbfccb8
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review D-Wythe/net-smc-optimize-the-parallelism-of-SMC-R-connections/20220811-014942
        git checkout 2c1c2e644fb8dbce9b8a004e604792340cbfccb8
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

sparse warnings: (new ones prefixed by >>)
>> net/smc/smc_core.c:49:24: sparse: sparse: symbol 'smc_lgr_manager' was not declared. Should it be static?

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (10 preceding siblings ...)
  2022-08-11  3:28 ` [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections Jakub Kicinski
@ 2022-08-11 12:31 ` Karsten Graul
  2022-08-16  9:35 ` Jan Karcher
  12 siblings, 0 replies; 29+ messages in thread
From: Karsten Graul @ 2022-08-11 12:31 UTC (permalink / raw)
  To: D. Wythe, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

On 10/08/2022 19:47, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch set attempts to optimize the parallelism of SMC-R connections,
> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> occur after thoses optimization.

These are very interesting changes. Please allow us to review and test on 
the s390 architecture. Thank you for this submission!


* Re: [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
  2022-08-10 17:47 ` [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
@ 2022-08-16  7:58   ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16  7:58 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Aug 11, 2022 at 01:47:33AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> As commit "net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause
> by server" mentioned, it works only when all connection creations are

This is a format issue, it's better to use:

commit 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB
error cause by server").

> completely protected by the smc_server_lgr_pending lock. Since we have
> already removed that lock, we need to fix the issue again.
> 
> Fixes: 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server")
> 
^^^ This blank line is unnecessary.

> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/af_smc.c   |  2 ++
>  net/smc/smc_core.c | 11 ++++++++---
>  net/smc/smc_core.h | 21 +++++++++++++++++++++
>  3 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index af4b0aa..c0842a9 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -2413,6 +2413,7 @@ static void smc_listen_work(struct work_struct *work)
>  		if (rc)
>  			goto out_unlock;
>  	}
> +	smc_conn_leave_rtoken_pending(new_smc, ini);
>  	smc_conn_save_peer_info(new_smc, cclc);
>  	smc_listen_out_connected(new_smc);
>  	SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
> @@ -2422,6 +2423,7 @@ static void smc_listen_work(struct work_struct *work)
>  	if (ini->is_smcd)
>  		mutex_unlock(&smc_server_lgr_pending);
>  out_decl:
> +	smc_conn_leave_rtoken_pending(new_smc, ini);
>  	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
>  			   proposal_version);
>  out_free:
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index a3338cc..61a3854 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -2190,14 +2190,19 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  		     lgr->vlan_id == ini->vlan_id) &&
>  		    (role == SMC_CLNT || ini->is_smcd ||
>  		    (lgr->conns_num < SMC_RMBS_PER_LGR_MAX &&
> -		      !bitmap_full(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)))) {
> +		    (SMC_RMBS_PER_LGR_MAX -
> +			bitmap_weight(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)
> +				> atomic_read(&lgr->rtoken_pendings))))) {
>  			/* link group found */
>  			ini->first_contact_local = 0;
>  			conn->lgr = lgr;
>  			rc = smc_lgr_register_conn(conn, false);
>  			write_unlock_bh(&lgr->conns_lock);
> -			if (!rc && delayed_work_pending(&lgr->free_work))
> -				cancel_delayed_work(&lgr->free_work);
> +			if (!rc) {
> +				smc_conn_enter_rtoken_pending(smc, ini);
> +				if (delayed_work_pending(&lgr->free_work))
> +					cancel_delayed_work(&lgr->free_work);
> +			}
>  			break;
>  		}
>  		write_unlock_bh(&lgr->conns_lock);
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 199f533..acc2869 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -293,6 +293,9 @@ struct smc_link_group {
>  	struct rb_root		conns_all;	/* connection tree */
>  	rwlock_t		conns_lock;	/* protects conns_all */
>  	unsigned int		conns_num;	/* current # of connections */
> +	atomic_t		rtoken_pendings;/* number of connection that
> +						 * lgr assigned but no rtoken got yet
> +						 */
>  	unsigned short		vlan_id;	/* vlan id of link group */
>  
>  	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
> @@ -603,6 +606,24 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
>  int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
>  
> +static inline void smc_conn_enter_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
> +{
> +	struct smc_link_group *lgr;

Consider this: struct smc_link_group *lgr = smc->conn.lgr ?

> +
> +	lgr = smc->conn.lgr;
> +	if (lgr && !ini->first_contact_local)
> +		atomic_inc(&lgr->rtoken_pendings);
> +}
> +
> +static inline void smc_conn_leave_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
> +{
> +	struct smc_link_group *lgr;

Ditto.

> +
> +	lgr = smc->conn.lgr;
> +	if (lgr && !ini->first_contact_local)
> +		atomic_dec(&lgr->rtoken_pendings);
> +}
> +
>  void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
>  
>  static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
> -- 
> 1.8.3.1


* Re: [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex
  2022-08-10 17:47 ` [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
@ 2022-08-16  8:17   ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16  8:17 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Aug 11, 2022 at 01:47:34AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> We know that all flows except confirm_rkey and delete_rkey are exclusive;
> confirm/delete rkey flows can run concurrently (local and remote).
> 
> Although the protocol allows it, all flows are actually mutually exclusive
> in the implementation, because waiting for LLC messages is serialized.
> 
> This increases the time to establish or destroy SMC-R connections;
> connections have to queue in smc_llc_wait.
> 
> We use rtokens or rkey to correlate a confirm/delete rkey message
> with its response.
> 
> Before sending a message, we put a context holding the rtokens or rkey
> onto a wait queue. When a response message is received, we wake up the
> context whose rtokens or rkey match the response message.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/smc_llc.c | 174 +++++++++++++++++++++++++++++++++++++++++-------------
>  net/smc/smc_wr.c  |  10 ----
>  net/smc/smc_wr.h  |  10 ++++
>  3 files changed, 143 insertions(+), 51 deletions(-)
> 
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index 8134c15..b026df2 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -200,6 +200,7 @@ struct smc_llc_msg_delete_rkey_v2 {	/* type 0x29 */
>  struct smc_llc_qentry {
>  	struct list_head list;
>  	struct smc_link *link;
> +	void *private;
>  	union smc_llc_msg msg;
>  };
>  
> @@ -479,19 +480,17 @@ int smc_llc_send_confirm_link(struct smc_link *link,
>  	return rc;
>  }
>  
> -/* send LLC confirm rkey request */
> -static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
> -				     struct smc_buf_desc *rmb_desc)
> +/* build LLC confirm rkey request */
> +static int smc_llc_build_confirm_rkey_request(struct smc_link *send_link,
> +					      struct smc_buf_desc *rmb_desc,
> +					      struct smc_wr_tx_pend_priv **priv)
>  {
>  	struct smc_llc_msg_confirm_rkey *rkeyllc;
> -	struct smc_wr_tx_pend_priv *pend;
>  	struct smc_wr_buf *wr_buf;
>  	struct smc_link *link;
>  	int i, rc, rtok_ix;
>  
> -	if (!smc_wr_tx_link_hold(send_link))
> -		return -ENOLINK;
> -	rc = smc_llc_add_pending_send(send_link, &wr_buf, &pend);
> +	rc = smc_llc_add_pending_send(send_link, &wr_buf, priv);
>  	if (rc)
>  		goto put_out;
>  	rkeyllc = (struct smc_llc_msg_confirm_rkey *)wr_buf;
> @@ -521,25 +520,20 @@ static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
>  		cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
>  		cpu_to_be64((u64)sg_dma_address
>  			    (rmb_desc->sgt[send_link->link_idx].sgl));
> -	/* send llc message */
> -	rc = smc_wr_tx_send(send_link, pend);
>  put_out:
> -	smc_wr_tx_link_put(send_link);
>  	return rc;
>  }
>  
> -/* send LLC delete rkey request */
> -static int smc_llc_send_delete_rkey(struct smc_link *link,
> -				    struct smc_buf_desc *rmb_desc)
> +/* build LLC delete rkey request */
> +static int smc_llc_build_delete_rkey_request(struct smc_link *link,
> +					     struct smc_buf_desc *rmb_desc,
> +					     struct smc_wr_tx_pend_priv **priv)
>  {
>  	struct smc_llc_msg_delete_rkey *rkeyllc;
> -	struct smc_wr_tx_pend_priv *pend;
>  	struct smc_wr_buf *wr_buf;
>  	int rc;
>  
> -	if (!smc_wr_tx_link_hold(link))
> -		return -ENOLINK;
> -	rc = smc_llc_add_pending_send(link, &wr_buf, &pend);
> +	rc = smc_llc_add_pending_send(link, &wr_buf, priv);
>  	if (rc)
>  		goto put_out;
>  	rkeyllc = (struct smc_llc_msg_delete_rkey *)wr_buf;
> @@ -548,10 +542,7 @@ static int smc_llc_send_delete_rkey(struct smc_link *link,
>  	smc_llc_init_msg_hdr(&rkeyllc->hd, link->lgr, sizeof(*rkeyllc));
>  	rkeyllc->num_rkeys = 1;
>  	rkeyllc->rkey[0] = htonl(rmb_desc->mr[link->link_idx]->rkey);
> -	/* send llc message */
> -	rc = smc_wr_tx_send(link, pend);
>  put_out:
> -	smc_wr_tx_link_put(link);
>  	return rc;
>  }
>  
> @@ -2023,7 +2014,8 @@ static void smc_llc_rx_response(struct smc_link *link,
>  	case SMC_LLC_DELETE_RKEY:
>  		if (flowtype != SMC_LLC_FLOW_RKEY || flow->qentry)
>  			break;	/* drop out-of-flow response */
> -		goto assign;
> +		__wake_up(&link->lgr->llc_msg_waiter, TASK_NORMAL, 1, qentry);
> +		goto free;
>  	case SMC_LLC_CONFIRM_RKEY_CONT:
>  		/* not used because max links is 3 */
>  		break;
> @@ -2032,6 +2024,7 @@ static void smc_llc_rx_response(struct smc_link *link,
>  					   qentry->msg.raw.hdr.common.type);
>  		break;
>  	}
> +free:
>  	kfree(qentry);
>  	return;
>  assign:
> @@ -2191,25 +2184,98 @@ void smc_llc_link_clear(struct smc_link *link, bool log)
>  	cancel_delayed_work_sync(&link->llc_testlink_wrk);
>  }
>  
> +static int smc_llc_rkey_response_wake_function(struct wait_queue_entry *wq_entry,
> +					       unsigned int mode, int sync, void *key)
> +{
> +	struct smc_llc_qentry *except, *incoming;
> +	u8 except_llc_type;
> +
> +	/* not a rkey response */
> +	if (!key)
> +		return 0;
> +
> +	except = wq_entry->private;
> +	incoming = key;
> +
> +	except_llc_type = except->msg.raw.hdr.common.llc_type;
> +
> +	/* except LLC MSG TYPE mismatch */
> +	if (except_llc_type != incoming->msg.raw.hdr.common.llc_type)
> +		return 0;
> +
> +	switch (except_llc_type) {
> +	case SMC_LLC_CONFIRM_RKEY:
> +		if (memcmp(except->msg.confirm_rkey.rtoken, incoming->msg.confirm_rkey.rtoken,
> +			   sizeof(struct smc_rmb_rtoken) *
> +			   except->msg.confirm_rkey.rtoken[0].num_rkeys))
> +			return 0;
> +		break;
> +	case SMC_LLC_DELETE_RKEY:
> +		if (memcmp(except->msg.delete_rkey.rkey, incoming->msg.delete_rkey.rkey,
> +			   sizeof(__be32) * except->msg.delete_rkey.num_rkeys))
> +			return 0;
> +		break;
> +	default:
> +		panic("invalid except llc msg %d", except_llc_type);

Replace panic with pr_warn?

> +		return 0;
> +	}
> +
> +	/* match, save hdr */
> +	memcpy(&except->msg.raw.hdr, &incoming->msg.raw.hdr, sizeof(except->msg.raw.hdr));
> +
> +	wq_entry->private = except->private;
> +	return woken_wake_function(wq_entry, mode, sync, NULL);
> +}
> +
>  /* register a new rtoken at the remote peer (for all links) */
>  int smc_llc_do_confirm_rkey(struct smc_link *send_link,
>  			    struct smc_buf_desc *rmb_desc)
>  {
> +	long timeout = SMC_LLC_WAIT_TIME;

Reverse Christmas trees.

>  	struct smc_link_group *lgr = send_link->lgr;
> -	struct smc_llc_qentry *qentry = NULL;
> -	int rc = 0;
> +	struct smc_llc_qentry qentry;
> +	struct smc_wr_tx_pend *pend;
> +	struct smc_wr_tx_pend_priv *priv;
> +	DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
> +	int rc = 0, flags;

Ditto.

>  
> -	rc = smc_llc_send_confirm_rkey(send_link, rmb_desc);
> +	if (!smc_wr_tx_link_hold(send_link))
> +		return -ENOLINK;
> +
> +	rc = smc_llc_build_confirm_rkey_request(send_link, rmb_desc, &priv);
>  	if (rc)
>  		goto out;
> -	/* receive CONFIRM RKEY response from server over RoCE fabric */
> -	qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
> -			      SMC_LLC_CONFIRM_RKEY);
> -	if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
> +
> +	pend = container_of(priv, struct smc_wr_tx_pend, priv);
> +	/* make a copy of send msg */
> +	memcpy(&qentry.msg.confirm_rkey, send_link->wr_tx_bufs[pend->idx].raw,
> +	       sizeof(qentry.msg.confirm_rkey));
> +
> +	qentry.private = wait.private;
> +	wait.private = &qentry;
> +
> +	add_wait_queue(&lgr->llc_msg_waiter, &wait);
> +
> +	/* send llc message */
> +	rc = smc_wr_tx_send(send_link, priv);
> +	smc_wr_tx_link_put(send_link);
> +	if (rc) {
> +		remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +		goto out;
> +	}
> +
> +	while (!signal_pending(current) && timeout) {
> +		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
> +		if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
> +			break;
> +	}
> +
> +	remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +	flags = qentry.msg.raw.hdr.flags;
> +
> +	if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
>  		rc = -EFAULT;
>  out:
> -	if (qentry)
> -		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	return rc;
>  }
>  
> @@ -2217,26 +2283,52 @@ int smc_llc_do_confirm_rkey(struct smc_link *send_link,
>  int smc_llc_do_delete_rkey(struct smc_link_group *lgr,
>  			   struct smc_buf_desc *rmb_desc)
>  {
> -	struct smc_llc_qentry *qentry = NULL;
> +	long timeout = SMC_LLC_WAIT_TIME;
> +	struct smc_llc_qentry qentry;
> +	struct smc_wr_tx_pend *pend;
>  	struct smc_link *send_link;
> -	int rc = 0;
> +	struct smc_wr_tx_pend_priv *priv;

Reverse Christmas trees.

> +	DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
> +	int rc = 0, flags;
>  
>  	send_link = smc_llc_usable_link(lgr);
> -	if (!send_link)
> +	if (!send_link || !smc_wr_tx_link_hold(send_link))
>  		return -ENOLINK;
>  
> -	/* protected by llc_flow control */
> -	rc = smc_llc_send_delete_rkey(send_link, rmb_desc);
> +	rc = smc_llc_build_delete_rkey_request(send_link, rmb_desc, &priv);
>  	if (rc)
>  		goto out;
> -	/* receive DELETE RKEY response from server over RoCE fabric */
> -	qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
> -			      SMC_LLC_DELETE_RKEY);
> -	if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
> +
> +	pend = container_of(priv, struct smc_wr_tx_pend, priv);
> +	/* make a copy of send msg */
> +	memcpy(&qentry.msg.delete_link, send_link->wr_tx_bufs[pend->idx].raw,
> +	       sizeof(qentry.msg.delete_link));
> +
> +	qentry.private = wait.private;
> +	wait.private = &qentry;
> +
> +	add_wait_queue(&lgr->llc_msg_waiter, &wait);
> +
> +	/* send llc message */
> +	rc = smc_wr_tx_send(send_link, priv);
> +	smc_wr_tx_link_put(send_link);
> +	if (rc) {
> +		remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +		goto out;
> +	}
> +
> +	while (!signal_pending(current) && timeout) {
> +		timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
> +		if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
> +			break;
> +	}
> +
> +	remove_wait_queue(&lgr->llc_msg_waiter, &wait);
> +	flags = qentry.msg.raw.hdr.flags;
> +
> +	if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
>  		rc = -EFAULT;
>  out:
> -	if (qentry)
> -		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	return rc;
>  }
>  
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index 26f8f24..52af94f 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -37,16 +37,6 @@
>  static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
>  static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
>  
> -struct smc_wr_tx_pend {	/* control data for a pending send request */
> -	u64			wr_id;		/* work request id sent */
> -	smc_wr_tx_handler	handler;
> -	enum ib_wc_status	wc_status;	/* CQE status */
> -	struct smc_link		*link;
> -	u32			idx;
> -	struct smc_wr_tx_pend_priv priv;
> -	u8			compl_requested;
> -};
> -
>  /******************************** send queue *********************************/
>  
>  /*------------------------------- completion --------------------------------*/
> diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
> index a54e90a..9946ed5 100644
> --- a/net/smc/smc_wr.h
> +++ b/net/smc/smc_wr.h
> @@ -46,6 +46,16 @@ struct smc_wr_rx_handler {
>  	u8			type;
>  };
>  
> +struct smc_wr_tx_pend {	/* control data for a pending send request */
> +	u64			wr_id;		/* work request id sent */
> +	smc_wr_tx_handler	handler;
> +	enum ib_wc_status	wc_status;	/* CQE status */
> +	struct smc_link		*link;
> +	u32			idx;
> +	struct smc_wr_tx_pend_priv priv;
> +	u8			compl_requested;
> +};
> +
>  /* Only used by RDMA write WRs.
>   * All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly
>   */
> -- 
> 1.8.3.1


* Re: [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
  2022-08-10 17:47 ` [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
@ 2022-08-16  8:24   ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16  8:24 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Aug 11, 2022 at 01:47:38AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
> exclusive when the assigned rmb_desc is not yet registered, although it
> can run in parallel when the assigned rmb_desc is already registered,
> since it then only performs read semantics on it. Hence, we cannot
> simply replace the lock with a read semaphore.
> 
> The idea here is: if the assigned rmb_desc is already registered, use
> the read semaphore to protect the critical section; if it is not yet
> registered, keep using the write semaphore to preserve exclusivity.
> 
> Thanks to the reusability of rmb_desc, this allows us to run in
> parallel in most cases.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/af_smc.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 51b90e2..39dbf39 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -516,10 +516,25 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  {
>  	struct smc_link_group *lgr = link->lgr;
>  	int i, rc = 0;
> +	bool slow = false;

Consider do_slow?

Reverse Christmas tree.

>  
>  	rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
>  	if (rc)
>  		return rc;
> +
> +	down_read(&lgr->llc_conf_mutex);
> +	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
> +		if (!smc_link_active(&lgr->lnk[i]))
> +			continue;
> +		if (!rmb_desc->is_reg_mr[link->link_idx]) {
> +			up_read(&lgr->llc_conf_mutex);
> +			goto slow_path;
> +		}
> +	}
> +	/* mr register already */
> +	goto fast_path;
> +slow_path:
> +	slow = true;
>  	/* protect against parallel smc_llc_cli_rkey_exchange() and
>  	 * parallel smcr_link_reg_buf()
>  	 */
> @@ -531,7 +546,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  		if (rc)
>  			goto out;
>  	}
> -
> +fast_path:
>  	/* exchange confirm_rkey msg with peer */
>  	rc = smc_llc_do_confirm_rkey(link, rmb_desc);
>  	if (rc) {
> @@ -540,7 +555,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  	}
>  	rmb_desc->is_conf_rkey = true;
>  out:
> -	up_write(&lgr->llc_conf_mutex);
> +	slow ? up_write(&lgr->llc_conf_mutex) : up_read(&lgr->llc_conf_mutex);
>  	smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
>  	return rc;
>  }
> -- 
> 1.8.3.1


* Re: [PATCH net-next 09/10] net/smc: fix potential panic due to unprotected smc_llc_srv_add_link()
  2022-08-10 17:47 ` [PATCH net-next 09/10] net/smc: fix potential panic due to unprotected smc_llc_srv_add_link() D. Wythe
@ 2022-08-16  8:28   ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16  8:28 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Aug 11, 2022 at 01:47:40AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> After we optimize the parallel capability of SMC-R connection
> establishment, there is a certain chance to trigger the
> following panic:
> 
> PID: 5900   TASK: ffff88c1c8af4100  CPU: 1   COMMAND: "kworker/1:48"
>  #0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
>  #1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
>  #2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
>  #3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
>  #4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
>  #5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
>  #6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
>     [exception RIP: ib_alloc_mr+19]
>     RIP: ffffffffc0c9cce3  RSP: ffff9456c1cc7c38  RFLAGS: 00010202
>     RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000004
>     RDX: 0000000000000010  RSI: 0000000000000000  RDI: 0000000000000000
>     RBP: ffff88c1ea281d00   R8: 000000020a34ffff   R9: ffff88c1350bbb20
>     R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000000
>     R13: 0000000000000010  R14: ffff88c1ab040a50  R15: ffff88c1ea281d00
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
>  #8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
>  #9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
> 
> The reason here is that when the server tries to create a second link,
> smc_llc_srv_add_link() has no protection and may add a new link to the
> link group. This breaks the exclusive environment that llc_conf_mutex
> is meant to protect.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>

I am curious whether this patch can be merged with the previous one. It
seems that this panic is introduced by the previous one?

> ---
>  net/smc/af_smc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 39dbf39..0b0c53a 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1834,8 +1834,10 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc)
>  	smc_llc_link_active(link);
>  	smcr_lgr_set_type(link->lgr, SMC_LGR_SINGLE);
>  
> +	down_write(&link->lgr->llc_conf_mutex);
>  	/* initial contact - try to establish second link */
>  	smc_llc_srv_add_link(link, NULL);
> +	up_write(&link->lgr->llc_conf_mutex);
>  	return 0;
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
  2022-08-10 17:47 ` [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
@ 2022-08-16  8:37   ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16  8:37 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Aug 11, 2022 at 01:47:39AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> It's clear that rmbs_lock and sndbufs_lock aim to protect the
> rmbs list and the sndbufs list, respectively.
> 
> During conenction establieshment, smc_buf_get_slot() will always

conenction -> connection

> be invoke, and it only performs read semantics in rmbs list and

invoke -> invoked.

> sndbufs list.
> 
> Based on the above considerations, we replace the mutex with an
> rw_semaphore. Only smc_buf_get_slot() uses down_read(), allowing
> smc_buf_get_slot() to run concurrently; the other parts use
> down_write() to keep exclusive semantics.
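
(Worth noting for other reviewers: the read-locked list walk stays safe
because smc_buf_get_slot() claims a slot atomically,

	if (cmpxchg(&buf_slot->used, 0, 1) == 0)

so two concurrent readers cannot grab the same buffer.)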
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/smc_core.c | 55 +++++++++++++++++++++++++++---------------------------
>  net/smc/smc_core.h |  4 ++--
>  net/smc/smc_llc.c  | 16 ++++++++--------
>  3 files changed, 38 insertions(+), 37 deletions(-)
> 
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 113804d..b90970a 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1138,8 +1138,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
>  	lgr->freeing = 0;
>  	lgr->vlan_id = ini->vlan_id;
>  	refcount_set(&lgr->refcnt, 1); /* set lgr refcnt to 1 */
> -	mutex_init(&lgr->sndbufs_lock);
> -	mutex_init(&lgr->rmbs_lock);
> +	init_rwsem(&lgr->sndbufs_lock);
> +	init_rwsem(&lgr->rmbs_lock);
>  	rwlock_init(&lgr->conns_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		INIT_LIST_HEAD(&lgr->sndbufs[i]);
> @@ -1380,7 +1380,7 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>  			   struct smc_link_group *lgr)
>  {
> -	struct mutex *lock;	/* lock buffer list */
> +	struct rw_semaphore *lock;	/* lock buffer list */
>  	int rc;
>  
>  	if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
> @@ -1400,9 +1400,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>  		/* buf registration failed, reuse not possible */
>  		lock = is_rmb ? &lgr->rmbs_lock :
>  				&lgr->sndbufs_lock;
> -		mutex_lock(lock);
> +		down_write(lock);
>  		list_del(&buf_desc->list);
> -		mutex_unlock(lock);
> +		up_write(lock);
>  
>  		smc_buf_free(lgr, is_rmb, buf_desc);
>  	} else {
> @@ -1506,15 +1506,16 @@ static void smcr_buf_unmap_lgr(struct smc_link *lnk)
>  	int i;
>  
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
> -		mutex_lock(&lgr->rmbs_lock);
> +		down_write(&lgr->rmbs_lock);
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list)
>  			smcr_buf_unmap_link(buf_desc, true, lnk);
> -		mutex_unlock(&lgr->rmbs_lock);
> -		mutex_lock(&lgr->sndbufs_lock);
> +		up_write(&lgr->rmbs_lock);
> +
> +		down_write(&lgr->sndbufs_lock);
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i],
>  					 list)
>  			smcr_buf_unmap_link(buf_desc, false, lnk);
> -		mutex_unlock(&lgr->sndbufs_lock);
> +		up_write(&lgr->sndbufs_lock);
>  	}
>  }
>  
> @@ -2324,19 +2325,19 @@ int smc_uncompress_bufsize(u8 compressed)
>   * buffer size; if not available, return NULL
>   */
>  static struct smc_buf_desc *smc_buf_get_slot(int compressed_bufsize,
> -					     struct mutex *lock,
> +					     struct rw_semaphore *lock,
>  					     struct list_head *buf_list)
>  {
>  	struct smc_buf_desc *buf_slot;
>  
> -	mutex_lock(lock);
> +	down_read(lock);
>  	list_for_each_entry(buf_slot, buf_list, list) {
>  		if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
> -			mutex_unlock(lock);
> +			up_read(lock);
>  			return buf_slot;
>  		}
>  	}
> -	mutex_unlock(lock);
> +	up_read(lock);
>  	return NULL;
>  }
>  
> @@ -2445,13 +2446,13 @@ int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
>  	return 0;
>  }
>  
> -static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
> +static int _smcr_buf_map_lgr(struct smc_link *lnk, struct rw_semaphore *lock,
>  			     struct list_head *lst, bool is_rmb)
>  {
>  	struct smc_buf_desc *buf_desc, *bf;
>  	int rc = 0;
>  
> -	mutex_lock(lock);
> +	down_write(lock);
>  	list_for_each_entry_safe(buf_desc, bf, lst, list) {
>  		if (!buf_desc->used)
>  			continue;
> @@ -2460,7 +2461,7 @@ static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
>  			goto out;
>  	}
>  out:
> -	mutex_unlock(lock);
> +	up_write(lock);
>  	return rc;
>  }
>  
> @@ -2493,37 +2494,37 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
>  	int i, rc = 0;
>  
>  	/* reg all RMBs for a new link */
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
>  			if (!buf_desc->used)
>  				continue;
>  			rc = smcr_link_reg_buf(lnk, buf_desc);
>  			if (rc) {
> -				mutex_unlock(&lgr->rmbs_lock);
> +				up_write(&lgr->rmbs_lock);
>  				return rc;
>  			}
>  		}
>  	}
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  
>  	if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
>  		return rc;
>  
>  	/* reg all vzalloced sndbufs for a new link */
> -	mutex_lock(&lgr->sndbufs_lock);
> +	down_write(&lgr->sndbufs_lock);
>  	for (i = 0; i < SMC_RMBE_SIZES; i++) {
>  		list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
>  			if (!buf_desc->used || !buf_desc->is_vm)
>  				continue;
>  			rc = smcr_link_reg_buf(lnk, buf_desc);
>  			if (rc) {
> -				mutex_unlock(&lgr->sndbufs_lock);
> +				up_write(&lgr->sndbufs_lock);
>  				return rc;
>  			}
>  		}
>  	}
> -	mutex_unlock(&lgr->sndbufs_lock);
> +	up_write(&lgr->sndbufs_lock);
>  	return rc;
>  }
>  
> @@ -2641,7 +2642,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  	struct list_head *buf_list;
>  	int bufsize, bufsize_short;
>  	bool is_dgraded = false;
> -	struct mutex *lock;	/* lock buffer list */
> +	struct rw_semaphore *lock;	/* lock buffer list */
>  	int sk_buf_size;
>  
>  	if (is_rmb)
> @@ -2689,9 +2690,9 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  		SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
>  		SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
>  		buf_desc->used = 1;
> -		mutex_lock(lock);
> +		down_write(lock);
>  		list_add(&buf_desc->list, buf_list);
> -		mutex_unlock(lock);
> +		up_write(lock);
>  		break; /* found */
>  	}
>  
> @@ -2765,9 +2766,9 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
>  	/* create rmb */
>  	rc = __smc_buf_create(smc, is_smcd, true);
>  	if (rc) {
> -		mutex_lock(&smc->conn.lgr->sndbufs_lock);
> +		down_write(&smc->conn.lgr->sndbufs_lock);
>  		list_del(&smc->conn.sndbuf_desc->list);
> -		mutex_unlock(&smc->conn.lgr->sndbufs_lock);
> +		up_write(&smc->conn.lgr->sndbufs_lock);
>  		smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc);
>  		smc->conn.sndbuf_desc = NULL;
>  	}
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 559d330..008148c 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -300,9 +300,9 @@ struct smc_link_group {
>  	unsigned short		vlan_id;	/* vlan id of link group */
>  
>  	struct list_head	sndbufs[SMC_RMBE_SIZES];/* tx buffers */
> -	struct mutex		sndbufs_lock;	/* protects tx buffers */
> +	struct rw_semaphore	sndbufs_lock;	/* protects tx buffers */
>  	struct list_head	rmbs[SMC_RMBE_SIZES];	/* rx buffers */
> -	struct mutex		rmbs_lock;	/* protects rx buffers */
> +	struct rw_semaphore	rmbs_lock;	/* protects rx buffers */
>  
>  	u8			id[SMC_LGR_ID_SIZE];	/* unique lgr id */
>  	struct delayed_work	free_work;	/* delayed freeing of an lgr */
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index d744937..76f9906 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -642,7 +642,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
>  
>  	prim_lnk_idx = link->link_idx;
>  	lnk_idx = link_new->link_idx;
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	ext->num_rkeys = lgr->conns_num;
>  	if (!ext->num_rkeys)
>  		goto out;
> @@ -662,7 +662,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
>  	}
>  	len += i * sizeof(ext->rt[0]);
>  out:
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return len;
>  }
>  
> @@ -923,7 +923,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
>  	int rc = 0;
>  	int i;
>  
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	num_rkeys_send = lgr->conns_num;
>  	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
>  	do {
> @@ -950,7 +950,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
>  			break;
>  	} while (num_rkeys_send || num_rkeys_recv);
>  
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return rc;
>  }
>  
> @@ -1033,14 +1033,14 @@ static void smc_llc_save_add_link_rkeys(struct smc_link *link,
>  	ext = (struct smc_llc_msg_add_link_v2_ext *)((u8 *)lgr->wr_rx_buf_v2 +
>  						     SMC_WR_TX_SIZE);
>  	max = min_t(u8, ext->num_rkeys, SMC_LLC_RKEYS_PER_MSG_V2);
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	for (i = 0; i < max; i++) {
>  		smc_rtoken_set(lgr, link->link_idx, link_new->link_idx,
>  			       ext->rt[i].rmb_key,
>  			       ext->rt[i].rmb_vaddr_new,
>  			       ext->rt[i].rmb_key_new);
>  	}
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  }
>  
>  static void smc_llc_save_add_link_info(struct smc_link *link,
> @@ -1349,7 +1349,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
>  	int rc = 0;
>  	int i;
>  
> -	mutex_lock(&lgr->rmbs_lock);
> +	down_write(&lgr->rmbs_lock);
>  	num_rkeys_send = lgr->conns_num;
>  	buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
>  	do {
> @@ -1374,7 +1374,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
>  		smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
>  	} while (num_rkeys_send || num_rkeys_recv);
>  out:
> -	mutex_unlock(&lgr->rmbs_lock);
> +	up_write(&lgr->rmbs_lock);
>  	return rc;
>  }
>  
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
                   ` (11 preceding siblings ...)
  2022-08-11 12:31 ` Karsten Graul
@ 2022-08-16  9:35 ` Jan Karcher
  2022-08-16 12:40   ` Tony Lu
  2022-08-17  4:55   ` D. Wythe
  12 siblings, 2 replies; 29+ messages in thread
From: Jan Karcher @ 2022-08-16  9:35 UTC (permalink / raw)
  To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 10.08.2022 19:47, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch set attempts to optimize the parallelism of SMC-R connections,
> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> occur after thoses optimization.
>

Thank you again for your submission!
Let me give you a quick update from our side:
We tested your patches on top of the net-next kernel on our s390
systems. They did crash our systems. After verifying our environment we
pulled console logs, and now we can tell that there is indeed a problem
with your patches regarding SMC-D. So please do not integrate this
change as of right now. I'm going to do more in-depth reviews of your
patches, but I need some time for them, so here is a quick description
of the problem:

It is an SMC-D problem that occurs while building up the connection. In
smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the
SMC-R path you do grab the pointer; for SMC-D that never happens. Still
you are using this reference for SMC-D => crash. This problem can be
reproduced using the SMC-D path. Here is an example console output:

[  779.516382] Unable to handle kernel pointer dereference in virtual 
kernel address space
[  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
[  779.516391] Fault in home space mode while using kernel ASCE.
[  779.516395] AS:0000000069628007 R3:00000000ffbf0007 
S:00000000ffbef800 P:000000000000003d
[  779.516431] Oops: 0004 ilc:2 [#1] SMP
[  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib 
ib_uverbs mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
6 nf_defrag_ipv4 ip_set nf_tables n
[  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 
5.19.0-13940-g22a46254655a #3
[  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)

[  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
[  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c 
(smc_conn_create+0x174/0x968 [smc])
[  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 
PM:0 RI:0 EA:3
[  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 
0000000000000001 0000000000000000
[  779.522773]            000000008a4128a0 000003ff803f21aa 
000000008e30d640 0000000086d72000
[  779.522776]            0000000086d72000 000000008a412803 
000000008a412800 000000008e30d650
[  779.522779]            0000000080934200 0000000000000000 
000003ff803cb954 00000380002dfa88
[  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg 
%r1,232(%r15)
[  779.522789]            000003ff803da894: a7180000            lhi 
%r1,0
[  779.522789]           #000003ff803da898: 582003ac            l 
%r2,940
[  779.522789]           >000003ff803da89c: ba123020            cs 
%r1,%r2,32(%r3)
[  779.522789]            000003ff803da8a0: ec1603be007e        cij 
%r1,0,6,000003ff803db01c

[  779.522789]            000003ff803da8a6: 4110b002            la 
%r1,2(%r11)
[  779.522789]            000003ff803da8aa: e310f0f00024        stg 
%r1,240(%r15)
[  779.522789]            000003ff803da8b0: e310f0c00004        lg 
%r1,192(%r15)
[  779.522870] Call Trace:
[  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
[  779.522884]  [<000003ff803cb954>] 
smc_find_ism_v2_device_serv+0x1b4/0x300 [smc]
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP 
stop from CPU 01.
01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP 
stop from CPU 00.
[  779.522894]  [<000003ff803cbace>] smc_listen_find_device+0x2e/0x370 [smc]


I'm going to send the review for the first patch right away (which is
the one causing the crash); so far I'm done with it. The others are
going to follow. Maybe you can look over the problem and come up with a
solution; otherwise we are going to decide whether we want to look into
it ourselves as soon as I'm done with the reviews. Thank you for your
patience.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
  2022-08-11  3:41   ` kernel test robot
  2022-08-11 11:51   ` kernel test robot
@ 2022-08-16  9:43   ` Jan Karcher
  2022-08-16 12:47     ` Tony Lu
  2022-08-16 12:52   ` Tony Lu
  3 siblings, 1 reply; 29+ messages in thread
From: Jan Karcher @ 2022-08-16  9:43 UTC (permalink / raw)
  To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 10.08.2022 19:47, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch attempts to remove locks named smc_client_lgr_pending and
> smc_server_lgr_pending, which aim to serialize the creation of link
> group. However, once the link group already exists, those locks are
> meaningless; worse still, they make incoming connections have to be
> queued one after the other.
> 
> Now, the creation of a link group is no longer driven by competition,
> but allocated through the following strategy.
> 
> 1. Try to find a suitable link group; if successful, the current
> connection is considered a NON first contact connection. Ends.
> 
> 2. Check the number of connections currently waiting for a suitable
> link group to be created; if it is not less than the number of link
> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> increase the number of link groups to be created, and the current
> connection is considered the first contact connection. Ends.
> 
> 3. Increase the number of connections currently waiting, and wait to
> be woken up.
> 
> 4. Decrease the number of connections currently waiting, goto 1.
> 
> We wake up the connection that was put to sleep in stage 3 through
> the SMC link state change event. Once the link moves out of the
> SMC_LNK_ACTIVATING state, decrease the number of link groups to
> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> connections.
> 
> In the implementation, we introduce the concept of a lnk cluster, which
> is a collection of links with the same characteristics (see
> smcr_lnk_cluster_cmpfn() for more details), which makes it possible to
> wake up efficiently in the N vs. 1 scenario.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>   net/smc/af_smc.c   |  11 +-
>   net/smc/smc_core.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   net/smc/smc_core.h |  48 ++++++++
>   net/smc/smc_llc.c  |   9 +-
>   4 files changed, 411 insertions(+), 13 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 79c1318..af4b0aa 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
>   	if (reason_code)
>   		return reason_code;
> 
> -	mutex_lock(&smc_client_lgr_pending);
>   	reason_code = smc_conn_create(smc, ini);
>   	if (reason_code) {
> -		mutex_unlock(&smc_client_lgr_pending);
>   		return reason_code;
>   	}
> 
> @@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
>   		if (reason_code)
>   			goto connect_abort;
>   	}
> -	mutex_unlock(&smc_client_lgr_pending);
> 
>   	smc_copy_sock_settings_to_clc(smc);
>   	smc->connect_nonblock = 0;
> @@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
>   	return 0;
>   connect_abort:
>   	smc_conn_abort(smc, ini->first_contact_local);
> -	mutex_unlock(&smc_client_lgr_pending);
>   	smc->connect_nonblock = 0;
> 
>   	return reason_code;


You are removing the locking mechanism from this function completely,
which is fine because it is only called for an SMC-R connection.


> @@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
>   	if (rc)
>   		goto out_decl;
> 
> -	mutex_lock(&smc_server_lgr_pending);
> +	if (ini->is_smcd)
> +		mutex_lock(&smc_server_lgr_pending);
>   	smc_close_init(new_smc);
>   	smc_rx_init(new_smc);
>   	smc_tx_init(new_smc);
> @@ -2415,7 +2412,6 @@ static void smc_listen_work(struct work_struct *work)
>   					    ini->first_contact_local, ini);
>   		if (rc)
>   			goto out_unlock;
> -		mutex_unlock(&smc_server_lgr_pending);
>   	}
>   	smc_conn_save_peer_info(new_smc, cclc);
>   	smc_listen_out_connected(new_smc);
> @@ -2423,7 +2419,8 @@ static void smc_listen_work(struct work_struct *work)
>   	goto out_free;
> 
>   out_unlock:
> -	mutex_unlock(&smc_server_lgr_pending);
> +	if (ini->is_smcd)
> +		mutex_unlock(&smc_server_lgr_pending);


You want to remove the mutex lock for SMC-R, so you are only locking for
an SMC-D connection. So far so good. I think you could also remove this
unlock call, since it is only reached in the case of an SMC-R connection
- which means it is obsolete:

l2398 ff. (with your patch on net-next)

     /* receive SMC Confirm CLC message */
     memset(buf, 0, sizeof(*buf));
     cclc = (struct smc_clc_msg_accept_confirm *)buf;
     rc = smc_clc_wait_msg(new_smc, cclc, sizeof(*buf),
                   SMC_CLC_CONFIRM, CLC_WAIT_TIME);
     if (rc) {
x        if (!ini->is_smcd)
x            goto out_unlock;
         goto out_decl;
     }
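
If I read the flow correctly, the conditional unlock is then never
taken, so the two marked lines could simply collapse into (sketch,
assuming no SMC-D path still jumps to out_unlock):

     if (rc)
         goto out_decl;

and the out_unlock detour for this call site could go away entirely.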

>   out_decl:
>   	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
>   			   proposal_version);
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index ff49a11..a3338cc 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = {	/* established link groups */
>   	.num = 0,
>   };
> 
> +struct smc_lgr_manager smc_lgr_manager = {
> +	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
> +};
> +
>   static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
>   static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
> 
> @@ -55,6 +59,282 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
> 
>   static void smc_link_down_work(struct work_struct *work);
> 
> +/* SMC-R lnk cluster compare func
> + * All lnks that meet the description conditions of this function
> + * are logically aggregated, called lnk cluster.
> + * For the server side, lnk cluster is used to determine whether
> + * a new group needs to be created when processing new imcoming connections.
> + * For the client side, lnk cluster is used to determine whether
> + * to wait for link ready (in other words, first contact ready).
> + */
> +static int smcr_lnk_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
> +{
> +	const struct smc_lnk_cluster_compare_arg *key = arg->key;
> +	const struct smc_lnk_cluster *lnkc = obj;
> +
> +	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
> +		return 1;
> +
> +	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
> +		return 1;
> +
> +	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
> +	    (key->smcr_version == SMC_V2 ||
> +	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
> +		return 0;
> +
> +	return 1;
> +}
> +
> +/* SMC-R lnk cluster hash func */
> +static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
> +{
> +	const struct smc_lnk_cluster *lnkc = data;
> +
> +	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> +		+ (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
> +}
> +
> +/* SMC-R lnk cluster compare arg hash func */
> +static u32 smcr_lnk_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
> +{
> +	const struct smc_lnk_cluster_compare_arg *key = data;
> +
> +	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> +		+ (key->role == SMC_SERV) ? 0 : key->clcqpn;
> +}
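
Note that + binds tighter than ?: in C, so both hash functions as
written evaluate as

	(jhash2(...) + (role == SMC_SERV)) ? 0 : clcqpn

which is almost certainly not what is intended. Parentheses around the
conditional would fix it, e.g. (untested):

	return jhash2((u32 *)key->peer_systemid,
		      SMC_SYSTEMID_LEN / sizeof(u32), seed) +
	       ((key->role == SMC_SERV) ? 0 : key->clcqpn);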
> +
> +static const struct rhashtable_params smcr_lnk_cluster_rhl_params = {
> +	.head_offset = offsetof(struct smc_lnk_cluster, rnode),
> +	.key_len = sizeof(struct smc_lnk_cluster_compare_arg),
> +	.obj_cmpfn = smcr_lnk_cluster_cmpfn,
> +	.obj_hashfn = smcr_lnk_cluster_hashfn,
> +	.hashfn = smcr_lnk_cluster_compare_arg_hashfn,
> +	.automatic_shrinking = true,
> +};
> +
> +/* hold a reference for smc_lnk_cluster */
> +static inline void smc_lnk_cluster_hold(struct smc_lnk_cluster *lnkc)
> +{
> +	if (likely(lnkc))
> +		refcount_inc(&lnkc->ref);
> +}
> +
> +/* release a reference for smc_lnk_cluster */
> +static inline void smc_lnk_cluster_put(struct smc_lnk_cluster *lnkc)
> +{
> +	bool do_free = false;
> +
> +	if (!lnkc)
> +		return;
> +
> +	if (refcount_dec_not_one(&lnkc->ref))
> +		return;
> +
> +	spin_lock_bh(&smc_lgr_manager.lock);
> +	/* last ref */
> +	if (refcount_dec_and_test(&lnkc->ref)) {
> +		do_free = true;
> +		rhashtable_remove_fast(&smc_lgr_manager.lnk_cluster_maps, &lnkc->rnode,
> +				       smcr_lnk_cluster_rhl_params);
> +	}
> +	spin_unlock_bh(&smc_lgr_manager.lock);
> +	if (do_free)
> +		kfree(lnkc);
> +}
> +
> +/* Get or create smc_lnk_cluster by key
> + * This function will hold a reference of returned smc_lnk_cluster
> + * or create a new smc_lnk_cluster with the reference initialized to 1。
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *
> +smcr_lnk_get_or_create_cluster(struct smc_lnk_cluster_compare_arg *key)
> +{
> +	struct smc_lnk_cluster *lnkc, *tmp_lnkc;
> +	bool busy_retry;
> +	int err;
> +
> +	/* serving a hardware or software interrupt, or preemption is disabled */
> +	busy_retry = !in_interrupt();
> +
> +	spin_lock_bh(&smc_lgr_manager.lock);
> +	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> +				      smcr_lnk_cluster_rhl_params);
> +	if (!lnkc) {
> +		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
> +		if (unlikely(!lnkc))
> +			goto fail;
> +
> +		/* init cluster */
> +		spin_lock_init(&lnkc->lock);
> +		lnkc->role = key->role;
> +		if (key->role == SMC_CLNT)
> +			lnkc->clcqpn = key->clcqpn;
> +		init_waitqueue_head(&lnkc->first_contact_waitqueue);
> +		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
> +		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
> +		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
> +		refcount_set(&lnkc->ref, 1);
> +
> +		do {
> +			err = rhashtable_insert_fast(&smc_lgr_manager.lnk_cluster_maps,
> +						     &lnkc->rnode, smcr_lnk_cluster_rhl_params);
> +
> +			/* success or fatal error */
> +			if (err != -EBUSY)
> +				break;
> +
> +			/* impossible in fact right now */
> +			if (unlikely(!busy_retry)) {
> +				pr_warn_ratelimited("smc: create lnk cluster in softirq\n");
> +				break;
> +			}
> +
> +			spin_unlock_bh(&smc_lgr_manager.lock);
> +			/* yeild */
> +			cond_resched();
> +			spin_lock_bh(&smc_lgr_manager.lock);
> +
> +			/* after spin_unlock_bh(), lnk_cluster_maps may be changed */
> +			tmp_lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> +							  smcr_lnk_cluster_rhl_params);
> +
> +			if (unlikely(tmp_lnkc)) {
> +				pr_warn_ratelimited("smc: create cluster failed dues to duplicat key");
> +				kfree(lnkc);
> +				lnkc = NULL;
> +				goto fail;
> +			}
> +		} while (1);
> +
> +		if (unlikely(err)) {
> +			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)", err);
> +			kfree(lnkc);
> +			lnkc = NULL;
> +		}
> +	} else {
> +		smc_lnk_cluster_hold(lnkc);
> +	}
> +fail:
> +	spin_unlock_bh(&smc_lgr_manager.lock);
> +	return lnkc;
> +}
> +
> +/* Get or create a smc_lnk_cluster by lnk
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *smcr_lnk_get_cluster(struct smc_link *lnk)
> +{
> +	struct smc_lnk_cluster_compare_arg key;
> +	struct smc_link_group *lgr;
> +
> +	lgr = lnk->lgr;
> +	if (!lgr || lgr->is_smcd)
> +		return NULL;
> +
> +	key.smcr_version = lgr->smc_version;
> +	key.peer_systemid = lgr->peer_systemid;
> +	key.peer_gid = lnk->peer_gid;
> +	key.peer_mac = lnk->peer_mac;
> +	key.role	 = lgr->role;
> +	if (key.role == SMC_CLNT)
> +		key.clcqpn = lnk->peer_qpn;
> +
> +	return smcr_lnk_get_or_create_cluster(&key);
> +}
> +
> +/* Get or create a smc_lnk_cluster by ini
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *
> +smcr_lnk_get_cluster_by_ini(struct smc_init_info *ini, int role)
> +{
> +	struct smc_lnk_cluster_compare_arg key;
> +
> +	if (ini->is_smcd)
> +		return NULL;
> +
> +	key.smcr_version = ini->smcr_version;
> +	key.peer_systemid = ini->peer_systemid;
> +	key.peer_gid = ini->peer_gid;
> +	key.peer_mac = ini->peer_mac;
> +	key.role	= role;
> +	if (role == SMC_CLNT)
> +		key.clcqpn	= ini->ib_clcqpn;
> +
> +	return smcr_lnk_get_or_create_cluster(&key);
> +}
> +
> +/* callback when smc link state change */
> +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk)
> +{
> +	struct smc_lnk_cluster *lnkc;
> +	int nr = 0;
> +
> +	/* barrier for lnk->state */
> +	smp_mb();
> +
> +	/* only first link can made connections block on
> +	 * first_contact_waitqueue
> +	 */
> +	if (lnk->link_idx != SMC_SINGLE_LINK)
> +		return;
> +
> +	/* state already seen  */
> +	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
> +		return;
> +
> +	lnkc = smcr_lnk_get_cluster(lnk);
> +
> +	if (unlikely(!lnkc))
> +		return;
> +
> +	spin_lock_bh(&lnkc->lock);
> +
> +	/* all lnk state change should be
> +	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DWON (link init failed)

Should this really be DWON and not DOWN?

> +	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DWON
> +	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> +	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> +	 * 5. SMC_LNK_UNUSED -> SMC_LNK_ATIVATING -> SMC_LNK_ACTIVE ->SMC_LNK_INACTIVE
> +	 * -> SMC_LNK_TEAR_DWON
> +	 */
> +	switch (lnk->state) {
> +	case SMC_LNK_ACTIVATING:
> +		/* It's safe to hold a reference without lock
> +		 * dues to the smcr_lnk_get_cluster already hold one
> +		 */
> +		smc_lnk_cluster_hold(lnkc);
> +		break;
> +	case SMC_LNK_TEAR_DWON:
> +		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
> +			/* smc_lnk_cluster_hold in SMC_LNK_ACTIVATING */
> +			smc_lnk_cluster_put(lnkc);
> +		fallthrough;
> +	case SMC_LNK_ACTIVE:
> +	case SMC_LNK_INACTIVE:
> +		if (!(lnk->state_record &
> +			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
> +			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
> +			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
> +			/* TODO: wakeup just one to perfrom first contact
> +			 * if record state has no SMC_LNK_ACTIVE
> +			 */


Please don't leave a TODO like this in the patch.

> +			nr = SMC_RMBS_PER_LGR_MAX - 1;
> +		}
> +		break;
> +	case SMC_LNK_UNUSED:
> +		pr_warn_ratelimited("net/smc: invalid lnk state. ");
> +		break;
> +	}
> +	SMC_LNK_STATE_RECORD(lnk, lnk->state);
> +	spin_unlock_bh(&lnkc->lock);
> +	if (nr)
> +		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
> +	smc_lnk_cluster_put(lnkc);	/* smc_lnk_cluster_hold in smcr_lnk_get_cluster */
> +}
> +
>   /* return head of link group list and its lock for a given link group */
>   static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
>   						  spinlock_t **lgr_lock)
> @@ -651,8 +931,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
>   	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
>   		struct smc_link *lnk = &lgr->lnk[i];
> 
> -		if (smc_link_sendable(lnk))
> +		if (smc_link_sendable(lnk)) {
>   			lnk->state = SMC_LNK_INACTIVE;
> +			smcr_lnk_cluster_on_lnk_state(lnk);
> +		}
>   	}
>   	wake_up_all(&lgr->llc_msg_waiter);
>   	wake_up_all(&lgr->llc_flow_waiter);
> @@ -762,6 +1044,9 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>   	atomic_set(&lnk->conn_cnt, 0);
>   	smc_llc_link_set_uid(lnk);
>   	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
> +	lnk->peer_qpn = ini->ib_clcqpn;
> +	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
> +	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
>   	if (!lnk->smcibdev->initialized) {
>   		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
>   		if (rc)
> @@ -792,6 +1077,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>   	if (rc)
>   		goto destroy_qp;
>   	lnk->state = SMC_LNK_ACTIVATING;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>   	return 0;
> 
>   destroy_qp:
> @@ -806,6 +1092,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>   	smc_ibdev_cnt_dec(lnk);
>   	put_device(&lnk->smcibdev->ibdev->dev);
>   	smcibdev = lnk->smcibdev;
> +	lnk->state = SMC_LNK_TEAR_DWON;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>   	memset(lnk, 0, sizeof(struct smc_link));
>   	lnk->state = SMC_LNK_UNUSED;
>   	if (!atomic_dec_return(&smcibdev->lnk_cnt))
> @@ -1263,6 +1551,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
>   	if (!lnk->lgr || lnk->clearing ||
>   	    lnk->state == SMC_LNK_UNUSED)
>   		return;
> +	lnk->state = SMC_LNK_TEAR_DWON;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>   	lnk->clearing = 1;
>   	lnk->peer_qpn = 0;
>   	smc_llc_link_clear(lnk, log);
> @@ -1712,6 +2002,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
>   {
>   	if (smc_link_downing(&lnk->state)) {
>   		trace_smcr_link_down(lnk, __builtin_return_address(0));
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>   		smcr_link_down(lnk);
>   	}
>   }
> @@ -1721,6 +2012,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
>   {
>   	if (smc_link_downing(&lnk->state)) {
>   		trace_smcr_link_down(lnk, __builtin_return_address(0));
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>   		schedule_work(&lnk->link_down_wrk);
>   	}
>   }
> @@ -1850,11 +2142,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>   {
>   	struct smc_connection *conn = &smc->conn;
>   	struct net *net = sock_net(&smc->sk);
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct smc_lnk_cluster *lnkc = NULL;

Here lnkc is declared as NULL.

>   	struct list_head *lgr_list;
>   	struct smc_link_group *lgr;
>   	enum smc_lgr_role role;
>   	spinlock_t *lgr_lock;
> -	int rc = 0;
> +	int rc = 0, timeo = CLC_WAIT_TIME;
> 
>   	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
>   				  &smc_lgr_list.list;
> @@ -1862,12 +2156,26 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>   				  &smc_lgr_list.lock;
>   	ini->first_contact_local = 1;
>   	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
> -	if (role == SMC_CLNT && ini->first_contact_peer)
> +
> +	if (!ini->is_smcd) {
> +		lnkc = smcr_lnk_get_cluster_by_ini(ini, role);

Here lnkc is set only if it is SMC-R.

> +		if (unlikely(!lnkc))
> +			return SMC_CLC_DECL_INTERR;
> +	}
> +
> +	if (role == SMC_CLNT && ini->first_contact_peer) {
> +		/* first_contact */
> +		spin_lock_bh(&lnkc->lock);

And here SMC-D dies because of the NULL pointer. This kills our systems
whenever we try to talk via SMC-D.

[  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
[  779.516391] Fault in home space mode while using kernel ASCE.
[  779.516395] AS:0000000069628007 R3:00000000ffbf0007 
S:00000000ffbef800 P:000000000000003d
[  779.516431] Oops: 0004 ilc:2 [#1] SMP
[  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib 
ib_uverbs mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
6 nf_defrag_ipv4 ip_set nf_tables n
[  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 
5.19.0-13940-g22a46254655a #3
[  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)

[  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
[  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c 
(smc_conn_create+0x174/0x968 [smc])
[  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 
PM:0 RI:0 EA:3
[  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 
0000000000000001 0000000000000000
[  779.522773]            000000008a4128a0 000003ff803f21aa 
000000008e30d640 0000000086d72000
[  779.522776]            0000000086d72000 000000008a412803 
000000008a412800 000000008e30d650
[  779.522779]            0000000080934200 0000000000000000 
000003ff803cb954 00000380002dfa88
[  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg 
%r1,232(%r15)
[  779.522789]            000003ff803da894: a7180000            lhi %r1,0
[  779.522789]           #000003ff803da898: 582003ac            l %r2,940
[  779.522789]           >000003ff803da89c: ba123020            cs 
%r1,%r2,32(%r3)
[  779.522789]            000003ff803da8a0: ec1603be007e        cij 
%r1,0,6,000003ff803db01c

[  779.522789]            000003ff803da8a6: 4110b002            la 
%r1,2(%r11)
[  779.522789]            000003ff803da8aa: e310f0f00024        stg 
%r1,240(%r15)
[  779.522789]            000003ff803da8b0: e310f0c00004        lg 
%r1,192(%r15)
[  779.522870] Call Trace:
[  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
[  779.522884]  [<000003ff803cb954>] 
smc_find_ism_v2_device_serv+0x1b4/0x300 [smc]
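
For illustration only, a minimal guard around the lnkc usage here
(untested sketch; restructuring the SMC-D path may well be the cleaner
fix) could look like:

	if (role == SMC_CLNT && ini->first_contact_peer) {
		if (lnkc) {
			/* first contact: reserve wakeup capacity */
			spin_lock_bh(&lnkc->lock);
			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
			spin_unlock_bh(&lnkc->lock);
		}
		/* create new link group as well */
		goto create;
	}

The later spin_lock(&lnkc->lock) taken before walking lgr_list would
need the same NULL check for the SMC-D case.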

> +		lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> +		spin_unlock_bh(&lnkc->lock);
>   		/* create new link group as well */
>   		goto create;
> +	}
> 
>   	/* determine if an existing link group can be reused */
>   	spin_lock_bh(lgr_lock);
> +	spin_lock(&lnkc->lock);
> +again:
>   	list_for_each_entry(lgr, lgr_list, list) {
>   		write_lock_bh(&lgr->conns_lock);
>   		if ((ini->is_smcd ?
> @@ -1894,9 +2202,33 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>   		}
>   		write_unlock_bh(&lgr->conns_lock);
>   	}
> +	if (lnkc && ini->first_contact_local) {
> +		if (lnkc->pending_capability > lnkc->conns_pending) {
> +			lnkc->conns_pending++;
> +			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> +			spin_unlock(&lnkc->lock);
> +			spin_unlock_bh(lgr_lock);
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			/* need to wait at least once first contact done */
> +			timeo = schedule_timeout(timeo);
> +			set_current_state(TASK_RUNNING);
> +			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> +			spin_lock_bh(lgr_lock);
> +			spin_lock(&lnkc->lock);
> +
> +			lnkc->conns_pending--;
> +			if (timeo)
> +				goto again;
> +		}
> +		if (role == SMC_SERV) {
> +			/* first_contact */
> +			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> +		}
> +	}
> +	spin_unlock(&lnkc->lock);
>   	spin_unlock_bh(lgr_lock);
>   	if (rc)
> -		return rc;
> +		goto out;
> 
>   	if (role == SMC_CLNT && !ini->first_contact_peer &&
>   	    ini->first_contact_local) {
> @@ -1904,7 +2236,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>   		 * a new one
>   		 * send out_of_sync decline, reason synchr. error
>   		 */
> -		return SMC_CLC_DECL_SYNCERR;
> +		rc = SMC_CLC_DECL_SYNCERR;
> +		goto out;
>   	}
> 
>   create:
> @@ -1941,6 +2274,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>   #endif
> 
>   out:
> +	/* smc_lnk_cluster_hold in smcr_lnk_get_or_create_cluster */
> +	smc_lnk_cluster_put(lnkc);
>   	return rc;
>   }
> 
> @@ -2599,12 +2934,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
> 
>   int __init smc_core_init(void)
>   {
> +	/* init smc lnk cluster maps */
> +	rhashtable_init(&smc_lgr_manager.lnk_cluster_maps, &smcr_lnk_cluster_rhl_params);
>   	return register_reboot_notifier(&smc_reboot_notifier);
>   }
> 
> +static void smc_lnk_cluster_free_cb(void *ptr, void *arg)
> +{
> +	pr_warn("smc: smc lnk cluster refcnt leak.\n");
> +	kfree(ptr);
> +}
> +
>   /* Called (from smc_exit) when module is removed */
>   void smc_core_exit(void)
>   {
>   	unregister_reboot_notifier(&smc_reboot_notifier);
>   	smc_lgrs_shutdown();
> +	/* destroy smc lnk cluster maps */
> +	rhashtable_free_and_destroy(&smc_lgr_manager.lnk_cluster_maps, smc_lnk_cluster_free_cb,
> +				    NULL);
>   }
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index fe8b524..199f533 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -15,6 +15,7 @@
>   #include <linux/atomic.h>
>   #include <linux/smc.h>
>   #include <linux/pci.h>
> +#include <linux/rhashtable.h>
>   #include <rdma/ib_verbs.h>
>   #include <net/genetlink.h>
> 
> @@ -29,18 +30,62 @@ struct smc_lgr_list {			/* list of link group definition */
>   	u32			num;	/* unique link group number */
>   };
> 
> +struct smc_lgr_manager {		/* manager for link group */
> +	struct rhashtable	lnk_cluster_maps;	/* maps of smc_lnk_cluster */
> +	spinlock_t		lock;	/* lock for lgr_cm_maps */
> +};
> +
> +struct smc_lnk_cluster {
> +	struct rhash_head	rnode;	/* node for rhashtable */
> +	struct wait_queue_head	first_contact_waitqueue;
> +					/* queue for non first contact to wait
> +					 * first contact to be established.
> +					 */
> +	spinlock_t		lock;	/* protection for link group */
> +	refcount_t		ref;	/* refcount for cluster */
> +	unsigned long		pending_capability;
> +					/* maximum pending number of connections that
> +					 * need wait first contact complete.
> +					 */
> +	unsigned long		conns_pending;
> +					/* connections that are waiting for first contact
> +					 * complete
> +					 */
> +	u8		peer_systemid[SMC_SYSTEMID_LEN];
> +	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
> +	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer*/
> +	int		clcqpn;
> +	int		role;
> +};
> +
>   enum smc_lgr_role {		/* possible roles of a link group */
>   	SMC_CLNT,	/* client */
>   	SMC_SERV	/* server */
>   };
> 
> +struct smc_lnk_cluster_compare_arg	/* key for smc_lnk_cluster */
> +{
> +	int	smcr_version;
> +	enum smc_lgr_role role;
> +	u8	*peer_systemid;
> +	u8	*peer_gid;
> +	u8	*peer_mac;
> +	int clcqpn;
> +};
> +
>   enum smc_link_state {			/* possible states of a link */
>   	SMC_LNK_UNUSED,		/* link is unused */
>   	SMC_LNK_INACTIVE,	/* link is inactive */
>   	SMC_LNK_ACTIVATING,	/* link is being activated */
>   	SMC_LNK_ACTIVE,		/* link is active */
> +	SMC_LNK_TEAR_DWON,	/* link is tear down */
>   };
> 
> +#define SMC_LNK_STATE_BIT(state)	(1 << (state))
> +
> +#define	SMC_LNK_STATE_RECORD(lnk, state)	\
> +	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
> +
>   #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
>   #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
> 
> @@ -145,6 +190,7 @@ struct smc_link {
>   	int			ndev_ifidx; /* network device ifindex */
> 
>   	enum smc_link_state	state;		/* state of link */
> +	int			state_record;		/* record of previous state */
>   	struct delayed_work	llc_testlink_wrk; /* testlink worker */
>   	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>   	int			llc_testlink_time; /* testlink interval */
> @@ -557,6 +603,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>   int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
>   int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
> 
> +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
> +
>   static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
>   {
>   	return link->lgr;
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index 175026a..8134c15 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
>   		goto out;
>   out_clear_lnk:
>   	lnk_new->state = SMC_LNK_INACTIVE;
> +	smcr_lnk_cluster_on_lnk_state(lnk_new);
>   	smcr_link_clear(lnk_new, false);
>   out_reject:
>   	smc_llc_cli_add_link_reject(qentry);
> @@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
>   		return; /* no asymmetric link */
>   	if (!smc_link_downing(&lnk_asym->state))
>   		return;
> +	smcr_lnk_cluster_on_lnk_state(lnk_asym);
>   	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
>   	smc_wr_tx_wait_no_pending_sends(lnk_asym);
>   	if (!lnk_new)
> @@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
>   out_err:
>   	if (link_new) {
>   		link_new->state = SMC_LNK_INACTIVE;
> +		smcr_lnk_cluster_on_lnk_state(link_new);
>   		smcr_link_clear(link_new, false);
>   	}
>   out:
> @@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
>   	del_llc->reason = 0;
>   	smc_llc_send_message(lnk, &qentry->msg); /* response */
> 
> -	if (smc_link_downing(&lnk_del->state))
> +	if (smc_link_downing(&lnk_del->state)) {
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>   		smc_switch_conns(lgr, lnk_del, false);
> +	}
>   	smcr_link_clear(lnk_del, true);
> 
>   	active_links = smc_llc_active_link_count(lgr);
> @@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
>   		goto out; /* asymmetric link already deleted */
> 
>   	if (smc_link_downing(&lnk_del->state)) {
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>   		if (smc_switch_conns(lgr, lnk_del, false))
>   			smc_wr_tx_wait_no_pending_sends(lnk_del);
>   	}
> @@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
>   		schedule_delayed_work(&link->llc_testlink_wrk,
>   				      link->llc_testlink_time);
>   	}
> +	smcr_lnk_cluster_on_lnk_state(link);
>   }
> 
>   /* called in worker context */

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-16  9:35 ` Jan Karcher
@ 2022-08-16 12:40   ` Tony Lu
  2022-08-17  4:55   ` D. Wythe
  1 sibling, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16 12:40 UTC (permalink / raw)
  To: Jan Karcher
  Cc: D. Wythe, kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Tue, Aug 16, 2022 at 11:35:15AM +0200, Jan Karcher wrote:
> 
> 
> On 10.08.2022 19:47, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > 
> > This patch set attempts to optimize the parallelism of SMC-R connections,
> > mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> > occur after thoses optimization.
> > 
> 
> Thank you again for your submission!
> Let me give you a quick update from our side:
> We tested your patches on top of the net-next kernel on our s390 systems.
> They did crash our systems. After verifying our environment we pulled
> console logs, and now we can tell that there is indeed a problem with your
> patches regarding SMC-D. So please do not integrate this change as of right
> now. I'm going to do more in-depth reviews of your patches, but I need some
> time for them, so here is a quick description of the problem:
> 
> It is an SMC-D problem that occurs while building up the connection. In
> smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the SMC-R
> path you do grab the pointer; for SMC-D that never happens. Still you are
> using this reference for SMC-D => crash. This problem can be reproduced
> using the SMC-D path. Here is an example console output:

Got it.

> 
> [  779.516382] Unable to handle kernel pointer dereference in virtual kernel
> address space
> [  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
> [  779.516391] Fault in home space mode while using kernel ASCE.
> [  779.516395] AS:0000000069628007 R3:00000000ffbf0007 S:00000000ffbef800
> P:000000000000003d
> [  779.516431] Oops: 0004 ilc:2 [#1] SMP
> [  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib ib_uverbs
> mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
> nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
> 6 nf_defrag_ipv4 ip_set nf_tables n
> [  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted
> 5.19.0-13940-g22a46254655a #3
> [  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> 
> [  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
> [  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c
> (smc_conn_create+0x174/0x968 [smc])
> [  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> RI:0 EA:3
> [  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 0000000000000001
> 0000000000000000
> [  779.522773]            000000008a4128a0 000003ff803f21aa 000000008e30d640
> 0000000086d72000
> [  779.522776]            0000000086d72000 000000008a412803 000000008a412800
> 000000008e30d650
> [  779.522779]            0000000080934200 0000000000000000 000003ff803cb954
> 00000380002dfa88
> [  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg
> %r1,232(%r15)
> [  779.522789]            000003ff803da894: a7180000            lhi %r1,0
> [  779.522789]           #000003ff803da898: 582003ac            l %r2,940
> [  779.522789]           >000003ff803da89c: ba123020            cs
> %r1,%r2,32(%r3)
> [  779.522789]            000003ff803da8a0: ec1603be007e        cij
> %r1,0,6,000003ff803db01c
> 
> [  779.522789]            000003ff803da8a6: 4110b002            la
> %r1,2(%r11)
> [  779.522789]            000003ff803da8aa: e310f0f00024        stg
> %r1,240(%r15)
> [  779.522789]            000003ff803da8b0: e310f0c00004        lg
> %r1,192(%r15)
> [  779.522870] Call Trace:
> [  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
> [  779.522884]  [<000003ff803cb954>] smc_find_ism_v2_device_serv+0x1b4/0x300
> [smc]
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop
> from CPU 01.
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop
> from CPU 00.
> [  779.522894]  [<000003ff803cbace>] smc_listen_find_device+0x2e/0x370 [smc]
> 
> 
> I'm going to send the review for the first patch right away (which is the
> one causing the crash); so far I'm done with it. The others are going to
> follow. Maybe you can look over the problem and come up with a solution;
> otherwise we are going to decide whether we want to look into it ourselves
> as soon as I'm done with the reviews. Thank you for your patience.

Thanks for pointing out this issue. We will fix it soon in v2.

Tony Lu

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
  2022-08-16  9:43   ` Jan Karcher
@ 2022-08-16 12:47     ` Tony Lu
  0 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16 12:47 UTC (permalink / raw)
  To: Jan Karcher
  Cc: D. Wythe, kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Tue, Aug 16, 2022 at 11:43:23AM +0200, Jan Karcher wrote:
> 
> 
> On 10.08.2022 19:47, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > 
> > This patch attempts to remove locks named smc_client_lgr_pending and
> > smc_server_lgr_pending, which aim to serialize the creation of link
> > group. However, once the link group already exists, those locks are
> > meaningless; worse still, they make incoming connections have to be
> > queued one after the other.
> > 
> > Now, the creation of a link group is no longer driven by competition,
> > but allocated through the following strategy.
> > 
> > 1. Try to find a suitable link group; if successful, the current
> > connection is considered a NON first contact connection. Ends.
> > 
> > 2. Check the number of connections currently waiting for a suitable
> > link group to be created; if it is not less than the number of link
> > groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> > increase the number of link groups to be created, and the current
> > connection is considered the first contact connection. Ends.
> > 
> > 3. Increase the number of connections currently waiting, and wait
> > to be woken up.
> > 
> > 4. Decrease the number of connections currently waiting, goto 1.
> > 
> > We wake up the connection that was put to sleep in stage 3 through
> > the SMC link state change event. Once the link moves out of the
> > SMC_LNK_ACTIVATING state, decrease the number of link groups to
> > be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> > connections.
> > 
> > In the implementation, we introduce the concept of a lnk cluster, which
> > is a collection of links with the same characteristics (see
> > smcr_lnk_cluster_cmpfn() for more details), which makes it possible to
> > wake up efficiently in the N vs. 1 scenario.
> > 
> > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> > ---
> >   net/smc/af_smc.c   |  11 +-
> >   net/smc/smc_core.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >   net/smc/smc_core.h |  48 ++++++++
> >   net/smc/smc_llc.c  |   9 +-
> >   4 files changed, 411 insertions(+), 13 deletions(-)
> > 
> > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> > index 79c1318..af4b0aa 100644
> > --- a/net/smc/af_smc.c
> > +++ b/net/smc/af_smc.c
> > @@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
> >   	if (reason_code)
> >   		return reason_code;
> > 
> > -	mutex_lock(&smc_client_lgr_pending);
> >   	reason_code = smc_conn_create(smc, ini);
> >   	if (reason_code) {
> > -		mutex_unlock(&smc_client_lgr_pending);
> >   		return reason_code;
> >   	}
> > 
> > @@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
> >   		if (reason_code)
> >   			goto connect_abort;
> >   	}
> > -	mutex_unlock(&smc_client_lgr_pending);
> > 
> >   	smc_copy_sock_settings_to_clc(smc);
> >   	smc->connect_nonblock = 0;
> > @@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
> >   	return 0;
> >   connect_abort:
> >   	smc_conn_abort(smc, ini->first_contact_local);
> > -	mutex_unlock(&smc_client_lgr_pending);
> >   	smc->connect_nonblock = 0;
> > 
> >   	return reason_code;
> 
> 
> You are removing the locking mechanism from this function completely, which
> is fine because it is only called for an SMC-R connection.
> 
> 
> > @@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
> >   	if (rc)
> >   		goto out_decl;
> > 
> > -	mutex_lock(&smc_server_lgr_pending);
> > +	if (ini->is_smcd)
> > +		mutex_lock(&smc_server_lgr_pending);
> >   	smc_close_init(new_smc);
> >   	smc_rx_init(new_smc);
> >   	smc_tx_init(new_smc);
> > @@ -2415,7 +2412,6 @@ static void smc_listen_work(struct work_struct *work)
> >   					    ini->first_contact_local, ini);
> >   		if (rc)
> >   			goto out_unlock;
> > -		mutex_unlock(&smc_server_lgr_pending);
> >   	}
> >   	smc_conn_save_peer_info(new_smc, cclc);
> >   	smc_listen_out_connected(new_smc);
> > @@ -2423,7 +2419,8 @@ static void smc_listen_work(struct work_struct *work)
> >   	goto out_free;
> > 
> >   out_unlock:
> > -	mutex_unlock(&smc_server_lgr_pending);
> > +	if (ini->is_smcd)
> > +		mutex_unlock(&smc_server_lgr_pending);
> 
> 
> You want to remove the mutex lock for SMC-R, so you are only locking for an
> SMC-D connection. So far so good. I think you could also remove this unlock
> call, since it is only reached in the case of an SMC-R connection - which
> means it is obsolete:
> 
> l2398 ff. (with your patch on net-next)
> 
>     /* receive SMC Confirm CLC message */
>     memset(buf, 0, sizeof(*buf));
>     cclc = (struct smc_clc_msg_accept_confirm *)buf;
>     rc = smc_clc_wait_msg(new_smc, cclc, sizeof(*buf),
>                   SMC_CLC_CONFIRM, CLC_WAIT_TIME);
>     if (rc) {
> x        if (!ini->is_smcd)
> x            goto out_unlock;
>         goto out_decl;
>     }
> 
> >   out_decl:
> >   	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
> >   			   proposal_version);
> > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> > index ff49a11..a3338cc 100644
> > --- a/net/smc/smc_core.c
> > +++ b/net/smc/smc_core.c
> > @@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = {	/* established link groups */
> >   	.num = 0,
> >   };
> > 
> > +struct smc_lgr_manager smc_lgr_manager = {
> > +	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
> > +};
> > +
> >   static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
> >   static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
> > 
> > @@ -55,6 +59,282 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
> > 
> >   static void smc_link_down_work(struct work_struct *work);
> > 
> > +/* SMC-R lnk cluster compare func
> > + * All lnks that meet the description conditions of this function
> > + * are logically aggregated, called lnk cluster.
> > + * For the server side, lnk cluster is used to determine whether
> > + * a new group needs to be created when processing new incoming connections.
> > + * For the client side, lnk cluster is used to determine whether
> > + * to wait for link ready (in other words, first contact ready).
> > + */
> > +static int smcr_lnk_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
> > +{
> > +	const struct smc_lnk_cluster_compare_arg *key = arg->key;
> > +	const struct smc_lnk_cluster *lnkc = obj;
> > +
> > +	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
> > +		return 1;
> > +
> > +	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
> > +		return 1;
> > +
> > +	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
> > +	    (key->smcr_version == SMC_V2 ||
> > +	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
> > +		return 0;
> > +
> > +	return 1;
> > +}
> > +
> > +/* SMC-R lnk cluster hash func */
> > +static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
> > +{
> > +	const struct smc_lnk_cluster *lnkc = data;
> > +
> > +	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> > +		+ (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
> > +}
> > +
> > +/* SMC-R lnk cluster compare arg hash func */
> > +static u32 smcr_lnk_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
> > +{
> > +	const struct smc_lnk_cluster_compare_arg *key = data;
> > +
> > +	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> > +		+ (key->role == SMC_SERV) ? 0 : key->clcqpn;
> > +}
> > +
> > +static const struct rhashtable_params smcr_lnk_cluster_rhl_params = {
> > +	.head_offset = offsetof(struct smc_lnk_cluster, rnode),
> > +	.key_len = sizeof(struct smc_lnk_cluster_compare_arg),
> > +	.obj_cmpfn = smcr_lnk_cluster_cmpfn,
> > +	.obj_hashfn = smcr_lnk_cluster_hashfn,
> > +	.hashfn = smcr_lnk_cluster_compare_arg_hashfn,
> > +	.automatic_shrinking = true,
> > +};
> > +
> > +/* hold a reference for smc_lnk_cluster */
> > +static inline void smc_lnk_cluster_hold(struct smc_lnk_cluster *lnkc)
> > +{
> > +	if (likely(lnkc))
> > +		refcount_inc(&lnkc->ref);
> > +}
> > +
> > +/* release a reference for smc_lnk_cluster */
> > +static inline void smc_lnk_cluster_put(struct smc_lnk_cluster *lnkc)
> > +{
> > +	bool do_free = false;
> > +
> > +	if (!lnkc)
> > +		return;
> > +
> > +	if (refcount_dec_not_one(&lnkc->ref))
> > +		return;
> > +
> > +	spin_lock_bh(&smc_lgr_manager.lock);
> > +	/* last ref */
> > +	if (refcount_dec_and_test(&lnkc->ref)) {
> > +		do_free = true;
> > +		rhashtable_remove_fast(&smc_lgr_manager.lnk_cluster_maps, &lnkc->rnode,
> > +				       smcr_lnk_cluster_rhl_params);
> > +	}
> > +	spin_unlock_bh(&smc_lgr_manager.lock);
> > +	if (do_free)
> > +		kfree(lnkc);
> > +}
> > +
> > +/* Get or create smc_lnk_cluster by key
> > + * This function will hold a reference of returned smc_lnk_cluster
> > + * or create a new smc_lnk_cluster with the reference initialized to 1.
> > + * caller MUST call smc_lnk_cluster_put after this.
> > + */
> > +static inline struct smc_lnk_cluster *
> > +smcr_lnk_get_or_create_cluster(struct smc_lnk_cluster_compare_arg *key)
> > +{
> > +	struct smc_lnk_cluster *lnkc, *tmp_lnkc;
> > +	bool busy_retry;
> > +	int err;
> > +
> > +	/* serving a hardware or software interrupt, or preemption is disabled */
> > +	busy_retry = !in_interrupt();
> > +
> > +	spin_lock_bh(&smc_lgr_manager.lock);
> > +	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> > +				      smcr_lnk_cluster_rhl_params);
> > +	if (!lnkc) {
> > +		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
> > +		if (unlikely(!lnkc))
> > +			goto fail;
> > +
> > +		/* init cluster */
> > +		spin_lock_init(&lnkc->lock);
> > +		lnkc->role = key->role;
> > +		if (key->role == SMC_CLNT)
> > +			lnkc->clcqpn = key->clcqpn;
> > +		init_waitqueue_head(&lnkc->first_contact_waitqueue);
> > +		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
> > +		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
> > +		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
> > +		refcount_set(&lnkc->ref, 1);
> > +
> > +		do {
> > +			err = rhashtable_insert_fast(&smc_lgr_manager.lnk_cluster_maps,
> > +						     &lnkc->rnode, smcr_lnk_cluster_rhl_params);
> > +
> > +			/* success or fatal error */
> > +			if (err != -EBUSY)
> > +				break;
> > +
> > +			/* impossible in fact right now */
> > +			if (unlikely(!busy_retry)) {
> > +				pr_warn_ratelimited("smc: create lnk cluster in softirq\n");
> > +				break;
> > +			}
> > +
> > +			spin_unlock_bh(&smc_lgr_manager.lock);
> > +			/* yield */
> > +			cond_resched();
> > +			spin_lock_bh(&smc_lgr_manager.lock);
> > +
> > +			/* after spin_unlock_bh(), lnk_cluster_maps may be changed */
> > +			tmp_lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> > +							  smcr_lnk_cluster_rhl_params);
> > +
> > +			if (unlikely(tmp_lnkc)) {
> > +				pr_warn_ratelimited("smc: create cluster failed due to duplicate key");
> > +				kfree(lnkc);
> > +				lnkc = NULL;
> > +				goto fail;
> > +			}
> > +		} while (1);
> > +
> > +		if (unlikely(err)) {
> > +			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)", err);
> > +			kfree(lnkc);
> > +			lnkc = NULL;
> > +		}
> > +	} else {
> > +		smc_lnk_cluster_hold(lnkc);
> > +	}
> > +fail:
> > +	spin_unlock_bh(&smc_lgr_manager.lock);
> > +	return lnkc;
> > +}
> > +
> > +/* Get or create a smc_lnk_cluster by lnk
> > + * caller MUST call smc_lnk_cluster_put after this.
> > + */
> > +static inline struct smc_lnk_cluster *smcr_lnk_get_cluster(struct smc_link *lnk)
> > +{
> > +	struct smc_lnk_cluster_compare_arg key;
> > +	struct smc_link_group *lgr;
> > +
> > +	lgr = lnk->lgr;
> > +	if (!lgr || lgr->is_smcd)
> > +		return NULL;
> > +
> > +	key.smcr_version = lgr->smc_version;
> > +	key.peer_systemid = lgr->peer_systemid;
> > +	key.peer_gid = lnk->peer_gid;
> > +	key.peer_mac = lnk->peer_mac;
> > +	key.role	 = lgr->role;
> > +	if (key.role == SMC_CLNT)
> > +		key.clcqpn = lnk->peer_qpn;
> > +
> > +	return smcr_lnk_get_or_create_cluster(&key);
> > +}
> > +
> > +/* Get or create a smc_lnk_cluster by ini
> > + * caller MUST call smc_lnk_cluster_put after this.
> > + */
> > +static inline struct smc_lnk_cluster *
> > +smcr_lnk_get_cluster_by_ini(struct smc_init_info *ini, int role)
> > +{
> > +	struct smc_lnk_cluster_compare_arg key;
> > +
> > +	if (ini->is_smcd)
> > +		return NULL;
> > +
> > +	key.smcr_version = ini->smcr_version;
> > +	key.peer_systemid = ini->peer_systemid;
> > +	key.peer_gid = ini->peer_gid;
> > +	key.peer_mac = ini->peer_mac;
> > +	key.role	= role;
> > +	if (role == SMC_CLNT)
> > +		key.clcqpn	= ini->ib_clcqpn;
> > +
> > +	return smcr_lnk_get_or_create_cluster(&key);
> > +}
> > +
> > +/* callback when smc link state change */
> > +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk)
> > +{
> > +	struct smc_lnk_cluster *lnkc;
> > +	int nr = 0;
> > +
> > +	/* barrier for lnk->state */
> > +	smp_mb();
> > +
> > +	/* only the first link can make connections block on
> > +	 * first_contact_waitqueue
> > +	 */
> > +	if (lnk->link_idx != SMC_SINGLE_LINK)
> > +		return;
> > +
> > +	/* state already seen  */
> > +	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
> > +		return;
> > +
> > +	lnkc = smcr_lnk_get_cluster(lnk);
> > +
> > +	if (unlikely(!lnkc))
> > +		return;
> > +
> > +	spin_lock_bh(&lnkc->lock);
> > +
> > +	/* all lnk state change should be
> > +	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DWON (link init failed)
> 
> Should this really be DWON and not DOWN?
> 
> > +	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DWON
> > +	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> > +	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> > +	 * 5. SMC_LNK_UNUSED -> SMC_LNK_ATIVATING -> SMC_LNK_ACTIVE ->SMC_LNK_INACTIVE
> > +	 * -> SMC_LNK_TEAR_DWON
> > +	 */
> > +	switch (lnk->state) {
> > +	case SMC_LNK_ACTIVATING:
> > +		/* It's safe to hold a reference without lock
> > +	 * due to smcr_lnk_get_cluster already holding one
> > +		 */
> > +		smc_lnk_cluster_hold(lnkc);
> > +		break;
> > +	case SMC_LNK_TEAR_DWON:
> > +		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
> > +			/* smc_lnk_cluster_hold in SMC_LNK_ACTIVATING */
> > +			smc_lnk_cluster_put(lnkc);
> > +		fallthrough;
> > +	case SMC_LNK_ACTIVE:
> > +	case SMC_LNK_INACTIVE:
> > +		if (!(lnk->state_record &
> > +			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
> > +			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
> > +			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
> > +			/* TODO: wakeup just one to perform first contact
> > +			 * if record state has no SMC_LNK_ACTIVE
> > +			 */
> 
> 
> Please resolve this TODO within the patch set instead of leaving it as a comment.
> 
> > +			nr = SMC_RMBS_PER_LGR_MAX - 1;
> > +		}
> > +		break;
> > +	case SMC_LNK_UNUSED:
> > +		pr_warn_ratelimited("net/smc: invalid lnk state. ");
> > +		break;
> > +	}
> > +	SMC_LNK_STATE_RECORD(lnk, lnk->state);
> > +	spin_unlock_bh(&lnkc->lock);
> > +	if (nr)
> > +		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
> > +	smc_lnk_cluster_put(lnkc);	/* smc_lnk_cluster_hold in smcr_lnk_get_cluster */
> > +}
> > +
> >   /* return head of link group list and its lock for a given link group */
> >   static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
> >   						  spinlock_t **lgr_lock)
> > @@ -651,8 +931,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
> >   	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
> >   		struct smc_link *lnk = &lgr->lnk[i];
> > 
> > -		if (smc_link_sendable(lnk))
> > +		if (smc_link_sendable(lnk)) {
> >   			lnk->state = SMC_LNK_INACTIVE;
> > +			smcr_lnk_cluster_on_lnk_state(lnk);
> > +		}
> >   	}
> >   	wake_up_all(&lgr->llc_msg_waiter);
> >   	wake_up_all(&lgr->llc_flow_waiter);
> > @@ -762,6 +1044,9 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >   	atomic_set(&lnk->conn_cnt, 0);
> >   	smc_llc_link_set_uid(lnk);
> >   	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
> > +	lnk->peer_qpn = ini->ib_clcqpn;
> > +	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
> > +	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
> >   	if (!lnk->smcibdev->initialized) {
> >   		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
> >   		if (rc)
> > @@ -792,6 +1077,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >   	if (rc)
> >   		goto destroy_qp;
> >   	lnk->state = SMC_LNK_ACTIVATING;
> > +	smcr_lnk_cluster_on_lnk_state(lnk);
> >   	return 0;
> > 
> >   destroy_qp:
> > @@ -806,6 +1092,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> >   	smc_ibdev_cnt_dec(lnk);
> >   	put_device(&lnk->smcibdev->ibdev->dev);
> >   	smcibdev = lnk->smcibdev;
> > +	lnk->state = SMC_LNK_TEAR_DWON;
> > +	smcr_lnk_cluster_on_lnk_state(lnk);
> >   	memset(lnk, 0, sizeof(struct smc_link));
> >   	lnk->state = SMC_LNK_UNUSED;
> >   	if (!atomic_dec_return(&smcibdev->lnk_cnt))
> > @@ -1263,6 +1551,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
> >   	if (!lnk->lgr || lnk->clearing ||
> >   	    lnk->state == SMC_LNK_UNUSED)
> >   		return;
> > +	lnk->state = SMC_LNK_TEAR_DWON;
> > +	smcr_lnk_cluster_on_lnk_state(lnk);
> >   	lnk->clearing = 1;
> >   	lnk->peer_qpn = 0;
> >   	smc_llc_link_clear(lnk, log);
> > @@ -1712,6 +2002,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
> >   {
> >   	if (smc_link_downing(&lnk->state)) {
> >   		trace_smcr_link_down(lnk, __builtin_return_address(0));
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		smcr_link_down(lnk);
> >   	}
> >   }
> > @@ -1721,6 +2012,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
> >   {
> >   	if (smc_link_downing(&lnk->state)) {
> >   		trace_smcr_link_down(lnk, __builtin_return_address(0));
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		schedule_work(&lnk->link_down_wrk);
> >   	}
> >   }
> > @@ -1850,11 +2142,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   {
> >   	struct smc_connection *conn = &smc->conn;
> >   	struct net *net = sock_net(&smc->sk);
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	struct smc_lnk_cluster *lnkc = NULL;
> 
> lnkc is declared as NULL here.
> 
> >   	struct list_head *lgr_list;
> >   	struct smc_link_group *lgr;
> >   	enum smc_lgr_role role;
> >   	spinlock_t *lgr_lock;
> > -	int rc = 0;
> > +	int rc = 0, timeo = CLC_WAIT_TIME;
> > 
> >   	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
> >   				  &smc_lgr_list.list;
> > @@ -1862,12 +2156,26 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   				  &smc_lgr_list.lock;
> >   	ini->first_contact_local = 1;
> >   	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
> > -	if (role == SMC_CLNT && ini->first_contact_peer)
> > +
> > +	if (!ini->is_smcd) {
> > +		lnkc = smcr_lnk_get_cluster_by_ini(ini, role);
> 
> Here lnkc is only set if the connection is SMC-R.
> 
> > +		if (unlikely(!lnkc))
> > +			return SMC_CLC_DECL_INTERR;
> > +	}
> > +
> > +	if (role == SMC_CLNT && ini->first_contact_peer) {
> > +		/* first_contact */
> > +		spin_lock_bh(&lnkc->lock);
> 
> And here SMC-D dies because of the NULL address. This kills our systems if
> we try to talk via SMC-D.

Got it, thanks.
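
For reference, a minimal sketch of the guard we have in mind (this is
illustrative only and assumes the next revision keeps the current
structure and names of smc_conn_create, not the final fix): the SMC-D
path must simply never touch lnkc:

	if (role == SMC_CLNT && ini->first_contact_peer) {
		if (lnkc) {	/* lnkc is NULL for SMC-D */
			spin_lock_bh(&lnkc->lock);
			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
			spin_unlock_bh(&lnkc->lock);
		}
		/* create new link group as well */
		goto create;
	}

The same NULL check is needed around the later spin_lock(&lnkc->lock)
taken under lgr_lock before the link group reuse loop.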

> 
> [  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
> [  779.516391] Fault in home space mode while using kernel ASCE.
> [  779.516395] AS:0000000069628007 R3:00000000ffbf0007 S:00000000ffbef800
> P:000000000000003d
> [  779.516431] Oops: 0004 ilc:2 [#1] SMP
> [  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib ib_uverbs
> mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
> nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
> 6 nf_defrag_ipv4 ip_set nf_tables n
> [  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted
> 5.19.0-13940-g22a46254655a #3
> [  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> 
> [  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
> [  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c
> (smc_conn_create+0x174/0x968 [smc])
> [  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0
> RI:0 EA:3
> [  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 0000000000000001
> 0000000000000000
> [  779.522773]            000000008a4128a0 000003ff803f21aa 000000008e30d640
> 0000000086d72000
> [  779.522776]            0000000086d72000 000000008a412803 000000008a412800
> 000000008e30d650
> [  779.522779]            0000000080934200 0000000000000000 000003ff803cb954
> 00000380002dfa88
> [  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg
> %r1,232(%r15)
> [  779.522789]            000003ff803da894: a7180000            lhi %r1,0
> [  779.522789]           #000003ff803da898: 582003ac            l %r2,940
> [  779.522789]           >000003ff803da89c: ba123020            cs
> %r1,%r2,32(%r3)
> [  779.522789]            000003ff803da8a0: ec1603be007e        cij
> %r1,0,6,000003ff803db01c
> 
> [  779.522789]            000003ff803da8a6: 4110b002            la
> %r1,2(%r11)
> [  779.522789]            000003ff803da8aa: e310f0f00024        stg
> %r1,240(%r15)
> [  779.522789]            000003ff803da8b0: e310f0c00004        lg
> %r1,192(%r15)
> [  779.522870] Call Trace:
> [  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
> [  779.522884]  [<000003ff803cb954>] smc_find_ism_v2_device_serv+0x1b4/0x300
> [smc]
> 

D. Wythe is on vacation now. We will fix it soon.

Thanks again for your reviews. 

Cheers,
Tony Lu

> > +		lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> > +		spin_unlock_bh(&lnkc->lock);
> >   		/* create new link group as well */
> >   		goto create;
> > +	}
> > 
> >   	/* determine if an existing link group can be reused */
> >   	spin_lock_bh(lgr_lock);
> > +	spin_lock(&lnkc->lock);
> > +again:
> >   	list_for_each_entry(lgr, lgr_list, list) {
> >   		write_lock_bh(&lgr->conns_lock);
> >   		if ((ini->is_smcd ?
> > @@ -1894,9 +2202,33 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   		}
> >   		write_unlock_bh(&lgr->conns_lock);
> >   	}
> > +	if (lnkc && ini->first_contact_local) {
> > +		if (lnkc->pending_capability > lnkc->conns_pending) {
> > +			lnkc->conns_pending++;
> > +			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> > +			spin_unlock(&lnkc->lock);
> > +			spin_unlock_bh(lgr_lock);
> > +			set_current_state(TASK_INTERRUPTIBLE);
> > +			/* need to wait at least once for first contact to complete */
> > +			timeo = schedule_timeout(timeo);
> > +			set_current_state(TASK_RUNNING);
> > +			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> > +			spin_lock_bh(lgr_lock);
> > +			spin_lock(&lnkc->lock);
> > +
> > +			lnkc->conns_pending--;
> > +			if (timeo)
> > +				goto again;
> > +		}
> > +		if (role == SMC_SERV) {
> > +			/* first_contact */
> > +			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> > +		}
> > +	}
> > +	spin_unlock(&lnkc->lock);
> >   	spin_unlock_bh(lgr_lock);
> >   	if (rc)
> > -		return rc;
> > +		goto out;
> > 
> >   	if (role == SMC_CLNT && !ini->first_contact_peer &&
> >   	    ini->first_contact_local) {
> > @@ -1904,7 +2236,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   		 * a new one
> >   		 * send out_of_sync decline, reason synchr. error
> >   		 */
> > -		return SMC_CLC_DECL_SYNCERR;
> > +		rc = SMC_CLC_DECL_SYNCERR;
> > +		goto out;
> >   	}
> > 
> >   create:
> > @@ -1941,6 +2274,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
> >   #endif
> > 
> >   out:
> > +	/* smc_lnk_cluster_hold in smcr_lnk_get_or_create_cluster */
> > +	smc_lnk_cluster_put(lnkc);
> >   	return rc;
> >   }
> > 
> > @@ -2599,12 +2934,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
> > 
> >   int __init smc_core_init(void)
> >   {
> > +	/* init smc lnk cluster maps */
> > +	rhashtable_init(&smc_lgr_manager.lnk_cluster_maps, &smcr_lnk_cluster_rhl_params);
> >   	return register_reboot_notifier(&smc_reboot_notifier);
> >   }
> > 
> > +static void smc_lnk_cluster_free_cb(void *ptr, void *arg)
> > +{
> > +	pr_warn("smc: smc lnk cluster refcnt leak.\n");
> > +	kfree(ptr);
> > +}
> > +
> >   /* Called (from smc_exit) when module is removed */
> >   void smc_core_exit(void)
> >   {
> >   	unregister_reboot_notifier(&smc_reboot_notifier);
> >   	smc_lgrs_shutdown();
> > +	/* destroy smc lnk cluster maps */
> > +	rhashtable_free_and_destroy(&smc_lgr_manager.lnk_cluster_maps, smc_lnk_cluster_free_cb,
> > +				    NULL);
> >   }
> > diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> > index fe8b524..199f533 100644
> > --- a/net/smc/smc_core.h
> > +++ b/net/smc/smc_core.h
> > @@ -15,6 +15,7 @@
> >   #include <linux/atomic.h>
> >   #include <linux/smc.h>
> >   #include <linux/pci.h>
> > +#include <linux/rhashtable.h>
> >   #include <rdma/ib_verbs.h>
> >   #include <net/genetlink.h>
> > 
> > @@ -29,18 +30,62 @@ struct smc_lgr_list {			/* list of link group definition */
> >   	u32			num;	/* unique link group number */
> >   };
> > 
> > +struct smc_lgr_manager {		/* manager for link group */
> > +	struct rhashtable	lnk_cluster_maps;	/* maps of smc_lnk_cluster */
> > +	spinlock_t		lock;	/* lock for lgr_cm_maps */
> > +};
> > +
> > +struct smc_lnk_cluster {
> > +	struct rhash_head	rnode;	/* node for rhashtable */
> > +	struct wait_queue_head	first_contact_waitqueue;
> > +					/* queue for non first contact to wait
> > +					 * first contact to be established.
> > +					 */
> > +	spinlock_t		lock;	/* protection for link group */
> > +	refcount_t		ref;	/* refcount for cluster */
> > +	unsigned long		pending_capability;
> > +					/* maximum pending number of connections that
> > +					 * need wait first contact complete.
> > +					 */
> > +	unsigned long		conns_pending;
> > +					/* connections that are waiting for first contact
> > +					 * complete
> > +					 */
> > +	u8		peer_systemid[SMC_SYSTEMID_LEN];
> > +	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
> > +	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer*/
> > +	int		clcqpn;
> > +	int		role;
> > +};
> > +
> >   enum smc_lgr_role {		/* possible roles of a link group */
> >   	SMC_CLNT,	/* client */
> >   	SMC_SERV	/* server */
> >   };
> > 
> > +struct smc_lnk_cluster_compare_arg	/* key for smc_lnk_cluster */
> > +{
> > +	int	smcr_version;
> > +	enum smc_lgr_role role;
> > +	u8	*peer_systemid;
> > +	u8	*peer_gid;
> > +	u8	*peer_mac;
> > +	int clcqpn;
> > +};
> > +
> >   enum smc_link_state {			/* possible states of a link */
> >   	SMC_LNK_UNUSED,		/* link is unused */
> >   	SMC_LNK_INACTIVE,	/* link is inactive */
> >   	SMC_LNK_ACTIVATING,	/* link is being activated */
> >   	SMC_LNK_ACTIVE,		/* link is active */
> > +	SMC_LNK_TEAR_DWON,	/* link is tear down */
> >   };
> > 
> > +#define SMC_LNK_STATE_BIT(state)	(1 << (state))
> > +
> > +#define	SMC_LNK_STATE_RECORD(lnk, state)	\
> > +	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
> > +
> >   #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
> >   #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
> > 
> > @@ -145,6 +190,7 @@ struct smc_link {
> >   	int			ndev_ifidx; /* network device ifindex */
> > 
> >   	enum smc_link_state	state;		/* state of link */
> > +	int			state_record;		/* record of previous state */
> >   	struct delayed_work	llc_testlink_wrk; /* testlink worker */
> >   	struct completion	llc_testlink_resp; /* wait for rx of testlink */
> >   	int			llc_testlink_time; /* testlink interval */
> > @@ -557,6 +603,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
> >   int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
> >   int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
> > 
> > +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
> > +
> >   static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
> >   {
> >   	return link->lgr;
> > diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> > index 175026a..8134c15 100644
> > --- a/net/smc/smc_llc.c
> > +++ b/net/smc/smc_llc.c
> > @@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
> >   		goto out;
> >   out_clear_lnk:
> >   	lnk_new->state = SMC_LNK_INACTIVE;
> > +	smcr_lnk_cluster_on_lnk_state(lnk_new);
> >   	smcr_link_clear(lnk_new, false);
> >   out_reject:
> >   	smc_llc_cli_add_link_reject(qentry);
> > @@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
> >   		return; /* no asymmetric link */
> >   	if (!smc_link_downing(&lnk_asym->state))
> >   		return;
> > +	smcr_lnk_cluster_on_lnk_state(lnk_asym);
> >   	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
> >   	smc_wr_tx_wait_no_pending_sends(lnk_asym);
> >   	if (!lnk_new)
> > @@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
> >   out_err:
> >   	if (link_new) {
> >   		link_new->state = SMC_LNK_INACTIVE;
> > +		smcr_lnk_cluster_on_lnk_state(link_new);
> >   		smcr_link_clear(link_new, false);
> >   	}
> >   out:
> > @@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
> >   	del_llc->reason = 0;
> >   	smc_llc_send_message(lnk, &qentry->msg); /* response */
> > 
> > -	if (smc_link_downing(&lnk_del->state))
> > +	if (smc_link_downing(&lnk_del->state)) {
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		smc_switch_conns(lgr, lnk_del, false);
> > +	}
> >   	smcr_link_clear(lnk_del, true);
> > 
> >   	active_links = smc_llc_active_link_count(lgr);
> > @@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
> >   		goto out; /* asymmetric link already deleted */
> > 
> >   	if (smc_link_downing(&lnk_del->state)) {
> > +		smcr_lnk_cluster_on_lnk_state(lnk);
> >   		if (smc_switch_conns(lgr, lnk_del, false))
> >   			smc_wr_tx_wait_no_pending_sends(lnk_del);
> >   	}
> > @@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
> >   		schedule_delayed_work(&link->llc_testlink_wrk,
> >   				      link->llc_testlink_time);
> >   	}
> > +	smcr_lnk_cluster_on_lnk_state(link);
> >   }
> > 
> >   /* called in worker context */

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
  2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
                     ` (2 preceding siblings ...)
  2022-08-16  9:43   ` Jan Karcher
@ 2022-08-16 12:52   ` Tony Lu
  3 siblings, 0 replies; 29+ messages in thread
From: Tony Lu @ 2022-08-16 12:52 UTC (permalink / raw)
  To: D. Wythe; +Cc: kgraul, wenjia, kuba, davem, netdev, linux-s390, linux-rdma

On Thu, Aug 11, 2022 at 01:47:32AM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> This patch attempts to remove the locks named smc_client_lgr_pending and
> smc_server_lgr_pending, which aim to serialize the creation of link
> groups. However, once a link group already exists, those locks are
> meaningless; worse still, they force incoming connections to be
> queued one after the other.
> 
> Now the creation of a link group is no longer driven by competition,
> but is handled through the following strategy.
> 
> 1. Try to find a suitable link group; if successful, the current connection
> is considered a NON first contact connection. End.
> 
> 2. Check the number of connections currently waiting for a suitable
> link group to be created; if it is not less than the number of link
> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> increase the number of link groups to be created. The current connection
> is considered the first contact connection. End.
> 
> 3. Increase the number of connections currently waiting, and wait
> to be woken up.
> 
> 4. Decrease the number of connections currently waiting, goto 1.
> 
> We wake up the connections that were put to sleep in step 3 via the
> SMC link state change event. Once the link moves out of the
> SMC_LNK_ACTIVATING state, we decrease the number of link groups to
> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> connections.
> 
> In the implementation, we introduce the concept of a lnk cluster, which is
> a collection of links with the same characteristics (see
> smcr_lnk_cluster_cmpfn() for more details); this makes it possible to
> wake up waiters efficiently in an N vs. 1 scenario.
> 
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  net/smc/af_smc.c   |  11 +-
>  net/smc/smc_core.c | 356 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  net/smc/smc_core.h |  48 ++++++++
>  net/smc/smc_llc.c  |   9 +-
>  4 files changed, 411 insertions(+), 13 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 79c1318..af4b0aa 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
>  	if (reason_code)
>  		return reason_code;
>  
> -	mutex_lock(&smc_client_lgr_pending);
>  	reason_code = smc_conn_create(smc, ini);
>  	if (reason_code) {
> -		mutex_unlock(&smc_client_lgr_pending);
>  		return reason_code;
>  	}
>  
> @@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
>  		if (reason_code)
>  			goto connect_abort;
>  	}
> -	mutex_unlock(&smc_client_lgr_pending);
>  
>  	smc_copy_sock_settings_to_clc(smc);
>  	smc->connect_nonblock = 0;
> @@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
>  	return 0;
>  connect_abort:
>  	smc_conn_abort(smc, ini->first_contact_local);
> -	mutex_unlock(&smc_client_lgr_pending);
>  	smc->connect_nonblock = 0;
>  
>  	return reason_code;
> @@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
>  	if (rc)
>  		goto out_decl;
>  
> -	mutex_lock(&smc_server_lgr_pending);
> +	if (ini->is_smcd)
> +		mutex_lock(&smc_server_lgr_pending);
>  	smc_close_init(new_smc);
>  	smc_rx_init(new_smc);
>  	smc_tx_init(new_smc);
> @@ -2415,7 +2412,6 @@ static void smc_listen_work(struct work_struct *work)
>  					    ini->first_contact_local, ini);
>  		if (rc)
>  			goto out_unlock;
> -		mutex_unlock(&smc_server_lgr_pending);
>  	}
>  	smc_conn_save_peer_info(new_smc, cclc);
>  	smc_listen_out_connected(new_smc);
> @@ -2423,7 +2419,8 @@ static void smc_listen_work(struct work_struct *work)
>  	goto out_free;
>  
>  out_unlock:
> -	mutex_unlock(&smc_server_lgr_pending);
> +	if (ini->is_smcd)
> +		mutex_unlock(&smc_server_lgr_pending);
>  out_decl:
>  	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
>  			   proposal_version);
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index ff49a11..a3338cc 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = {	/* established link groups */
>  	.num = 0,
>  };
>  
> +struct smc_lgr_manager smc_lgr_manager = {
> +	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
> +};
> +
>  static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
>  static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
>  
> @@ -55,6 +59,282 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
>  
>  static void smc_link_down_work(struct work_struct *work);
>  
> +/* SMC-R lnk cluster compare func
> + * All lnks that meet the description conditions of this function
> + * are logically aggregated, called lnk cluster.
> + * For the server side, lnk cluster is used to determine whether
> + * a new group needs to be created when processing new incoming connections.
> + * For the client side, lnk cluster is used to determine whether
> + * to wait for link ready (in other words, first contact ready).
> + */
> +static int smcr_lnk_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
> +{
> +	const struct smc_lnk_cluster_compare_arg *key = arg->key;
> +	const struct smc_lnk_cluster *lnkc = obj;
> +
> +	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
> +		return 1;
> +
> +	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
> +		return 1;
> +
> +	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
> +	    (key->smcr_version == SMC_V2 ||
> +	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
> +		return 0;
> +
> +	return 1;
> +}

Also, SMC prefers to use *link* in function names and struct names. It's
okay to use *lnk* for local variables.
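
For instance, the compare function above would then read as follows (an
illustrative sketch only; smc_link_cluster and
smc_link_cluster_compare_arg are the hypothetical renamed types):

	static int smcr_link_cluster_cmpfn(struct rhashtable_compare_arg *arg,
					   const void *obj)
	{
		const struct smc_link_cluster_compare_arg *key = arg->key;
		const struct smc_link_cluster *lnkc = obj;	/* local keeps *lnk* */

		if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
			return 1;

		if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
			return 1;

		if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
		    (key->smcr_version == SMC_V2 ||
		    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
			return 0;

		return 1;
	}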

> +
> +/* SMC-R lnk cluster hash func */
> +static u32 smcr_lnk_cluster_hashfn(const void *data, u32 len, u32 seed)
> +{
> +	const struct smc_lnk_cluster *lnkc = data;
> +
> +	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> +		+ (lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn;
> +}
> +
> +/* SMC-R lnk cluster compare arg hash func */
> +static u32 smcr_lnk_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
> +{
> +	const struct smc_lnk_cluster_compare_arg *key = data;
> +
> +	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
> +		+ (key->role == SMC_SERV) ? 0 : key->clcqpn;
> +}
> +
> +static const struct rhashtable_params smcr_lnk_cluster_rhl_params = {
> +	.head_offset = offsetof(struct smc_lnk_cluster, rnode),
> +	.key_len = sizeof(struct smc_lnk_cluster_compare_arg),
> +	.obj_cmpfn = smcr_lnk_cluster_cmpfn,
> +	.obj_hashfn = smcr_lnk_cluster_hashfn,
> +	.hashfn = smcr_lnk_cluster_compare_arg_hashfn,
> +	.automatic_shrinking = true,
> +};
> +
> +/* hold a reference for smc_lnk_cluster */
> +static inline void smc_lnk_cluster_hold(struct smc_lnk_cluster *lnkc)
> +{
> +	if (likely(lnkc))
> +		refcount_inc(&lnkc->ref);
> +}
> +
> +/* release a reference for smc_lnk_cluster */
> +static inline void smc_lnk_cluster_put(struct smc_lnk_cluster *lnkc)
> +{
> +	bool do_free = false;
> +
> +	if (!lnkc)
> +		return;
> +
> +	if (refcount_dec_not_one(&lnkc->ref))
> +		return;
> +
> +	spin_lock_bh(&smc_lgr_manager.lock);
> +	/* last ref */
> +	if (refcount_dec_and_test(&lnkc->ref)) {
> +		do_free = true;
> +		rhashtable_remove_fast(&smc_lgr_manager.lnk_cluster_maps, &lnkc->rnode,
> +				       smcr_lnk_cluster_rhl_params);
> +	}
> +	spin_unlock_bh(&smc_lgr_manager.lock);
> +	if (do_free)
> +		kfree(lnkc);
> +}
> +
> +/* Get or create smc_lnk_cluster by key
> + * This function will hold a reference of returned smc_lnk_cluster
> + * or create a new smc_lnk_cluster with the reference initialized to 1.
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *
> +smcr_lnk_get_or_create_cluster(struct smc_lnk_cluster_compare_arg *key)
> +{
> +	struct smc_lnk_cluster *lnkc, *tmp_lnkc;
> +	bool busy_retry;
> +	int err;
> +
> +	/* serving a hardware or software interrupt, or preemption is disabled */
> +	busy_retry = !in_interrupt();
> +
> +	spin_lock_bh(&smc_lgr_manager.lock);
> +	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> +				      smcr_lnk_cluster_rhl_params);
> +	if (!lnkc) {
> +		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
> +		if (unlikely(!lnkc))
> +			goto fail;
> +
> +		/* init cluster */
> +		spin_lock_init(&lnkc->lock);
> +		lnkc->role = key->role;
> +		if (key->role == SMC_CLNT)
> +			lnkc->clcqpn = key->clcqpn;
> +		init_waitqueue_head(&lnkc->first_contact_waitqueue);
> +		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
> +		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
> +		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
> +		refcount_set(&lnkc->ref, 1);
> +
> +		do {
> +			err = rhashtable_insert_fast(&smc_lgr_manager.lnk_cluster_maps,
> +						     &lnkc->rnode, smcr_lnk_cluster_rhl_params);
> +
> +			/* success or fatal error */
> +			if (err != -EBUSY)
> +				break;
> +
> +			/* impossible in fact right now */
> +			if (unlikely(!busy_retry)) {
> +				pr_warn_ratelimited("smc: create lnk cluster in softirq\n");
> +				break;
> +			}
> +
> +			spin_unlock_bh(&smc_lgr_manager.lock);
> +			/* yield */
> +			cond_resched();
> +			spin_lock_bh(&smc_lgr_manager.lock);
> +
> +			/* after spin_unlock_bh(), lnk_cluster_maps may be changed */
> +			tmp_lnkc = rhashtable_lookup_fast(&smc_lgr_manager.lnk_cluster_maps, key,
> +							  smcr_lnk_cluster_rhl_params);
> +
> +			if (unlikely(tmp_lnkc)) {
> +				pr_warn_ratelimited("smc: create cluster failed due to duplicate key");
> +				kfree(lnkc);
> +				lnkc = NULL;
> +				goto fail;
> +			}
> +		} while (1);
> +
> +		if (unlikely(err)) {
> +			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)", err);
> +			kfree(lnkc);
> +			lnkc = NULL;
> +		}
> +	} else {
> +		smc_lnk_cluster_hold(lnkc);
> +	}
> +fail:
> +	spin_unlock_bh(&smc_lgr_manager.lock);
> +	return lnkc;
> +}
> +
> +/* Get or create a smc_lnk_cluster by lnk
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *smcr_lnk_get_cluster(struct smc_link *lnk)
> +{
> +	struct smc_lnk_cluster_compare_arg key;
> +	struct smc_link_group *lgr;
> +
> +	lgr = lnk->lgr;
> +	if (!lgr || lgr->is_smcd)
> +		return NULL;
> +
> +	key.smcr_version = lgr->smc_version;
> +	key.peer_systemid = lgr->peer_systemid;
> +	key.peer_gid = lnk->peer_gid;
> +	key.peer_mac = lnk->peer_mac;
> +	key.role	 = lgr->role;
> +	if (key.role == SMC_CLNT)
> +		key.clcqpn = lnk->peer_qpn;
> +
> +	return smcr_lnk_get_or_create_cluster(&key);
> +}
> +
> +/* Get or create a smc_lnk_cluster by ini
> + * caller MUST call smc_lnk_cluster_put after this.
> + */
> +static inline struct smc_lnk_cluster *
> +smcr_lnk_get_cluster_by_ini(struct smc_init_info *ini, int role)
> +{
> +	struct smc_lnk_cluster_compare_arg key;
> +
> +	if (ini->is_smcd)
> +		return NULL;
> +
> +	key.smcr_version = ini->smcr_version;
> +	key.peer_systemid = ini->peer_systemid;
> +	key.peer_gid = ini->peer_gid;
> +	key.peer_mac = ini->peer_mac;
> +	key.role	= role;
> +	if (role == SMC_CLNT)
> +		key.clcqpn	= ini->ib_clcqpn;
> +
> +	return smcr_lnk_get_or_create_cluster(&key);
> +}
> +
> +/* callback when smc link state change */
> +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk)
> +{
> +	struct smc_lnk_cluster *lnkc;
> +	int nr = 0;
> +
> +	/* barrier for lnk->state */
> +	smp_mb();
> +
> +	/* only the first link can make connections block on
> +	 * first_contact_waitqueue
> +	 */
> +	if (lnk->link_idx != SMC_SINGLE_LINK)
> +		return;
> +
> +	/* state already seen  */
> +	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
> +		return;
> +
> +	lnkc = smcr_lnk_get_cluster(lnk);
> +
> +	if (unlikely(!lnkc))
> +		return;
> +
> +	spin_lock_bh(&lnkc->lock);
> +
> +	/* all lnk state change should be
> +	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DWON (link init failed)
> +	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DWON
> +	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> +	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DWON
> +	 * 5. SMC_LNK_UNUSED -> SMC_LNK_ATIVATING -> SMC_LNK_ACTIVE ->SMC_LNK_INACTIVE
> +	 * -> SMC_LNK_TEAR_DWON
> +	 */
> +	switch (lnk->state) {
> +	case SMC_LNK_ACTIVATING:
> +		/* It's safe to hold a reference without lock
> +		 * due to smcr_lnk_get_cluster already holding one
> +		 */
> +		smc_lnk_cluster_hold(lnkc);
> +		break;
> +	case SMC_LNK_TEAR_DWON:
> +		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
> +			/* smc_lnk_cluster_hold in SMC_LNK_ACTIVATING */
> +			smc_lnk_cluster_put(lnkc);
> +		fallthrough;
> +	case SMC_LNK_ACTIVE:
> +	case SMC_LNK_INACTIVE:
> +		if (!(lnk->state_record &
> +			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
> +			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
> +			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
> +			/* TODO: wakeup just one to perform first contact
> +			 * if record state has no SMC_LNK_ACTIVE
> +			 */
> +			nr = SMC_RMBS_PER_LGR_MAX - 1;
> +		}
> +		break;
> +	case SMC_LNK_UNUSED:
> +		pr_warn_ratelimited("net/smc: invalid lnk state. ");
> +		break;
> +	}
> +	SMC_LNK_STATE_RECORD(lnk, lnk->state);
> +	spin_unlock_bh(&lnkc->lock);
> +	if (nr)
> +		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
> +	smc_lnk_cluster_put(lnkc);	/* smc_lnk_cluster_hold in smcr_lnk_get_cluster */
> +}
> +
>  /* return head of link group list and its lock for a given link group */
>  static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
>  						  spinlock_t **lgr_lock)
> @@ -651,8 +931,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
>  	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
>  		struct smc_link *lnk = &lgr->lnk[i];
>  
> -		if (smc_link_sendable(lnk))
> +		if (smc_link_sendable(lnk)) {
>  			lnk->state = SMC_LNK_INACTIVE;
> +			smcr_lnk_cluster_on_lnk_state(lnk);
> +		}
>  	}
>  	wake_up_all(&lgr->llc_msg_waiter);
>  	wake_up_all(&lgr->llc_flow_waiter);
> @@ -762,6 +1044,9 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	atomic_set(&lnk->conn_cnt, 0);
>  	smc_llc_link_set_uid(lnk);
>  	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
> +	lnk->peer_qpn = ini->ib_clcqpn;
> +	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
> +	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
>  	if (!lnk->smcibdev->initialized) {
>  		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
>  		if (rc)
> @@ -792,6 +1077,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	if (rc)
>  		goto destroy_qp;
>  	lnk->state = SMC_LNK_ACTIVATING;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>  	return 0;
>  
>  destroy_qp:
> @@ -806,6 +1092,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	smc_ibdev_cnt_dec(lnk);
>  	put_device(&lnk->smcibdev->ibdev->dev);
>  	smcibdev = lnk->smcibdev;
> +	lnk->state = SMC_LNK_TEAR_DWON;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>  	memset(lnk, 0, sizeof(struct smc_link));
>  	lnk->state = SMC_LNK_UNUSED;
>  	if (!atomic_dec_return(&smcibdev->lnk_cnt))
> @@ -1263,6 +1551,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
>  	if (!lnk->lgr || lnk->clearing ||
>  	    lnk->state == SMC_LNK_UNUSED)
>  		return;
> +	lnk->state = SMC_LNK_TEAR_DWON;
> +	smcr_lnk_cluster_on_lnk_state(lnk);
>  	lnk->clearing = 1;
>  	lnk->peer_qpn = 0;
>  	smc_llc_link_clear(lnk, log);
> @@ -1712,6 +2002,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
>  {
>  	if (smc_link_downing(&lnk->state)) {
>  		trace_smcr_link_down(lnk, __builtin_return_address(0));
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		smcr_link_down(lnk);
>  	}
>  }
> @@ -1721,6 +2012,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
>  {
>  	if (smc_link_downing(&lnk->state)) {
>  		trace_smcr_link_down(lnk, __builtin_return_address(0));
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		schedule_work(&lnk->link_down_wrk);
>  	}
>  }
> @@ -1850,11 +2142,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  {
>  	struct smc_connection *conn = &smc->conn;
>  	struct net *net = sock_net(&smc->sk);
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct smc_lnk_cluster *lnkc = NULL;
>  	struct list_head *lgr_list;
>  	struct smc_link_group *lgr;
>  	enum smc_lgr_role role;
>  	spinlock_t *lgr_lock;
> -	int rc = 0;
> +	int rc = 0, timeo = CLC_WAIT_TIME;
>  
>  	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
>  				  &smc_lgr_list.list;
> @@ -1862,12 +2156,26 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  				  &smc_lgr_list.lock;
>  	ini->first_contact_local = 1;
>  	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
> -	if (role == SMC_CLNT && ini->first_contact_peer)
> +
> +	if (!ini->is_smcd) {
> +		lnkc = smcr_lnk_get_cluster_by_ini(ini, role);
> +		if (unlikely(!lnkc))
> +			return SMC_CLC_DECL_INTERR;
> +	}
> +
> +	if (role == SMC_CLNT && ini->first_contact_peer) {
> +		/* first_contact */
> +		spin_lock_bh(&lnkc->lock);
> +		lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> +		spin_unlock_bh(&lnkc->lock);
>  		/* create new link group as well */
>  		goto create;
> +	}
>  
>  	/* determine if an existing link group can be reused */
>  	spin_lock_bh(lgr_lock);
> +	spin_lock(&lnkc->lock);
> +again:
>  	list_for_each_entry(lgr, lgr_list, list) {
>  		write_lock_bh(&lgr->conns_lock);
>  		if ((ini->is_smcd ?
> @@ -1894,9 +2202,33 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  		}
>  		write_unlock_bh(&lgr->conns_lock);
>  	}
> +	if (lnkc && ini->first_contact_local) {
> +		if (lnkc->pending_capability > lnkc->conns_pending) {
> +			lnkc->conns_pending++;
> +			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> +			spin_unlock(&lnkc->lock);
> +			spin_unlock_bh(lgr_lock);
> +			set_current_state(TASK_INTERRUPTIBLE);
> +			/* need to wait at least once for first contact to complete */
> +			timeo = schedule_timeout(timeo);
> +			set_current_state(TASK_RUNNING);
> +			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
> +			spin_lock_bh(lgr_lock);
> +			spin_lock(&lnkc->lock);
> +
> +			lnkc->conns_pending--;
> +			if (timeo)
> +				goto again;
> +		}
> +		if (role == SMC_SERV) {
> +			/* first_contact */
> +			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
> +		}
> +	}
> +	spin_unlock(&lnkc->lock);
>  	spin_unlock_bh(lgr_lock);
>  	if (rc)
> -		return rc;
> +		goto out;
>  
>  	if (role == SMC_CLNT && !ini->first_contact_peer &&
>  	    ini->first_contact_local) {
> @@ -1904,7 +2236,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  		 * a new one
>  		 * send out_of_sync decline, reason synchr. error
>  		 */
> -		return SMC_CLC_DECL_SYNCERR;
> +		rc = SMC_CLC_DECL_SYNCERR;
> +		goto out;
>  	}
>  
>  create:
> @@ -1941,6 +2274,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
>  #endif
>  
>  out:
> +	/* smc_lnk_cluster_hold in smcr_lnk_get_or_create_cluster */
> +	smc_lnk_cluster_put(lnkc);
>  	return rc;
>  }
>  
> @@ -2599,12 +2934,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
>  
>  int __init smc_core_init(void)
>  {
> +	/* init smc lnk cluster maps */
> +	rhashtable_init(&smc_lgr_manager.lnk_cluster_maps, &smcr_lnk_cluster_rhl_params);
>  	return register_reboot_notifier(&smc_reboot_notifier);
>  }
>  
> +static void smc_lnk_cluster_free_cb(void *ptr, void *arg)
> +{
> +	pr_warn("smc: smc lnk cluster refcnt leak.\n");
> +	kfree(ptr);
> +}
> +
>  /* Called (from smc_exit) when module is removed */
>  void smc_core_exit(void)
>  {
>  	unregister_reboot_notifier(&smc_reboot_notifier);
>  	smc_lgrs_shutdown();
> +	/* destroy smc lnk cluster maps */
> +	rhashtable_free_and_destroy(&smc_lgr_manager.lnk_cluster_maps, smc_lnk_cluster_free_cb,
> +				    NULL);
>  }
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index fe8b524..199f533 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -15,6 +15,7 @@
>  #include <linux/atomic.h>
>  #include <linux/smc.h>
>  #include <linux/pci.h>
> +#include <linux/rhashtable.h>
>  #include <rdma/ib_verbs.h>
>  #include <net/genetlink.h>
>  
> @@ -29,18 +30,62 @@ struct smc_lgr_list {			/* list of link group definition */
>  	u32			num;	/* unique link group number */
>  };
>  
> +struct smc_lgr_manager {		/* manager for link group */
> +	struct rhashtable	lnk_cluster_maps;	/* maps of smc_lnk_cluster */
> +	spinlock_t		lock;	/* lock for lgr_cm_maps */
> +};
> +
> +struct smc_lnk_cluster {
> +	struct rhash_head	rnode;	/* node for rhashtable */
> +	struct wait_queue_head	first_contact_waitqueue;
> +					/* queue for non first contact to wait
> +					 * first contact to be established.
> +					 */
> +	spinlock_t		lock;	/* protection for link group */
> +	refcount_t		ref;	/* refcount for cluster */
> +	unsigned long		pending_capability;
> +					/* maximum pending number of connections that
> +					 * need wait first contact complete.
> +					 */
> +	unsigned long		conns_pending;
> +					/* connections that are waiting for first contact
> +					 * complete
> +					 */
> +	u8		peer_systemid[SMC_SYSTEMID_LEN];
> +	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
> +	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer*/
> +	int		clcqpn;
> +	int		role;
> +};
> +
>  enum smc_lgr_role {		/* possible roles of a link group */
>  	SMC_CLNT,	/* client */
>  	SMC_SERV	/* server */
>  };
>  
> +struct smc_lnk_cluster_compare_arg	/* key for smc_lnk_cluster */
> +{
> +	int	smcr_version;
> +	enum smc_lgr_role role;
> +	u8	*peer_systemid;
> +	u8	*peer_gid;
> +	u8	*peer_mac;
> +	int clcqpn;
> +};
> +
>  enum smc_link_state {			/* possible states of a link */
>  	SMC_LNK_UNUSED,		/* link is unused */
>  	SMC_LNK_INACTIVE,	/* link is inactive */
>  	SMC_LNK_ACTIVATING,	/* link is being activated */
>  	SMC_LNK_ACTIVE,		/* link is active */
> +	SMC_LNK_TEAR_DWON,	/* link is tear down */
>  };
>  
> +#define SMC_LNK_STATE_BIT(state)	(1 << (state))
> +
> +#define	SMC_LNK_STATE_RECORD(lnk, state)	\
> +	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
> +
>  #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
>  #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
>  
> @@ -145,6 +190,7 @@ struct smc_link {
>  	int			ndev_ifidx; /* network device ifindex */
>  
>  	enum smc_link_state	state;		/* state of link */
> +	int			state_record;		/* record of previous state */
>  	struct delayed_work	llc_testlink_wrk; /* testlink worker */
>  	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>  	int			llc_testlink_time; /* testlink interval */
> @@ -557,6 +603,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
>  int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
>  int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
>  
> +void smcr_lnk_cluster_on_lnk_state(struct smc_link *lnk);
> +
>  static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
>  {
>  	return link->lgr;
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index 175026a..8134c15 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
>  		goto out;
>  out_clear_lnk:
>  	lnk_new->state = SMC_LNK_INACTIVE;
> +	smcr_lnk_cluster_on_lnk_state(lnk_new);
>  	smcr_link_clear(lnk_new, false);
>  out_reject:
>  	smc_llc_cli_add_link_reject(qentry);
> @@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
>  		return; /* no asymmetric link */
>  	if (!smc_link_downing(&lnk_asym->state))
>  		return;
> +	smcr_lnk_cluster_on_lnk_state(lnk_asym);
>  	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
>  	smc_wr_tx_wait_no_pending_sends(lnk_asym);
>  	if (!lnk_new)
> @@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
>  out_err:
>  	if (link_new) {
>  		link_new->state = SMC_LNK_INACTIVE;
> +		smcr_lnk_cluster_on_lnk_state(link_new);
>  		smcr_link_clear(link_new, false);
>  	}
>  out:
> @@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
>  	del_llc->reason = 0;
>  	smc_llc_send_message(lnk, &qentry->msg); /* response */
>  
> -	if (smc_link_downing(&lnk_del->state))
> +	if (smc_link_downing(&lnk_del->state)) {
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		smc_switch_conns(lgr, lnk_del, false);
> +	}
>  	smcr_link_clear(lnk_del, true);
>  
>  	active_links = smc_llc_active_link_count(lgr);
> @@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
>  		goto out; /* asymmetric link already deleted */
>  
>  	if (smc_link_downing(&lnk_del->state)) {
> +		smcr_lnk_cluster_on_lnk_state(lnk);
>  		if (smc_switch_conns(lgr, lnk_del, false))
>  			smc_wr_tx_wait_no_pending_sends(lnk_del);
>  	}
> @@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
>  		schedule_delayed_work(&link->llc_testlink_wrk,
>  				      link->llc_testlink_time);
>  	}
> +	smcr_lnk_cluster_on_lnk_state(link);
>  }
>  
>  /* called in worker context */
> -- 
> 1.8.3.1

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-16  9:35 ` Jan Karcher
  2022-08-16 12:40   ` Tony Lu
@ 2022-08-17  4:55   ` D. Wythe
  2022-08-17 16:52     ` Jan Karcher
  1 sibling, 1 reply; 29+ messages in thread
From: D. Wythe @ 2022-08-17  4:55 UTC (permalink / raw)
  To: Jan Karcher, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 8/16/22 5:35 PM, Jan Karcher wrote:
> 
> 
> On 10.08.2022 19:47, D. Wythe wrote:
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> This patch set attempts to optimize the parallelism of SMC-R connections,
>> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
>> occur after thoses optimization.
>>
> 
> Thank you again for your submission!
> Let me give you a quick update from our side:
> We tested your patches on top of the net-next kernel on our s390 systems. They did crash our systems. After verifying our environment we pulled console logs, and now we can tell that there is indeed a problem with your patches regarding SMC-D. So please do not integrate this change as of right now. I'm going to do more in-depth reviews of your patches, but I need some time for them, so here is a quick description of the problem:

Sorry for the late reply, and thanks a lot for your comments.

I'm sorry for the low-level mistake. In the early design, I hoped that lnkc could also work on SMC-D,
but in later tests we found that we have no SMC-D environment to test with, so I had to cancel this logic.
Because the rollback was not thorough enough, this issue remained; we are very sorry for that.


> It is an SMC-D problem that occurs while building up the connection. In smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the SMC-R path you do grab the pointer; for SMC-D that never happens. Still, you are using this reference for SMC-D => crash. This problem can be reproduced using the SMC-D path. Here is an example console output:
> 
> [  779.516382] Unable to handle kernel pointer dereference in virtual kernel address space
> [  779.516389] Failing address: 0000000000000000 TEID: 0000000000000483
> [  779.516391] Fault in home space mode while using kernel ASCE.
> [  779.516395] AS:0000000069628007 R3:00000000ffbf0007 S:00000000ffbef800 P:000000000000003d
> [  779.516431] Oops: 0004 ilc:2 [#1] SMP
> [  779.516436] Modules linked in: tcp_diag inet_diag ism mlx5_ib ib_uverbs mlx5_core smc_diag smc ib_core nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv
> 6 nf_defrag_ipv4 ip_set nf_tables n
> [  779.516470] CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 5.19.0-13940-g22a46254655a #3
> [  779.516476] Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> 
> [  779.522738] Workqueue: smc_hs_wq smc_listen_work [smc]
> [  779.522755] Krnl PSW : 0704c00180000000 000003ff803da89c (smc_conn_create+0x174/0x968 [smc])
> [  779.522766]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [  779.522770] Krnl GPRS: 0000000000000002 0000000000000000 0000000000000001 0000000000000000
> [  779.522773]            000000008a4128a0 000003ff803f21aa 000000008e30d640 0000000086d72000
> [  779.522776]            0000000086d72000 000000008a412803 000000008a412800 000000008e30d650
> [  779.522779]            0000000080934200 0000000000000000 000003ff803cb954 00000380002dfa88
> [  779.522789] Krnl Code: 000003ff803da88e: e310f0e80024        stg %r1,232(%r15)
> [  779.522789]            000003ff803da894: a7180000            lhi %r1,0
> [  779.522789]           #000003ff803da898: 582003ac            l %r2,940
> [  779.522789]           >000003ff803da89c: ba123020            cs %r1,%r2,32(%r3)
> [  779.522789]            000003ff803da8a0: ec1603be007e        cij %r1,0,6,000003ff803db01c
> 
> [  779.522789]            000003ff803da8a6: 4110b002            la %r1,2(%r11)
> [  779.522789]            000003ff803da8aa: e310f0f00024        stg %r1,240(%r15)
> [  779.522789]            000003ff803da8b0: e310f0c00004        lg %r1,192(%r15)
> [  779.522870] Call Trace:
> [  779.522873]  [<000003ff803da89c>] smc_conn_create+0x174/0x968 [smc]
> [  779.522884]  [<000003ff803cb954>] smc_find_ism_v2_device_serv+0x1b4/0x300 [smc]
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 01.
> 01: HCPGSP2629I The virtual machine is placed in CP mode due to a SIGP stop from CPU 00.
> [  779.522894]  [<000003ff803cbace>] smc_listen_find_device+0x2e/0x370 [smc]
> 
> 
> I'm going to send the review for the first patch (the one causing the crash) right away; so far I'm done with it. The others are going to follow. Maybe you can look over the problem and come up with a solution; otherwise we are going to decide whether we want to look into it ourselves as soon as I'm done with the reviews. Thank you for your patience.

In the next revision, I will add an additional check to protect the SMC-D path.
Thanks for your comments.
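
As a rough sketch of that check (hypothetical helpers for illustration
only; not necessarily how the next revision will look), the idea is to
make every lnkc access tolerate NULL, which is what lnkc is on the
SMC-D path:

	/* lnkc is NULL for SMC-D connections, so locking must tolerate it */
	static inline void smc_lnk_cluster_lock(struct smc_lnk_cluster *lnkc)
	{
		if (lnkc)
			spin_lock(&lnkc->lock);
	}

	static inline void smc_lnk_cluster_unlock(struct smc_lnk_cluster *lnkc)
	{
		if (lnkc)
			spin_unlock(&lnkc->lock);
	}

smc_conn_create would then call such helpers (and their _bh variants)
instead of taking &lnkc->lock directly, so the SMC-D path can no longer
dereference a NULL pointer.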

Looking forward to your other comments; thanks again.

D. Wythe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-17  4:55   ` D. Wythe
@ 2022-08-17 16:52     ` Jan Karcher
  2022-08-18 13:06       ` D. Wythe
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Karcher @ 2022-08-17 16:52 UTC (permalink / raw)
  To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 17.08.2022 06:55, D. Wythe wrote:
> 
> 
> On 8/16/22 5:35 PM, Jan Karcher wrote:
>>
>>
>> On 10.08.2022 19:47, D. Wythe wrote:
>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>
>>> This patch set attempts to optimize the parallelism of SMC-R 
>>> connections,
>>> mainly to reduce unnecessary blocking on locks, and to fix exceptions 
>>> that
>>> occur after thoses optimization.
>>>
>>
>> Thank you again for your submission!
>> Let me give you a quick update from our side:
>> We tested your patches on top of the net-next kernel on our s390 
>> systems. They did crash our systems. After verifying our environment 
>> we pulled console logs, and now we can tell that there is indeed a 
>> problem with your patches regarding SMC-D. So please do not integrate 
>> this change as of right now. I'm going to do more in-depth reviews of 
>> your patches, but I need some time for them, so here is a quick 
>> description of the problem:
> 
> Sorry for the late reply, and thanks a lot for your comment.
> 
> I'm sorry for the low-level mistake. In the early design, I hoped that
> lnkc could also work for SMC-D, but in later tests it turned out that
> we have no SMC-D environment to test with, so I had to cancel this
> logic. Because the rollback wasn't thorough enough, this issue was
> left behind; we are very sorry for that.
> 

One more comment:
If the only reason you do not touch SMC-D is that you do not have the
environment to test it, we strongly encourage you to change it anyway.

At some point in kernel development, especially driver development, you
are going to reach a point where you do not have the environment to
test your changes. It is on the maintainers to test those changes and
verify that nothing is broken.

So please:
If testing is the only reason, change SMC-D as well, and we will test
it for you to verify whether it works.

Thank you
Jan

> 
>> It is an SMC-D problem that occurs while building up the connection.
>> In smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For
>> the SMC-R path you do grab the pointer; for SMC-D that never happens.
>> Still, you are using this reference for SMC-D => crash. This problem
>> can be reproduced using the SMC-D path. Here is an example console
>> output:
>>
>> [... console log snipped; identical to the one quoted in full
>> earlier in the thread ...]
>>
>>
>> I'm going to send the review for the first patch right away (which
>> is the one causing the crash); that is the one I have finished so
>> far. The others will follow. Maybe you can look over the problem and
>> come up with a solution; otherwise we will decide whether we want to
>> look into it once I'm done with the reviews. Thank you for your
>> patience.
> 
> In the next revision, I will add an additional check to protect the
> SMC-D path; thanks for your comments.
> 
> I'm looking forward to your other comments. Thanks again.
> 
> D. Wythe
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections
  2022-08-17 16:52     ` Jan Karcher
@ 2022-08-18 13:06       ` D. Wythe
  0 siblings, 0 replies; 29+ messages in thread
From: D. Wythe @ 2022-08-18 13:06 UTC (permalink / raw)
  To: Jan Karcher, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma



On 8/18/22 12:52 AM, Jan Karcher wrote:
> 
> 
> On 17.08.2022 06:55, D. Wythe wrote:
>>
>>
>> On 8/16/22 5:35 PM, Jan Karcher wrote:
>>>
>>>
>>> On 10.08.2022 19:47, D. Wythe wrote:
>>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>>
>>>> This patch set attempts to optimize the parallelism of SMC-R connections,
>>>> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
>>>> occur after those optimizations.
>>>>
>>>
>>> Thank you again for your submission!
>>> Let me give you a quick update from our side:
>>> We tested your patches on top of the net-next kernel on our s390 systems, and they did crash them. After verifying our environment we pulled console logs, and now we can tell that there is indeed a problem with your patches regarding SMC-D. So please do not integrate this change as of right now. I'm going to do more in-depth reviews of your patches, but I need some time for them, so here is a quick description of the problem:
>>
>> Sorry for the late reply, and thanks a lot for your comment.
>>
>> I'm sorry for the low-level mistake. In the early design, I hoped that lnkc could also work for SMC-D,
>> but in later tests it turned out that we have no SMC-D environment to test with, so I had to cancel this logic.
>> Because the rollback wasn't thorough enough, this issue was left behind; we are very sorry for that.
>>
> 
> One more comment:
> If the only reason you do not touch SMC-D is that you do not have the environment to test it, we strongly encourage you to change it anyway.
> 
> At some point in kernel development, especially driver development, you are going to reach a point where you do not have the environment to test your changes. It is on the maintainers to test those changes and verify that nothing is broken.
> 
> So please:
> If testing is the only reason, change SMC-D as well, and we will test it for you to verify whether it works.
> 
> Thank you
> Jan

Actually, this is not the only reason. The purpose of removing smc_server_lgr_pending & smc_client_lgr_pending
is mainly to solve the problem of excessive lock granularity in SMC-R. In SMC-R those locks protect
a complete CLC message exchange, including both sending and receiving, which forces a large number of
connections to queue behind the lock. This is not the case with SMC-D: SMC-D releases the lock
before waiting for the CLC message, which makes the problem much less severe there than in SMC-R.
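
Schematically, the difference is where the network wait sits relative
to the lock; this is a compilable sketch with placeholder helpers
(clc_send()/clc_wait() are hypothetical), not the real handshake code:

/* sketch of the lock-granularity difference described above */
#include <pthread.h>

static pthread_mutex_t lgr_pending = PTHREAD_MUTEX_INITIALIZER;

static void clc_send(void) { }	/* placeholder: send a CLC message  */
static void clc_wait(void) { }	/* placeholder: block on network IO */

/* SMC-R today: one coarse lock held across the whole CLC exchange,
 * so every other handshake queues behind the network wait */
static void smcr_handshake_coarse(void)
{
	pthread_mutex_lock(&lgr_pending);
	clc_send();
	clc_wait();			/* network wait under the lock */
	pthread_mutex_unlock(&lgr_pending);
}

/* SMC-D style: the lock is released before waiting for the reply,
 * so the queueing problem is far less severe */
static void smcd_handshake_finer(void)
{
	pthread_mutex_lock(&lgr_pending);
	clc_send();
	pthread_mutex_unlock(&lgr_pending);	/* released early */
	clc_wait();			/* network wait without the lock */
}

int main(void)
{
	smcr_handshake_coarse();
	smcd_handshake_finer();
	return 0;
}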

Of course, lnkc could also be used for SMC-D, but considering that we have no way to test it,
and that it is not the core bottleneck for SMC-D, we gave it up.

I will fix the panic problem first in the next revision. If you have a strong demand for this feature,
I may submit a separate patch to support it, because the current patch set is already quite complicated;
adding SMC-D support would exacerbate its complexity, which may slow down the other reviewers' progress.


Thanks
D. Wythe

>>
>>> It is an SMC-D problem that occurs while building up the connection. In smc_conn_create you set struct smc_lnk_cluster *lnkc = NULL. For the SMC-R path you do grab the pointer; for SMC-D that never happens. Still, you are using this reference for SMC-D => crash. This problem can be reproduced using the SMC-D path. Here is an example console output:
>>>
>>> [... console log snipped; identical to the one quoted in full earlier in the thread ...]
>>>
>>>
>>> I'm going to send the review for the first patch right away (which is the one causing the crash); that is the one I have finished so far. The others will follow. Maybe you can look over the problem and come up with a solution; otherwise we will decide whether we want to look into it once I'm done with the reviews. Thank you for your patience.
>>
>> In the next revision, I will add an additional check to protect the
>> SMC-D path; thanks for your comments.
>>
>> I'm looking forward to your other comments. Thanks again.
>> D. Wythe
>>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-08-18 13:06 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-10 17:47 [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections D. Wythe
2022-08-10 17:47 ` [PATCH net-next 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
2022-08-11  3:41   ` kernel test robot
2022-08-11 11:51   ` kernel test robot
2022-08-16  9:43   ` Jan Karcher
2022-08-16 12:47     ` Tony Lu
2022-08-16 12:52   ` Tony Lu
2022-08-10 17:47 ` [PATCH net-next 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
2022-08-16  7:58   ` Tony Lu
2022-08-10 17:47 ` [PATCH net-next 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
2022-08-16  8:17   ` Tony Lu
2022-08-10 17:47 ` [PATCH net-next 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D. Wythe
2022-08-10 17:47 ` [PATCH net-next 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D. Wythe
2022-08-10 17:47 ` [PATCH net-next 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D. Wythe
2022-08-10 17:47 ` [PATCH net-next 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
2022-08-16  8:24   ` Tony Lu
2022-08-10 17:47 ` [PATCH net-next 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
2022-08-16  8:37   ` Tony Lu
2022-08-10 17:47 ` [PATCH net-next 09/10] net/smc: fix potential panic dues to unprotected smc_llc_srv_add_link() D. Wythe
2022-08-16  8:28   ` Tony Lu
2022-08-10 17:47 ` [PATCH net-next 10/10] net/smc: fix application data exception D. Wythe
2022-08-11  3:28 ` [PATCH net-next 00/10] net/smc: optimize the parallelism of SMC-R connections Jakub Kicinski
2022-08-11  5:13   ` Tony Lu
2022-08-11 12:31 ` Karsten Graul
2022-08-16  9:35 ` Jan Karcher
2022-08-16 12:40   ` Tony Lu
2022-08-17  4:55   ` D. Wythe
2022-08-17 16:52     ` Jan Karcher
2022-08-18 13:06       ` D. Wythe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).