* [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections
@ 2022-08-26 9:51 D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
` (11 more replies)
0 siblings, 12 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patch set attempts to optimize the parallelism of SMC-R connections,
mainly to reduce unnecessary blocking on locks, and to fix exceptions that
occur after those optimizations.
According to the off-CPU graph, the SMC workers' off-CPU time is distributed as follows:
smc_close_passive_work (1.09%)
	smcr_buf_unuse (1.08%)
		smc_llc_flow_initiate (1.02%)
smc_listen_work (48.17%)
	__mutex_lock.isra.11 (47.96%)
An ideal SMC-R connection should only block on network IO events,
but it is quite clear that SMC-R connections currently spend most
of their time queued on locks.
The goal of this patch set is to reach that ideal situation: connections
block on network IO events for the majority of their lifetime.
There are three big locks here:
1. smc_client_lgr_pending & smc_server_lgr_pending
2. llc_conf_mutex
3. rmbs_lock & sndbufs_lock
And an implementation issue:
1. confirm/delete rkey messages cannot be sent concurrently, even
though the protocol does allow it.
Unfortunately, the above problems together limit the parallelism of
SMC-R connections. If any one of them is left unsolved, our goal cannot
be achieved.
After this patch set, we can get a quite ideal off-CPU graph as
following:
smc_close_passive_work (41.58%)
	smcr_buf_unuse (41.57%)
		smc_llc_do_delete_rkey (41.57%)
smc_listen_work (39.10%)
	smc_clc_wait_msg (13.18%)
		tcp_recvmsg_locked (13.18%)
	smc_listen_find_device (25.87%)
		smcr_lgr_reg_rmbs (25.87%)
			smc_llc_do_confirm_rkey (25.87%)
We can see that most of the wait time is now spent on network IO
events. This also brings a measurable performance improvement in our
short-lived connection wrk/nginx benchmark:
+--------------+------+------+-------+--------+------+--------+
|conns/qps |c4 | c8 | c16 | c32 | c64 | c200 |
+--------------+------+------+-------+--------+------+--------+
|SMC-R before |9.7k | 10k | 10k | 9.9k | 9.1k | 8.9k |
+--------------+------+------+-------+--------+------+--------+
|SMC-R now |13k | 19k | 18k | 16k | 15k | 12k |
+--------------+------+------+-------+--------+------+--------+
|TCP |15k | 35k | 51k | 80k | 100k | 162k |
+--------------+------+------+-------+--------+------+--------+
The reason the benefit becomes less obvious as the number of connections
grows is the workqueue. If we change the workqueue to UNBOUND, we can
obtain at least a 4-5x performance improvement, reaching up to half
of TCP. However, that is not an elegant solution, and optimizing it
properly will be much more complicated. In any case, we will submit the
relevant optimization patches as soon as possible.
Please note that the premise here is that the lock-related problems
must be solved first; otherwise, no matter how we optimize the workqueue,
there won't be much improvement.
Because there are a lot of related changes to the code, if you have
any questions or suggestions, please let me know.
Thanks
D. Wythe
v1 -> v2:
1. Fix panic in SMC-D scenario
2. Fix lnkc-related hashfn calculation exception caused by operator
precedence.
3. Remove -EBUSY processing of rhashtable_insert_fast, see more details
in comments around smcr_link_get_or_create_cluster().
4. Only wake up one connection if the link has not been active.
5. Delete obsolete unlock logic in smc_listen_work().
6. PATCH format, do Reverse Christmas tree.
7. PATCH format, change all xxx_lnk_xxx function to xxx_link_xxx.
8. PATCH format, add correct fix tag for the patches for fixes.
9. PATCH format, fix some spelling errors.
10.PATCH format, rename slow to do_slow in smcr_lgr_reg_rmbs().
D. Wythe (10):
net/smc: remove locks smc_client_lgr_pending and
smc_server_lgr_pending
net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
net/smc: allow confirm/delete rkey response deliver multiplex
net/smc: make SMC_LLC_FLOW_RKEY run concurrently
net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
net/smc: use read semaphores to reduce unnecessary blocking in
smc_buf_create() & smcr_buf_unuse()
net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
net/smc: Fix potential panic due to unprotected
smc_llc_srv_add_link()
net/smc: fix application data exception
net/smc/af_smc.c | 42 +++--
net/smc/smc_core.c | 443 +++++++++++++++++++++++++++++++++++++++++++++++------
net/smc/smc_core.h | 78 +++++++++-
net/smc/smc_llc.c | 286 +++++++++++++++++++++++++---------
net/smc/smc_llc.h | 6 +
net/smc/smc_wr.c | 10 --
net/smc/smc_wr.h | 10 ++
7 files changed, 725 insertions(+), 150 deletions(-)
--
1.8.3.1
* [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-29 14:48 ` Jan Karcher
2022-08-31 15:04 ` Jan Karcher
2022-08-26 9:51 ` [PATCH net-next v2 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
` (10 subsequent siblings)
11 siblings, 2 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
This patch attempts to remove the locks smc_client_lgr_pending and
smc_server_lgr_pending, which serialize the creation of link groups.
However, once a link group already exists, these locks are meaningless;
worse still, they force incoming connections to be queued one after
another.
Now, link group creation is no longer decided by lock competition, but
is allocated through the following strategy.
1. Try to find a suitable link group; if successful, the current
connection is considered a non-first-contact connection. Done.
2. Check the number of connections currently waiting for a suitable
link group to be created; if it is not less than the number of link
groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
increase the number of link groups to be created, and the current
connection is considered the first-contact connection. Done.
3. Increase the number of connections currently waiting, and wait
to be woken up.
4. Decrease the number of connections currently waiting, goto 1.
We wake up the connections that were put to sleep in step 3 through
the SMC link state change event. Once the link moves out of the
SMC_LNK_ACTIVATING state, we decrease the number of link groups to
be created and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
connections.
In the implementation, we introduce the concept of a lnk cluster, which
is a collection of links with the same characteristics (see
smcr_link_cluster_cmpfn() for more details), and which makes it possible
to wake up efficiently in N vs. 1 scenarios.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 13 +-
net/smc/smc_core.c | 352 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
net/smc/smc_core.h | 53 ++++++++
net/smc/smc_llc.c | 9 +-
4 files changed, 411 insertions(+), 16 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 79c1318..d0e6bec 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1194,10 +1194,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
if (reason_code)
return reason_code;
- mutex_lock(&smc_client_lgr_pending);
reason_code = smc_conn_create(smc, ini);
if (reason_code) {
- mutex_unlock(&smc_client_lgr_pending);
return reason_code;
}
@@ -1289,7 +1287,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
if (reason_code)
goto connect_abort;
}
- mutex_unlock(&smc_client_lgr_pending);
smc_copy_sock_settings_to_clc(smc);
smc->connect_nonblock = 0;
@@ -1299,7 +1296,6 @@ static int smc_connect_rdma(struct smc_sock *smc,
return 0;
connect_abort:
smc_conn_abort(smc, ini->first_contact_local);
- mutex_unlock(&smc_client_lgr_pending);
smc->connect_nonblock = 0;
return reason_code;
@@ -2377,7 +2373,8 @@ static void smc_listen_work(struct work_struct *work)
if (rc)
goto out_decl;
- mutex_lock(&smc_server_lgr_pending);
+ if (ini->is_smcd)
+ mutex_lock(&smc_server_lgr_pending);
smc_close_init(new_smc);
smc_rx_init(new_smc);
smc_tx_init(new_smc);
@@ -2404,8 +2401,6 @@ static void smc_listen_work(struct work_struct *work)
rc = smc_clc_wait_msg(new_smc, cclc, sizeof(*buf),
SMC_CLC_CONFIRM, CLC_WAIT_TIME);
if (rc) {
- if (!ini->is_smcd)
- goto out_unlock;
goto out_decl;
}
@@ -2415,7 +2410,6 @@ static void smc_listen_work(struct work_struct *work)
ini->first_contact_local, ini);
if (rc)
goto out_unlock;
- mutex_unlock(&smc_server_lgr_pending);
}
smc_conn_save_peer_info(new_smc, cclc);
smc_listen_out_connected(new_smc);
@@ -2423,7 +2417,8 @@ static void smc_listen_work(struct work_struct *work)
goto out_free;
out_unlock:
- mutex_unlock(&smc_server_lgr_pending);
+ if (ini->is_smcd)
+ mutex_unlock(&smc_server_lgr_pending);
out_decl:
smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
proposal_version);
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index ff49a11..cfaddf2 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -46,6 +46,10 @@ struct smc_lgr_list smc_lgr_list = { /* established link groups */
.num = 0,
};
+static struct smc_lgr_manager smc_lgr_manager = {
+ .lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
+};
+
static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
@@ -55,6 +59,255 @@ static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
static void smc_link_down_work(struct work_struct *work);
+/* SMC-R lnk cluster compare func
+ * All lnks that meet the description conditions of this function
+ * are logically aggregated, called lnk cluster.
+ * For the server side, lnk cluster is used to determine whether
+ * a new group needs to be created when processing new incoming connections.
+ * For the client side, lnk cluster is used to determine whether
+ * to wait for link ready (in other words, first contact ready).
+ */
+static int smcr_link_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
+{
+ const struct smc_link_cluster_compare_arg *key = arg->key;
+ const struct smc_link_cluster *lnkc = obj;
+
+ if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
+ return 1;
+
+ if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
+ return 1;
+
+ if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
+ (key->smcr_version == SMC_V2 ||
+ !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
+ return 0;
+
+ return 1;
+}
+
+/* SMC-R lnk cluster hash func */
+static u32 smcr_link_cluster_hashfn(const void *data, u32 len, u32 seed)
+{
+ const struct smc_link_cluster *lnkc = data;
+
+ return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
+ + ((lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn);
+}
+
+/* SMC-R lnk cluster compare arg hash func */
+static u32 smcr_link_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
+{
+ const struct smc_link_cluster_compare_arg *key = data;
+
+ return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
+ + ((key->role == SMC_SERV) ? 0 : key->clcqpn);
+}
+
+static const struct rhashtable_params smcr_link_cluster_rhl_params = {
+ .head_offset = offsetof(struct smc_link_cluster, rnode),
+ .key_len = sizeof(struct smc_link_cluster_compare_arg),
+ .obj_cmpfn = smcr_link_cluster_cmpfn,
+ .obj_hashfn = smcr_link_cluster_hashfn,
+ .hashfn = smcr_link_cluster_compare_arg_hashfn,
+ .automatic_shrinking = true,
+};
+
+/* hold a reference for smc_link_cluster */
+static inline void smc_link_cluster_hold(struct smc_link_cluster *lnkc)
+{
+ if (likely(lnkc))
+ refcount_inc(&lnkc->ref);
+}
+
+/* release a reference for smc_link_cluster */
+static inline void smc_link_cluster_put(struct smc_link_cluster *lnkc)
+{
+ bool do_free = false;
+
+ if (!lnkc)
+ return;
+
+ if (refcount_dec_not_one(&lnkc->ref))
+ return;
+
+ spin_lock_bh(&smc_lgr_manager.lock);
+ /* last ref */
+ if (refcount_dec_and_test(&lnkc->ref)) {
+ do_free = true;
+ rhashtable_remove_fast(&smc_lgr_manager.link_cluster_maps, &lnkc->rnode,
+ smcr_link_cluster_rhl_params);
+ }
+ spin_unlock_bh(&smc_lgr_manager.lock);
+ if (do_free)
+ kfree(lnkc);
+}
+
+/* Get or create smc_link_cluster by key
+ * This function will hold a reference of returned smc_link_cluster
+ * or create a new smc_link_cluster with the reference initialized to 1.
+ * caller MUST call smc_link_cluster_put after this.
+ */
+static inline struct smc_link_cluster *
+smcr_link_get_or_create_cluster(struct smc_link_cluster_compare_arg *key)
+{
+ struct smc_link_cluster *lnkc;
+ int err;
+
+ spin_lock_bh(&smc_lgr_manager.lock);
+ lnkc = rhashtable_lookup_fast(&smc_lgr_manager.link_cluster_maps, key,
+ smcr_link_cluster_rhl_params);
+ if (!lnkc) {
+ lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
+ if (unlikely(!lnkc))
+ goto fail;
+
+ /* init cluster */
+ spin_lock_init(&lnkc->lock);
+ lnkc->role = key->role;
+ if (key->role == SMC_CLNT)
+ lnkc->clcqpn = key->clcqpn;
+ init_waitqueue_head(&lnkc->first_contact_waitqueue);
+ memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
+ memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
+ memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
+ refcount_set(&lnkc->ref, 1);
+
+ err = rhashtable_insert_fast(&smc_lgr_manager.link_cluster_maps,
+ &lnkc->rnode, smcr_link_cluster_rhl_params);
+ if (unlikely(err)) {
+ pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)\n", err);
+ kfree(lnkc);
+ lnkc = NULL;
+ }
+ } else {
+ smc_link_cluster_hold(lnkc);
+ }
+fail:
+ spin_unlock_bh(&smc_lgr_manager.lock);
+ return lnkc;
+}
+
+/* Get or create a smc_link_cluster by lnk
+ * caller MUST call smc_link_cluster_put after this.
+ */
+static inline struct smc_link_cluster *smcr_link_get_cluster(struct smc_link *lnk)
+{
+ struct smc_link_cluster_compare_arg key;
+ struct smc_link_group *lgr;
+
+ lgr = lnk->lgr;
+ if (!lgr || lgr->is_smcd)
+ return NULL;
+
+ key.smcr_version = lgr->smc_version;
+ key.peer_systemid = lgr->peer_systemid;
+ key.peer_gid = lnk->peer_gid;
+ key.peer_mac = lnk->peer_mac;
+ key.role = lgr->role;
+ if (key.role == SMC_CLNT)
+ key.clcqpn = lnk->peer_qpn;
+
+ return smcr_link_get_or_create_cluster(&key);
+}
+
+/* Get or create a smc_link_cluster by ini
+ * caller MUST call smc_link_cluster_put after this.
+ */
+static inline struct smc_link_cluster *
+smcr_link_get_cluster_by_ini(struct smc_init_info *ini, int role)
+{
+ struct smc_link_cluster_compare_arg key;
+
+ if (ini->is_smcd)
+ return NULL;
+
+ key.smcr_version = ini->smcr_version;
+ key.peer_systemid = ini->peer_systemid;
+ key.peer_gid = ini->peer_gid;
+ key.peer_mac = ini->peer_mac;
+ key.role = role;
+ if (role == SMC_CLNT)
+ key.clcqpn = ini->ib_clcqpn;
+
+ return smcr_link_get_or_create_cluster(&key);
+}
+
+/* callback when smc link state change */
+void smcr_link_cluster_on_link_state(struct smc_link *lnk)
+{
+ struct smc_link_cluster *lnkc;
+ int nr = 0;
+
+ /* barrier for lnk->state */
+ smp_mb();
+
+ /* only the first link can make connections block on
+ * first_contact_waitqueue
+ */
+ if (lnk->link_idx != SMC_SINGLE_LINK)
+ return;
+
+ /* state already seen */
+ if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
+ return;
+
+ lnkc = smcr_link_get_cluster(lnk);
+
+ if (unlikely(!lnkc))
+ return;
+
+ spin_lock_bh(&lnkc->lock);
+
+ /* all lnk state change should be
+ * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DOWN (link init failed)
+ * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DOWN
+ * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DOWN
+ * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DOWN
+ * 5. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_ACTIVE -> SMC_LNK_INACTIVE
+ * -> SMC_LNK_TEAR_DOWN
+ */
+ switch (lnk->state) {
+ case SMC_LNK_ACTIVATING:
+ /* It's safe to hold a reference without the lock
+ * because smcr_link_get_cluster already holds one
+ */
+ smc_link_cluster_hold(lnkc);
+ break;
+ case SMC_LNK_TEAR_DOWN:
+ if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
+ /* smc_link_cluster_hold in SMC_LNK_ACTIVATING */
+ smc_link_cluster_put(lnkc);
+ fallthrough;
+ case SMC_LNK_ACTIVE:
+ case SMC_LNK_INACTIVE:
+ if (!(lnk->state_record &
+ (SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
+ | SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
+ lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
+ nr = SMC_RMBS_PER_LGR_MAX - 1;
+ if (unlikely(lnk->state != SMC_LNK_ACTIVE)) {
+ lnkc->lacking_first_contact++;
+ /* only wake up one connection to perform
+ * first contact on the server side; the client MUST wake up
+ * all to decline.
+ */
+ if (lnkc->role == SMC_SERV)
+ nr = 1;
+ }
+ }
+ break;
+ case SMC_LNK_UNUSED:
+ pr_warn_ratelimited("net/smc: invalid lnk state\n");
+ break;
+ }
+ SMC_LNK_STATE_RECORD(lnk, lnk->state);
+ spin_unlock_bh(&lnkc->lock);
+ if (nr)
+ wake_up_nr(&lnkc->first_contact_waitqueue, nr);
+ smc_link_cluster_put(lnkc); /* smc_link_cluster_hold in smcr_link_get_cluster */
+}
+
/* return head of link group list and its lock for a given link group */
static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
spinlock_t **lgr_lock)
@@ -651,8 +904,10 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
struct smc_link *lnk = &lgr->lnk[i];
- if (smc_link_sendable(lnk))
+ if (smc_link_sendable(lnk)) {
lnk->state = SMC_LNK_INACTIVE;
+ smcr_link_cluster_on_link_state(lnk);
+ }
}
wake_up_all(&lgr->llc_msg_waiter);
wake_up_all(&lgr->llc_flow_waiter);
@@ -756,12 +1011,16 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
lnk->link_id = smcr_next_link_id(lgr);
lnk->lgr = lgr;
smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
+ rwlock_init(&lnk->rtokens_lock);
lnk->link_idx = link_idx;
smc_ibdev_cnt_inc(lnk);
smcr_copy_dev_info_to_link(lnk);
atomic_set(&lnk->conn_cnt, 0);
smc_llc_link_set_uid(lnk);
INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
+ lnk->peer_qpn = ini->ib_clcqpn;
+ memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
+ memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
if (!lnk->smcibdev->initialized) {
rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
if (rc)
@@ -792,6 +1051,7 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
if (rc)
goto destroy_qp;
lnk->state = SMC_LNK_ACTIVATING;
+ smcr_link_cluster_on_link_state(lnk);
return 0;
destroy_qp:
@@ -806,6 +1066,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
smc_ibdev_cnt_dec(lnk);
put_device(&lnk->smcibdev->ibdev->dev);
smcibdev = lnk->smcibdev;
+ lnk->state = SMC_LNK_TEAR_DOWN;
+ smcr_link_cluster_on_link_state(lnk);
memset(lnk, 0, sizeof(struct smc_link));
lnk->state = SMC_LNK_UNUSED;
if (!atomic_dec_return(&smcibdev->lnk_cnt))
@@ -1263,6 +1525,8 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
if (!lnk->lgr || lnk->clearing ||
lnk->state == SMC_LNK_UNUSED)
return;
+ lnk->state = SMC_LNK_TEAR_DOWN;
+ smcr_link_cluster_on_link_state(lnk);
lnk->clearing = 1;
lnk->peer_qpn = 0;
smc_llc_link_clear(lnk, log);
@@ -1712,6 +1976,7 @@ void smcr_link_down_cond(struct smc_link *lnk)
{
if (smc_link_downing(&lnk->state)) {
trace_smcr_link_down(lnk, __builtin_return_address(0));
+ smcr_link_cluster_on_link_state(lnk);
smcr_link_down(lnk);
}
}
@@ -1721,6 +1986,7 @@ void smcr_link_down_cond_sched(struct smc_link *lnk)
{
if (smc_link_downing(&lnk->state)) {
trace_smcr_link_down(lnk, __builtin_return_address(0));
+ smcr_link_cluster_on_link_state(lnk);
schedule_work(&lnk->link_down_wrk);
}
}
@@ -1850,11 +2116,13 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
{
struct smc_connection *conn = &smc->conn;
struct net *net = sock_net(&smc->sk);
+ DECLARE_WAITQUEUE(wait, current);
+ struct smc_link_cluster *lnkc = NULL;
struct list_head *lgr_list;
struct smc_link_group *lgr;
enum smc_lgr_role role;
spinlock_t *lgr_lock;
- int rc = 0;
+ int rc = 0, timeo = CLC_WAIT_TIME;
lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
&smc_lgr_list.list;
@@ -1862,12 +2130,29 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
&smc_lgr_list.lock;
ini->first_contact_local = 1;
role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
- if (role == SMC_CLNT && ini->first_contact_peer)
+
+ if (!ini->is_smcd) {
+ lnkc = smcr_link_get_cluster_by_ini(ini, role);
+ if (unlikely(!lnkc))
+ return SMC_CLC_DECL_INTERR;
+ }
+
+ if (role == SMC_CLNT && ini->first_contact_peer) {
+ if (!ini->is_smcd) {
+ /* first_contact */
+ spin_lock_bh(&lnkc->lock);
+ lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
+ spin_unlock_bh(&lnkc->lock);
+ }
/* create new link group as well */
goto create;
+ }
/* determine if an existing link group can be reused */
spin_lock_bh(lgr_lock);
+ if (!ini->is_smcd)
+ spin_lock(&lnkc->lock);
+again:
list_for_each_entry(lgr, lgr_list, list) {
write_lock_bh(&lgr->conns_lock);
if ((ini->is_smcd ?
@@ -1894,9 +2179,41 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
}
write_unlock_bh(&lgr->conns_lock);
}
+ if (!ini->is_smcd && ini->first_contact_local) {
+ if (lnkc->pending_capability > lnkc->conns_pending) {
+ lnkc->conns_pending++;
+ add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
+ spin_unlock(&lnkc->lock);
+ spin_unlock_bh(lgr_lock);
+ set_current_state(TASK_INTERRUPTIBLE);
+ /* need to wait at least once first contact done */
+ timeo = schedule_timeout(timeo);
+ set_current_state(TASK_RUNNING);
+ remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
+ spin_lock_bh(lgr_lock);
+ spin_lock(&lnkc->lock);
+
+ lnkc->conns_pending--;
+ if (likely(timeo && !lnkc->lacking_first_contact))
+ goto again;
+
+ /* lnk create failed, only server side can raise
+ * a new first contact. client side here will
+ * fallback by SMC_CLC_DECL_SYNCERR.
+ */
+ if (role == SMC_SERV && lnkc->lacking_first_contact)
+ lnkc->lacking_first_contact--;
+ }
+ if (role == SMC_SERV) {
+ /* first_contact */
+ lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
+ }
+ }
+ if (!ini->is_smcd)
+ spin_unlock(&lnkc->lock);
spin_unlock_bh(lgr_lock);
if (rc)
- return rc;
+ goto out;
if (role == SMC_CLNT && !ini->first_contact_peer &&
ini->first_contact_local) {
@@ -1904,7 +2221,8 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
* a new one
* send out_of_sync decline, reason synchr. error
*/
- return SMC_CLC_DECL_SYNCERR;
+ rc = SMC_CLC_DECL_SYNCERR;
+ goto out;
}
create:
@@ -1941,6 +2259,9 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
#endif
out:
+ /* smc_link_cluster_hold in smcr_link_get_or_create_cluster */
+ if (!ini->is_smcd)
+ smc_link_cluster_put(lnkc);
return rc;
}
@@ -2500,19 +2821,24 @@ int smc_rtoken_add(struct smc_link *lnk, __be64 nw_vaddr, __be32 nw_rkey)
u32 rkey = ntohl(nw_rkey);
int i;
+ write_lock_bh(&lnk->rtokens_lock);
for (i = 0; i < SMC_RMBS_PER_LGR_MAX; i++) {
if (lgr->rtokens[i][lnk->link_idx].rkey == rkey &&
lgr->rtokens[i][lnk->link_idx].dma_addr == dma_addr &&
test_bit(i, lgr->rtokens_used_mask)) {
/* already in list */
+ write_unlock_bh(&lnk->rtokens_lock);
return i;
}
}
i = smc_rmb_reserve_rtoken_idx(lgr);
- if (i < 0)
+ if (i < 0) {
+ write_unlock_bh(&lnk->rtokens_lock);
return i;
+ }
lgr->rtokens[i][lnk->link_idx].rkey = rkey;
lgr->rtokens[i][lnk->link_idx].dma_addr = dma_addr;
+ write_unlock_bh(&lnk->rtokens_lock);
return i;
}
@@ -2523,6 +2849,7 @@ int smc_rtoken_delete(struct smc_link *lnk, __be32 nw_rkey)
u32 rkey = ntohl(nw_rkey);
int i, j;
+ write_lock_bh(&lnk->rtokens_lock);
for (i = 0; i < SMC_RMBS_PER_LGR_MAX; i++) {
if (lgr->rtokens[i][lnk->link_idx].rkey == rkey &&
test_bit(i, lgr->rtokens_used_mask)) {
@@ -2531,9 +2858,11 @@ int smc_rtoken_delete(struct smc_link *lnk, __be32 nw_rkey)
lgr->rtokens[i][j].dma_addr = 0;
}
clear_bit(i, lgr->rtokens_used_mask);
+ write_unlock_bh(&lnk->rtokens_lock);
return 0;
}
}
+ write_unlock_bh(&lnk->rtokens_lock);
return -ENOENT;
}
@@ -2599,12 +2928,23 @@ static int smc_core_reboot_event(struct notifier_block *this,
int __init smc_core_init(void)
{
+ /* init smc lnk cluster maps */
+ rhashtable_init(&smc_lgr_manager.link_cluster_maps, &smcr_link_cluster_rhl_params);
return register_reboot_notifier(&smc_reboot_notifier);
}
+static void smc_link_cluster_free_cb(void *ptr, void *arg)
+{
+ pr_warn("smc: smc lnk cluster refcnt leak.\n");
+ kfree(ptr);
+}
+
/* Called (from smc_exit) when module is removed */
void smc_core_exit(void)
{
unregister_reboot_notifier(&smc_reboot_notifier);
smc_lgrs_shutdown();
+ /* destroy smc lnk cluster maps */
+ rhashtable_free_and_destroy(&smc_lgr_manager.link_cluster_maps, smc_link_cluster_free_cb,
+ NULL);
}
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index fe8b524..3c3bc11 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -15,6 +15,7 @@
#include <linux/atomic.h>
#include <linux/smc.h>
#include <linux/pci.h>
+#include <linux/rhashtable.h>
#include <rdma/ib_verbs.h>
#include <net/genetlink.h>
@@ -29,18 +30,66 @@ struct smc_lgr_list { /* list of link group definition */
u32 num; /* unique link group number */
};
+struct smc_lgr_manager { /* manager for link group */
+ struct rhashtable link_cluster_maps; /* maps of smc_link_cluster */
+ spinlock_t lock; /* lock for link_cluster_maps */
+};
+
+struct smc_link_cluster {
+ struct rhash_head rnode; /* node for rhashtable */
+ struct wait_queue_head first_contact_waitqueue;
+ /* queue for non-first-contact connections to wait
+ * for the first contact to be established.
+ */
+ spinlock_t lock; /* protection for link group */
+ refcount_t ref; /* refcount for cluster */
+ unsigned long pending_capability;
+ /* maximum pending number of connections that
+ * need wait first contact complete.
+ */
+ unsigned long conns_pending;
+ /* connections that are waiting for first contact
+ * complete
+ */
+ u32 lacking_first_contact;
+ /* number of first contacts that failed and
+ * must be taken over by a waiting connection.
+ */
+ u8 peer_systemid[SMC_SYSTEMID_LEN];
+ u8 peer_mac[ETH_ALEN]; /* = gid[8:10||13:15] */
+ u8 peer_gid[SMC_GID_SIZE]; /* gid of peer */
+ int clcqpn;
+ int role;
+};
+
enum smc_lgr_role { /* possible roles of a link group */
SMC_CLNT, /* client */
SMC_SERV /* server */
};
+struct smc_link_cluster_compare_arg /* key for smc_link_cluster */
+{
+ int smcr_version;
+ enum smc_lgr_role role;
+ u8 *peer_systemid;
+ u8 *peer_gid;
+ u8 *peer_mac;
+ int clcqpn;
+};
+
enum smc_link_state { /* possible states of a link */
SMC_LNK_UNUSED, /* link is unused */
SMC_LNK_INACTIVE, /* link is inactive */
SMC_LNK_ACTIVATING, /* link is being activated */
SMC_LNK_ACTIVE, /* link is active */
+ SMC_LNK_TEAR_DOWN, /* link is torn down */
};
+#define SMC_LNK_STATE_BIT(state) (1 << (state))
+
+#define SMC_LNK_STATE_RECORD(lnk, state) \
+ ((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
+
#define SMC_WR_BUF_SIZE 48 /* size of work request buffer */
#define SMC_WR_BUF_V2_SIZE 8192 /* size of v2 work request buffer */
@@ -107,6 +156,7 @@ struct smc_link {
u32 wr_tx_cnt; /* number of WR send buffers */
wait_queue_head_t wr_tx_wait; /* wait for free WR send buf */
atomic_t wr_tx_refcnt; /* tx refs to link */
+ rwlock_t rtokens_lock;
struct smc_wr_buf *wr_rx_bufs; /* WR recv payload buffers */
struct ib_recv_wr *wr_rx_ibs; /* WR recv meta data */
@@ -145,6 +195,7 @@ struct smc_link {
int ndev_ifidx; /* network device ifindex */
enum smc_link_state state; /* state of link */
+ int state_record; /* record of previous state */
struct delayed_work llc_testlink_wrk; /* testlink worker */
struct completion llc_testlink_resp; /* wait for rx of testlink */
int llc_testlink_time; /* testlink interval */
@@ -557,6 +608,8 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
+void smcr_link_cluster_on_link_state(struct smc_link *lnk);
+
static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
{
return link->lgr;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 175026a..c1ce80b 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1099,6 +1099,7 @@ int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
goto out;
out_clear_lnk:
lnk_new->state = SMC_LNK_INACTIVE;
+ smcr_link_cluster_on_link_state(lnk_new);
smcr_link_clear(lnk_new, false);
out_reject:
smc_llc_cli_add_link_reject(qentry);
@@ -1278,6 +1279,7 @@ static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
return; /* no asymmetric link */
if (!smc_link_downing(&lnk_asym->state))
return;
+ smcr_link_cluster_on_link_state(lnk_asym);
lnk_new = smc_switch_conns(lgr, lnk_asym, false);
smc_wr_tx_wait_no_pending_sends(lnk_asym);
if (!lnk_new)
@@ -1492,6 +1494,7 @@ int smc_llc_srv_add_link(struct smc_link *link,
out_err:
if (link_new) {
link_new->state = SMC_LNK_INACTIVE;
+ smcr_link_cluster_on_link_state(link_new);
smcr_link_clear(link_new, false);
}
out:
@@ -1602,8 +1605,10 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
del_llc->reason = 0;
smc_llc_send_message(lnk, &qentry->msg); /* response */
- if (smc_link_downing(&lnk_del->state))
+ if (smc_link_downing(&lnk_del->state)) {
+ smcr_link_cluster_on_link_state(lnk);
smc_switch_conns(lgr, lnk_del, false);
+ }
smcr_link_clear(lnk_del, true);
active_links = smc_llc_active_link_count(lgr);
@@ -1676,6 +1681,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
goto out; /* asymmetric link already deleted */
if (smc_link_downing(&lnk_del->state)) {
+ smcr_link_cluster_on_link_state(lnk);
if (smc_switch_conns(lgr, lnk_del, false))
smc_wr_tx_wait_no_pending_sends(lnk_del);
}
@@ -2167,6 +2173,7 @@ void smc_llc_link_active(struct smc_link *link)
schedule_delayed_work(&link->llc_testlink_wrk,
link->llc_testlink_time);
}
+ smcr_link_cluster_on_link_state(link);
}
/* called in worker context */
--
1.8.3.1
* [PATCH net-next v2 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
` (9 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
As commit 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB
error cause by server") mentioned, it works only when all connection
creations are completely protected by smc_server_lgr_pending. Since we
have already removed this lock, we need to fix the issue again.
Fixes: 4940a1fdf31c ("net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 2 ++
net/smc/smc_core.c | 11 ++++++++---
net/smc/smc_core.h | 18 ++++++++++++++++++
3 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index d0e6bec..ddca170 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2411,6 +2411,7 @@ static void smc_listen_work(struct work_struct *work)
if (rc)
goto out_unlock;
}
+ smc_conn_leave_rtoken_pending(new_smc, ini);
smc_conn_save_peer_info(new_smc, cclc);
smc_listen_out_connected(new_smc);
SMC_STAT_SERV_SUCC_INC(sock_net(newclcsock->sk), ini);
@@ -2420,6 +2421,7 @@ static void smc_listen_work(struct work_struct *work)
if (ini->is_smcd)
mutex_unlock(&smc_server_lgr_pending);
out_decl:
+ smc_conn_leave_rtoken_pending(new_smc, ini);
smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
proposal_version);
out_free:
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index cfaddf2..f93c69f 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2167,14 +2167,19 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
lgr->vlan_id == ini->vlan_id) &&
(role == SMC_CLNT || ini->is_smcd ||
(lgr->conns_num < SMC_RMBS_PER_LGR_MAX &&
- !bitmap_full(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)))) {
+ (SMC_RMBS_PER_LGR_MAX -
+ bitmap_weight(lgr->rtokens_used_mask, SMC_RMBS_PER_LGR_MAX)
+ > atomic_read(&lgr->rtoken_pendings))))) {
/* link group found */
ini->first_contact_local = 0;
conn->lgr = lgr;
rc = smc_lgr_register_conn(conn, false);
write_unlock_bh(&lgr->conns_lock);
- if (!rc && delayed_work_pending(&lgr->free_work))
- cancel_delayed_work(&lgr->free_work);
+ if (!rc) {
+ smc_conn_enter_rtoken_pending(smc, ini);
+ if (delayed_work_pending(&lgr->free_work))
+ cancel_delayed_work(&lgr->free_work);
+ }
break;
}
write_unlock_bh(&lgr->conns_lock);
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 3c3bc11..a304ef3 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -298,6 +298,9 @@ struct smc_link_group {
struct rb_root conns_all; /* connection tree */
rwlock_t conns_lock; /* protects conns_all */
unsigned int conns_num; /* current # of connections */
+ atomic_t rtoken_pendings;/* number of connection that
+ * lgr assigned but no rtoken got yet
+ */
unsigned short vlan_id; /* vlan id of link group */
struct list_head sndbufs[SMC_RMBE_SIZES];/* tx buffers */
@@ -609,6 +612,21 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
void smcr_link_cluster_on_link_state(struct smc_link *lnk);
+static inline void smc_conn_enter_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
+{
+ struct smc_link_group *lgr = smc->conn.lgr;
+
+ if (lgr && !ini->first_contact_local)
+ atomic_inc(&lgr->rtoken_pendings);
+}
+
+static inline void smc_conn_leave_rtoken_pending(struct smc_sock *smc, struct smc_init_info *ini)
+{
+ struct smc_link_group *lgr = smc->conn.lgr;
+
+ if (lgr && !ini->first_contact_local)
+ atomic_dec(&lgr->rtoken_pendings);
+}
static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
{
--
1.8.3.1
* [PATCH net-next v2 03/10] net/smc: allow confirm/delete rkey response deliver multiplex
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D. Wythe
` (8 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
All flows except confirm_rkey and delete_rkey are exclusive, while the
confirm/delete rkey flows can run concurrently (local and remote).
Although the protocol allows this, all flows are actually mutually
exclusive in the implementation, because waiting for LLC messages is
serialized.
This aggravates the time needed to establish or destroy SMC-R
connections; connections have to be queued in smc_llc_wait.
We use the rtokens or rkey to correlate a confirm/delete rkey message
with its response.
Before sending a message, we put a context holding the rtokens or rkey
on the wait queue. When a response message is received, we wake up the
context whose rtokens or rkey match the response message.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/smc_llc.c | 174 +++++++++++++++++++++++++++++++++++++++++-------------
net/smc/smc_wr.c | 10 ----
net/smc/smc_wr.h | 10 ++++
3 files changed, 143 insertions(+), 51 deletions(-)
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index c1ce80b..2c8bfff 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -200,6 +200,7 @@ struct smc_llc_msg_delete_rkey_v2 { /* type 0x29 */
struct smc_llc_qentry {
struct list_head list;
struct smc_link *link;
+ void *private;
union smc_llc_msg msg;
};
@@ -479,19 +480,17 @@ int smc_llc_send_confirm_link(struct smc_link *link,
return rc;
}
-/* send LLC confirm rkey request */
-static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
- struct smc_buf_desc *rmb_desc)
+/* build LLC confirm rkey request */
+static int smc_llc_build_confirm_rkey_request(struct smc_link *send_link,
+ struct smc_buf_desc *rmb_desc,
+ struct smc_wr_tx_pend_priv **priv)
{
struct smc_llc_msg_confirm_rkey *rkeyllc;
- struct smc_wr_tx_pend_priv *pend;
struct smc_wr_buf *wr_buf;
struct smc_link *link;
int i, rc, rtok_ix;
- if (!smc_wr_tx_link_hold(send_link))
- return -ENOLINK;
- rc = smc_llc_add_pending_send(send_link, &wr_buf, &pend);
+ rc = smc_llc_add_pending_send(send_link, &wr_buf, priv);
if (rc)
goto put_out;
rkeyllc = (struct smc_llc_msg_confirm_rkey *)wr_buf;
@@ -521,25 +520,20 @@ static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
cpu_to_be64((u64)sg_dma_address
(rmb_desc->sgt[send_link->link_idx].sgl));
- /* send llc message */
- rc = smc_wr_tx_send(send_link, pend);
put_out:
- smc_wr_tx_link_put(send_link);
return rc;
}
-/* send LLC delete rkey request */
-static int smc_llc_send_delete_rkey(struct smc_link *link,
- struct smc_buf_desc *rmb_desc)
+/* build LLC delete rkey request */
+static int smc_llc_build_delete_rkey_request(struct smc_link *link,
+ struct smc_buf_desc *rmb_desc,
+ struct smc_wr_tx_pend_priv **priv)
{
struct smc_llc_msg_delete_rkey *rkeyllc;
- struct smc_wr_tx_pend_priv *pend;
struct smc_wr_buf *wr_buf;
int rc;
- if (!smc_wr_tx_link_hold(link))
- return -ENOLINK;
- rc = smc_llc_add_pending_send(link, &wr_buf, &pend);
+ rc = smc_llc_add_pending_send(link, &wr_buf, priv);
if (rc)
goto put_out;
rkeyllc = (struct smc_llc_msg_delete_rkey *)wr_buf;
@@ -548,10 +542,7 @@ static int smc_llc_send_delete_rkey(struct smc_link *link,
smc_llc_init_msg_hdr(&rkeyllc->hd, link->lgr, sizeof(*rkeyllc));
rkeyllc->num_rkeys = 1;
rkeyllc->rkey[0] = htonl(rmb_desc->mr[link->link_idx]->rkey);
- /* send llc message */
- rc = smc_wr_tx_send(link, pend);
put_out:
- smc_wr_tx_link_put(link);
return rc;
}
@@ -2023,7 +2014,8 @@ static void smc_llc_rx_response(struct smc_link *link,
case SMC_LLC_DELETE_RKEY:
if (flowtype != SMC_LLC_FLOW_RKEY || flow->qentry)
break; /* drop out-of-flow response */
- goto assign;
+ __wake_up(&link->lgr->llc_msg_waiter, TASK_NORMAL, 1, qentry);
+ goto free;
case SMC_LLC_CONFIRM_RKEY_CONT:
/* not used because max links is 3 */
break;
@@ -2032,6 +2024,7 @@ static void smc_llc_rx_response(struct smc_link *link,
qentry->msg.raw.hdr.common.type);
break;
}
+free:
kfree(qentry);
return;
assign:
@@ -2191,25 +2184,98 @@ void smc_llc_link_clear(struct smc_link *link, bool log)
cancel_delayed_work_sync(&link->llc_testlink_wrk);
}
+static int smc_llc_rkey_response_wake_function(struct wait_queue_entry *wq_entry,
+ unsigned int mode, int sync, void *key)
+{
+ struct smc_llc_qentry *except, *incoming;
+ u8 except_llc_type;
+
+ /* not a rkey response */
+ if (!key)
+ return 0;
+
+ except = wq_entry->private;
+ incoming = key;
+
+ except_llc_type = except->msg.raw.hdr.common.llc_type;
+
+ /* except LLC MSG TYPE mismatch */
+ if (except_llc_type != incoming->msg.raw.hdr.common.llc_type)
+ return 0;
+
+ switch (except_llc_type) {
+ case SMC_LLC_CONFIRM_RKEY:
+ if (memcmp(except->msg.confirm_rkey.rtoken, incoming->msg.confirm_rkey.rtoken,
+ sizeof(struct smc_rmb_rtoken) *
+ except->msg.confirm_rkey.rtoken[0].num_rkeys))
+ return 0;
+ break;
+ case SMC_LLC_DELETE_RKEY:
+ if (memcmp(except->msg.delete_rkey.rkey, incoming->msg.delete_rkey.rkey,
+ sizeof(__be32) * except->msg.delete_rkey.num_rkeys))
+ return 0;
+ break;
+ default:
+ pr_warn("smc: invalid except llc msg %d.\n", except_llc_type);
+ return 0;
+ }
+
+ /* match, save hdr */
+ memcpy(&except->msg.raw.hdr, &incoming->msg.raw.hdr, sizeof(except->msg.raw.hdr));
+
+ wq_entry->private = except->private;
+ return woken_wake_function(wq_entry, mode, sync, NULL);
+}
+
/* register a new rtoken at the remote peer (for all links) */
int smc_llc_do_confirm_rkey(struct smc_link *send_link,
struct smc_buf_desc *rmb_desc)
{
+ DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
struct smc_link_group *lgr = send_link->lgr;
- struct smc_llc_qentry *qentry = NULL;
- int rc = 0;
+ long timeout = SMC_LLC_WAIT_TIME;
+ struct smc_wr_tx_pend_priv *priv;
+ struct smc_llc_qentry qentry;
+ struct smc_wr_tx_pend *pend;
+ int rc = 0, flags;
- rc = smc_llc_send_confirm_rkey(send_link, rmb_desc);
+ if (!smc_wr_tx_link_hold(send_link))
+ return -ENOLINK;
+
+ rc = smc_llc_build_confirm_rkey_request(send_link, rmb_desc, &priv);
if (rc)
goto out;
- /* receive CONFIRM RKEY response from server over RoCE fabric */
- qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
- SMC_LLC_CONFIRM_RKEY);
- if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
+
+ pend = container_of(priv, struct smc_wr_tx_pend, priv);
+ /* make a copy of send msg */
+ memcpy(&qentry.msg.confirm_rkey, send_link->wr_tx_bufs[pend->idx].raw,
+ sizeof(qentry.msg.confirm_rkey));
+
+ qentry.private = wait.private;
+ wait.private = &qentry;
+
+ add_wait_queue(&lgr->llc_msg_waiter, &wait);
+
+ /* send llc message */
+ rc = smc_wr_tx_send(send_link, priv);
+ smc_wr_tx_link_put(send_link);
+ if (rc) {
+ remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+ goto out;
+ }
+
+ while (!signal_pending(current) && timeout) {
+ timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
+ if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
+ break;
+ }
+
+ remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+ flags = qentry.msg.raw.hdr.flags;
+
+ if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
rc = -EFAULT;
out:
- if (qentry)
- smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
return rc;
}
@@ -2217,26 +2283,52 @@ int smc_llc_do_confirm_rkey(struct smc_link *send_link,
int smc_llc_do_delete_rkey(struct smc_link_group *lgr,
struct smc_buf_desc *rmb_desc)
{
- struct smc_llc_qentry *qentry = NULL;
+ DEFINE_WAIT_FUNC(wait, smc_llc_rkey_response_wake_function);
+ long timeout = SMC_LLC_WAIT_TIME;
+ struct smc_wr_tx_pend_priv *priv;
+ struct smc_llc_qentry qentry;
+ struct smc_wr_tx_pend *pend;
struct smc_link *send_link;
- int rc = 0;
+ int rc = 0, flags;
send_link = smc_llc_usable_link(lgr);
- if (!send_link)
+ if (!send_link || !smc_wr_tx_link_hold(send_link))
return -ENOLINK;
- /* protected by llc_flow control */
- rc = smc_llc_send_delete_rkey(send_link, rmb_desc);
+ rc = smc_llc_build_delete_rkey_request(send_link, rmb_desc, &priv);
if (rc)
goto out;
- /* receive DELETE RKEY response from server over RoCE fabric */
- qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
- SMC_LLC_DELETE_RKEY);
- if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
+
+ pend = container_of(priv, struct smc_wr_tx_pend, priv);
+ /* make a copy of send msg */
+ memcpy(&qentry.msg.delete_link, send_link->wr_tx_bufs[pend->idx].raw,
+ sizeof(qentry.msg.delete_link));
+
+ qentry.private = wait.private;
+ wait.private = &qentry;
+
+ add_wait_queue(&lgr->llc_msg_waiter, &wait);
+
+ /* send llc message */
+ rc = smc_wr_tx_send(send_link, priv);
+ smc_wr_tx_link_put(send_link);
+ if (rc) {
+ remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+ goto out;
+ }
+
+ while (!signal_pending(current) && timeout) {
+ timeout = wait_woken(&wait, TASK_INTERRUPTIBLE, timeout);
+ if (qentry.msg.raw.hdr.flags & SMC_LLC_FLAG_RESP)
+ break;
+ }
+
+ remove_wait_queue(&lgr->llc_msg_waiter, &wait);
+ flags = qentry.msg.raw.hdr.flags;
+
+ if (!(flags & SMC_LLC_FLAG_RESP) || flags & SMC_LLC_FLAG_RKEY_NEG)
rc = -EFAULT;
out:
- if (qentry)
- smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
return rc;
}
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 26f8f24..52af94f 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -37,16 +37,6 @@
static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
-struct smc_wr_tx_pend { /* control data for a pending send request */
- u64 wr_id; /* work request id sent */
- smc_wr_tx_handler handler;
- enum ib_wc_status wc_status; /* CQE status */
- struct smc_link *link;
- u32 idx;
- struct smc_wr_tx_pend_priv priv;
- u8 compl_requested;
-};
-
/******************************** send queue *********************************/
/*------------------------------- completion --------------------------------*/
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index a54e90a..9946ed5 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -46,6 +46,16 @@ struct smc_wr_rx_handler {
u8 type;
};
+struct smc_wr_tx_pend { /* control data for a pending send request */
+ u64 wr_id; /* work request id sent */
+ smc_wr_tx_handler handler;
+ enum ib_wc_status wc_status; /* CQE status */
+ struct smc_link *link;
+ u32 idx;
+ struct smc_wr_tx_pend_priv priv;
+ u8 compl_requested;
+};
+
/* Only used by RDMA write WRs.
* All other WRs (CDC/LLC) use smc_wr_tx_send handling WR_ID implicitly
*/
--
1.8.3.1
* [PATCH net-next v2 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (2 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D. Wythe
` (7 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
Once confirm/delete rkey responses can be delivered multiplexed, we can
allow an SMC_LLC_FLOW_RKEY flow to be started (remote) or initiated
(local) in parallel.
This patch counts the flows executed in parallel; only when the count
reaches zero is the current flow type removed.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/smc_core.h | 1 +
net/smc/smc_llc.c | 69 +++++++++++++++++++++++++++++++++++++++++-------------
net/smc/smc_llc.h | 6 +++++
3 files changed, 60 insertions(+), 16 deletions(-)
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index a304ef3..65f9460 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -291,6 +291,7 @@ enum smc_llc_flowtype {
struct smc_llc_flow {
enum smc_llc_flowtype type;
struct smc_llc_qentry *qentry;
+ refcount_t parallel_refcnt;
};
struct smc_link_group {
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 2c8bfff..49d92e6 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -231,10 +231,18 @@ static inline void smc_llc_flow_qentry_set(struct smc_llc_flow *flow,
flow->qentry = qentry;
}
-static void smc_llc_flow_parallel(struct smc_link_group *lgr, u8 flow_type,
+static void smc_llc_flow_parallel(struct smc_link_group *lgr, struct smc_llc_flow *flow,
struct smc_llc_qentry *qentry)
{
u8 msg_type = qentry->msg.raw.hdr.common.llc_type;
+ u8 flow_type = flow->type;
+
+ /* SMC_LLC_FLOW_RKEY can be parallel */
+ if (flow_type == SMC_LLC_FLOW_RKEY &&
+ (msg_type == SMC_LLC_CONFIRM_RKEY || msg_type == SMC_LLC_DELETE_RKEY)) {
+ refcount_inc(&flow->parallel_refcnt);
+ return;
+ }
if ((msg_type == SMC_LLC_ADD_LINK || msg_type == SMC_LLC_DELETE_LINK) &&
flow_type != msg_type && !lgr->delayed_event) {
@@ -261,7 +269,7 @@ static bool smc_llc_flow_start(struct smc_llc_flow *flow,
spin_lock_bh(&lgr->llc_flow_lock);
if (flow->type) {
/* a flow is already active */
- smc_llc_flow_parallel(lgr, flow->type, qentry);
+ smc_llc_flow_parallel(lgr, flow, qentry);
spin_unlock_bh(&lgr->llc_flow_lock);
return false;
}
@@ -280,6 +288,7 @@ static bool smc_llc_flow_start(struct smc_llc_flow *flow,
flow->type = SMC_LLC_FLOW_NONE;
}
smc_llc_flow_qentry_set(flow, qentry);
+ refcount_set(&flow->parallel_refcnt, 1);
spin_unlock_bh(&lgr->llc_flow_lock);
return true;
}
@@ -289,6 +298,7 @@ int smc_llc_flow_initiate(struct smc_link_group *lgr,
enum smc_llc_flowtype type)
{
enum smc_llc_flowtype allowed_remote = SMC_LLC_FLOW_NONE;
+ bool accept = false;
int rc;
/* all flows except confirm_rkey and delete_rkey are exclusive,
@@ -300,10 +310,39 @@ int smc_llc_flow_initiate(struct smc_link_group *lgr,
if (list_empty(&lgr->list))
return -ENODEV;
spin_lock_bh(&lgr->llc_flow_lock);
- if (lgr->llc_flow_lcl.type == SMC_LLC_FLOW_NONE &&
- (lgr->llc_flow_rmt.type == SMC_LLC_FLOW_NONE ||
- lgr->llc_flow_rmt.type == allowed_remote)) {
- lgr->llc_flow_lcl.type = type;
+
+ /* Flow is initialized only if the following conditions are met:
+ * incoming flow local flow remote flow
+ * exclusive NONE NONE
+ * SMC_LLC_FLOW_RKEY SMC_LLC_FLOW_RKEY SMC_LLC_FLOW_RKEY
+ * SMC_LLC_FLOW_RKEY NONE SMC_LLC_FLOW_RKEY
+ * SMC_LLC_FLOW_RKEY SMC_LLC_FLOW_RKEY NONE
+ */
+ switch (type) {
+ case SMC_LLC_FLOW_RKEY:
+ if (!SMC_IS_PARALLEL_FLOW(lgr->llc_flow_lcl.type))
+ break;
+ if (!SMC_IS_PARALLEL_FLOW(lgr->llc_flow_rmt.type))
+ break;
+ /* accepted */
+ accept = true;
+ break;
+ default:
+ if (!SMC_IS_NONE_FLOW(lgr->llc_flow_lcl.type))
+ break;
+ if (!SMC_IS_NONE_FLOW(lgr->llc_flow_rmt.type))
+ break;
+ /* accepted */
+ accept = true;
+ break;
+ }
+ if (accept) {
+ if (SMC_IS_NONE_FLOW(lgr->llc_flow_lcl.type)) {
+ lgr->llc_flow_lcl.type = type;
+ refcount_set(&lgr->llc_flow_lcl.parallel_refcnt, 1);
+ } else {
+ refcount_inc(&lgr->llc_flow_lcl.parallel_refcnt);
+ }
spin_unlock_bh(&lgr->llc_flow_lock);
return 0;
}
@@ -322,6 +361,10 @@ int smc_llc_flow_initiate(struct smc_link_group *lgr,
void smc_llc_flow_stop(struct smc_link_group *lgr, struct smc_llc_flow *flow)
{
spin_lock_bh(&lgr->llc_flow_lock);
+ if (!refcount_dec_and_test(&flow->parallel_refcnt)) {
+ spin_unlock_bh(&lgr->llc_flow_lock);
+ return;
+ }
memset(flow, 0, sizeof(*flow));
flow->type = SMC_LLC_FLOW_NONE;
spin_unlock_bh(&lgr->llc_flow_lock);
@@ -1729,16 +1772,14 @@ static void smc_llc_delete_link_work(struct work_struct *work)
}
/* process a confirm_rkey request from peer, remote flow */
-static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr)
+static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr, struct smc_llc_qentry *qentry)
{
struct smc_llc_msg_confirm_rkey *llc;
- struct smc_llc_qentry *qentry;
struct smc_link *link;
int num_entries;
int rk_idx;
int i;
- qentry = lgr->llc_flow_rmt.qentry;
llc = &qentry->msg.confirm_rkey;
link = qentry->link;
@@ -1765,19 +1806,16 @@ static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr)
llc->hd.flags |= SMC_LLC_FLAG_RESP;
smc_llc_init_msg_hdr(&llc->hd, link->lgr, sizeof(*llc));
smc_llc_send_message(link, &qentry->msg);
- smc_llc_flow_qentry_del(&lgr->llc_flow_rmt);
}
/* process a delete_rkey request from peer, remote flow */
-static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr)
+static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr, struct smc_llc_qentry *qentry)
{
struct smc_llc_msg_delete_rkey *llc;
- struct smc_llc_qentry *qentry;
struct smc_link *link;
u8 err_mask = 0;
int i, max;
- qentry = lgr->llc_flow_rmt.qentry;
llc = &qentry->msg.delete_rkey;
link = qentry->link;
@@ -1815,7 +1853,6 @@ static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr)
finish:
llc->hd.flags |= SMC_LLC_FLAG_RESP;
smc_llc_send_message(link, &qentry->msg);
- smc_llc_flow_qentry_del(&lgr->llc_flow_rmt);
}
static void smc_llc_protocol_violation(struct smc_link_group *lgr, u8 type)
@@ -1916,7 +1953,7 @@ static void smc_llc_event_handler(struct smc_llc_qentry *qentry)
/* new request from remote, assign to remote flow */
if (smc_llc_flow_start(&lgr->llc_flow_rmt, qentry)) {
/* process here, does not wait for more llc msgs */
- smc_llc_rmt_conf_rkey(lgr);
+ smc_llc_rmt_conf_rkey(lgr, qentry);
smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
}
return;
@@ -1929,7 +1966,7 @@ static void smc_llc_event_handler(struct smc_llc_qentry *qentry)
/* new request from remote, assign to remote flow */
if (smc_llc_flow_start(&lgr->llc_flow_rmt, qentry)) {
/* process here, does not wait for more llc msgs */
- smc_llc_rmt_delete_rkey(lgr);
+ smc_llc_rmt_delete_rkey(lgr, qentry);
smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
}
return;
diff --git a/net/smc/smc_llc.h b/net/smc/smc_llc.h
index 4404e52..005a81e 100644
--- a/net/smc/smc_llc.h
+++ b/net/smc/smc_llc.h
@@ -48,6 +48,12 @@ enum smc_llc_msg_type {
#define smc_link_downing(state) \
(cmpxchg(state, SMC_LNK_ACTIVE, SMC_LNK_INACTIVE) == SMC_LNK_ACTIVE)
+#define SMC_IS_NONE_FLOW(type) \
+ ((type) == SMC_LLC_FLOW_NONE)
+
+#define SMC_IS_PARALLEL_FLOW(type) \
+ (((type) == SMC_LLC_FLOW_RKEY) || SMC_IS_NONE_FLOW(type))
+
/* LLC DELETE LINK Request Reason Codes */
#define SMC_LLC_DEL_LOST_PATH 0x00010000
#define SMC_LLC_DEL_OP_INIT_TERM 0x00020000
--
1.8.3.1
* [PATCH net-next v2 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (3 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D. Wythe
` (6 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
llc_conf_mutex was used to protect links and link-related configuration
in the same link group, for example adding or deleting links. However,
in most cases the protected critical section has only read semantics
and no write semantics at all, such as obtaining a usable link or an
available rmb_desc.
This patch simply refactors the code: replace the mutex with an
rw_semaphore, mutex_lock with down_write, and mutex_unlock with
up_write.
Theoretically this replacement is equivalent, but after this patch we
can distinguish lock granularity according to the different semantics
of the critical sections.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 8 ++++----
net/smc/smc_core.c | 20 ++++++++++----------
net/smc/smc_core.h | 2 +-
net/smc/smc_llc.c | 18 +++++++++---------
4 files changed, 24 insertions(+), 24 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index ddca170..04d79d1 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -498,7 +498,7 @@ static int smcr_lgr_reg_sndbufs(struct smc_link *link,
return -EINVAL;
/* protect against parallel smcr_link_reg_buf() */
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
if (!smc_link_active(&lgr->lnk[i]))
continue;
@@ -506,7 +506,7 @@ static int smcr_lgr_reg_sndbufs(struct smc_link *link,
if (rc)
break;
}
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
return rc;
}
@@ -523,7 +523,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
/* protect against parallel smc_llc_cli_rkey_exchange() and
* parallel smcr_link_reg_buf()
*/
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
if (!smc_link_active(&lgr->lnk[i]))
continue;
@@ -540,7 +540,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
}
rmb_desc->is_conf_rkey = true;
out:
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
return rc;
}
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index f93c69f..95d8438 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1362,10 +1362,10 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
if (!rc) {
/* protect against smc_llc_cli_rkey_exchange() */
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
smc_llc_do_delete_rkey(lgr, buf_desc);
buf_desc->is_conf_rkey = false;
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
}
}
@@ -1635,12 +1635,12 @@ static void smc_lgr_free(struct smc_link_group *lgr)
int i;
if (!lgr->is_smcd) {
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
if (lgr->lnk[i].state != SMC_LNK_UNUSED)
smcr_link_clear(&lgr->lnk[i], false);
}
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
smc_llc_lgr_clear(lgr);
}
@@ -1954,12 +1954,12 @@ static void smcr_link_down(struct smc_link *lnk)
} else {
if (lgr->llc_flow_lcl.type != SMC_LLC_FLOW_NONE) {
/* another llc task is ongoing */
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
wait_event_timeout(lgr->llc_flow_waiter,
(list_empty(&lgr->list) ||
lgr->llc_flow_lcl.type == SMC_LLC_FLOW_NONE),
SMC_LLC_WAIT_TIME);
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
}
if (!list_empty(&lgr->list)) {
smc_llc_send_delete_link(to_lnk, del_link_id,
@@ -2021,9 +2021,9 @@ static void smc_link_down_work(struct work_struct *work)
if (list_empty(&lgr->list))
return;
wake_up_all(&lgr->llc_msg_waiter);
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
smcr_link_down(link);
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
}
static int smc_vlan_by_tcpsk_walk(struct net_device *lower_dev,
@@ -2567,7 +2567,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
int i, rc = 0;
/* protect against parallel link reconfiguration */
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
struct smc_link *lnk = &lgr->lnk[i];
@@ -2579,7 +2579,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
}
}
out:
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
return rc;
}
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 65f9460..3035ddb 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -351,7 +351,7 @@ struct smc_link_group {
/* queue for llc events */
spinlock_t llc_event_q_lock;
/* protects llc_event_q */
- struct mutex llc_conf_mutex;
+ struct rw_semaphore llc_conf_mutex;
/* protects lgr reconfig. */
struct work_struct llc_add_link_work;
struct work_struct llc_del_link_work;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 49d92e6..f563f95 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1237,12 +1237,12 @@ static void smc_llc_process_cli_add_link(struct smc_link_group *lgr)
qentry = smc_llc_flow_qentry_clr(&lgr->llc_flow_lcl);
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
if (smc_llc_is_local_add_link(&qentry->msg))
smc_llc_cli_add_link_invite(qentry->link, qentry);
else
smc_llc_cli_add_link(qentry->link, qentry);
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
}
static int smc_llc_active_link_count(struct smc_link_group *lgr)
@@ -1546,13 +1546,13 @@ static void smc_llc_process_srv_add_link(struct smc_link_group *lgr)
qentry = smc_llc_flow_qentry_clr(&lgr->llc_flow_lcl);
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
rc = smc_llc_srv_add_link(link, qentry);
if (!rc && lgr->type == SMC_LGR_SYMMETRIC) {
/* delete any asymmetric link */
smc_llc_delete_asym_link(lgr);
}
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
kfree(qentry);
}
@@ -1619,7 +1619,7 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
smc_lgr_terminate_sched(lgr);
goto out;
}
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
/* delete single link */
for (lnk_idx = 0; lnk_idx < SMC_LINKS_PER_LGR_MAX; lnk_idx++) {
if (lgr->lnk[lnk_idx].link_id != del_llc->link_num)
@@ -1655,7 +1655,7 @@ static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
smc_lgr_terminate_sched(lgr);
}
out_unlock:
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
out:
kfree(qentry);
}
@@ -1691,7 +1691,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
int active_links;
int i;
- mutex_lock(&lgr->llc_conf_mutex);
+ down_write(&lgr->llc_conf_mutex);
qentry = smc_llc_flow_qentry_clr(&lgr->llc_flow_lcl);
lnk = qentry->link;
del_llc = &qentry->msg.delete_link;
@@ -1748,7 +1748,7 @@ static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
smc_llc_add_link_local(lnk);
}
out:
- mutex_unlock(&lgr->llc_conf_mutex);
+ up_write(&lgr->llc_conf_mutex);
kfree(qentry);
}
@@ -2162,7 +2162,7 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
spin_lock_init(&lgr->llc_flow_lock);
init_waitqueue_head(&lgr->llc_flow_waiter);
init_waitqueue_head(&lgr->llc_msg_waiter);
- mutex_init(&lgr->llc_conf_mutex);
+ init_rwsem(&lgr->llc_conf_mutex);
lgr->llc_testlink_time = READ_ONCE(net->ipv4.sysctl_tcp_keepalive_time);
}
--
1.8.3.1
* [PATCH net-next v2 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse()
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (4 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
` (5 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
The following is part of the off-CPU graph during frequent short-lived
SMC-R processing:
process_one_work (51.19%)
smc_close_passive_work (28.36%)
smcr_buf_unuse (28.34%)
rwsem_down_write_slowpath (28.22%)
smc_listen_work (22.83%)
smc_clc_wait_msg (1.84%)
smc_buf_create (20.45%)
smcr_buf_map_usable_links
rwsem_down_write_slowpath (20.43%)
smcr_lgr_reg_rmbs (0.53%)
rwsem_down_write_slowpath (0.43%)
smc_llc_do_confirm_rkey (0.08%)
We can clearly see that during connection establishment, connections
spend their waiting time not on IO but on llc_conf_mutex.
More importantly, the core critical sections (smcr_buf_unuse() &
smc_buf_create()) only perform read semantics on the links, so we can
easily switch them to the read side of the semaphore.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/smc_core.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 95d8438..1f66f14 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1362,10 +1362,10 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
if (!rc) {
/* protect against smc_llc_cli_rkey_exchange() */
- down_write(&lgr->llc_conf_mutex);
+ down_read(&lgr->llc_conf_mutex);
smc_llc_do_delete_rkey(lgr, buf_desc);
buf_desc->is_conf_rkey = false;
- up_write(&lgr->llc_conf_mutex);
+ up_read(&lgr->llc_conf_mutex);
smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
}
}
@@ -2567,7 +2567,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
int i, rc = 0;
/* protect against parallel link reconfiguration */
- down_write(&lgr->llc_conf_mutex);
+ down_read(&lgr->llc_conf_mutex);
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
struct smc_link *lnk = &lgr->lnk[i];
@@ -2579,7 +2579,7 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
}
}
out:
- up_write(&lgr->llc_conf_mutex);
+ up_read(&lgr->llc_conf_mutex);
return rc;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v2 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (5 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
` (4 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() must
run exclusively when the assigned rmb_desc has not been registered yet,
although it can run in parallel once the rmb_desc is already registered
and only read semantics are performed on it. Hence, we cannot simply
replace the write semaphore with a read semaphore.
The idea here is: if the assigned rmb_desc is already registered, take
the semaphore for reading to protect the critical section; if it is not
yet registered, keep taking the write semaphore to preserve exclusivity.
Thanks to the reusability of rmb_desc, this allows us to run in
parallel in most cases.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 04d79d1..e865f5e 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -515,11 +515,26 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
struct smc_buf_desc *rmb_desc)
{
struct smc_link_group *lgr = link->lgr;
+ bool do_slow = false;
int i, rc = 0;
rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
if (rc)
return rc;
+
+ down_read(&lgr->llc_conf_mutex);
+ for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+ if (!smc_link_active(&lgr->lnk[i]))
+ continue;
+ if (!rmb_desc->is_reg_mr[link->link_idx]) {
+ up_read(&lgr->llc_conf_mutex);
+ goto slow_path;
+ }
+ }
+ /* mr register already */
+ goto fast_path;
+slow_path:
+ do_slow = true;
/* protect against parallel smc_llc_cli_rkey_exchange() and
* parallel smcr_link_reg_buf()
*/
@@ -531,7 +546,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
if (rc)
goto out;
}
-
+fast_path:
/* exchange confirm_rkey msg with peer */
rc = smc_llc_do_confirm_rkey(link, rmb_desc);
if (rc) {
@@ -540,7 +555,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
}
rmb_desc->is_conf_rkey = true;
out:
- up_write(&lgr->llc_conf_mutex);
+ do_slow ? up_write(&lgr->llc_conf_mutex) : up_read(&lgr->llc_conf_mutex);
smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
return rc;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v2 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (6 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 09/10] net/smc: Fix potential panic dues to unprotected smc_llc_srv_add_link() D. Wythe
` (3 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
It is clear that rmbs_lock and sndbufs_lock aim to protect the
rmbs list and the sndbufs list.
During connection establishment, smc_buf_get_slot() is always
invoked, and it only performs read semantics on the rmbs and
sndbufs lists.
Based on the above considerations, we replace the mutexes with
rw_semaphores. Only smc_buf_get_slot() uses down_read(), allowing
multiple invocations to run concurrently; all other users take
down_write() to keep exclusive semantics.
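The reason down_read() is sufficient in smc_buf_get_slot() is that the actual claim of a buffer already happens via cmpxchg(), which is atomic on its own; the lock only has to keep the list stable while it is walked. A minimal userspace sketch of that pattern, using C11 atomics and an array instead of the kernel list (all names illustrative, not the kernel API):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical slot: the atomic 'used' flag mirrors buf_desc->used. */
struct buf_slot {
    atomic_int used;    /* 0 = free, 1 = claimed */
};

/* Shape of smc_buf_get_slot() after the patch: the walk happens under
 * what would be down_read(lock) (not modeled here), and the claim is an
 * atomic compare-exchange, so many walkers may scan concurrently while
 * each slot is still handed out exactly once. */
static struct buf_slot *buf_get_slot(struct buf_slot *slots, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&slots[i].used, &expected, 1))
            return &slots[i];   /* won the race for this slot */
    }
    return NULL;                /* all slots in use */
}

/* Two claims succeed on distinct slots, a third finds nothing free. */
static int demo(void)
{
    struct buf_slot slots[2] = { { 0 }, { 0 } };

    if (buf_get_slot(slots, 2) != &slots[0])
        return 1;
    if (buf_get_slot(slots, 2) != &slots[1])
        return 1;
    if (buf_get_slot(slots, 2) != NULL)
        return 1;
    return 0;
}
```

Writers that add or delete list entries still need exclusive access, which is why every other path in the patch takes down_write().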
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/smc_core.c | 55 +++++++++++++++++++++++++++---------------------------
net/smc/smc_core.h | 4 ++--
net/smc/smc_llc.c | 16 ++++++++--------
3 files changed, 38 insertions(+), 37 deletions(-)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 1f66f14..84bf84c 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1112,8 +1112,8 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
lgr->freeing = 0;
lgr->vlan_id = ini->vlan_id;
refcount_set(&lgr->refcnt, 1); /* set lgr refcnt to 1 */
- mutex_init(&lgr->sndbufs_lock);
- mutex_init(&lgr->rmbs_lock);
+ init_rwsem(&lgr->sndbufs_lock);
+ init_rwsem(&lgr->rmbs_lock);
rwlock_init(&lgr->conns_lock);
for (i = 0; i < SMC_RMBE_SIZES; i++) {
INIT_LIST_HEAD(&lgr->sndbufs[i]);
@@ -1354,7 +1354,7 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
struct smc_link_group *lgr)
{
- struct mutex *lock; /* lock buffer list */
+ struct rw_semaphore *lock; /* lock buffer list */
int rc;
if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
@@ -1374,9 +1374,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
/* buf registration failed, reuse not possible */
lock = is_rmb ? &lgr->rmbs_lock :
&lgr->sndbufs_lock;
- mutex_lock(lock);
+ down_write(lock);
list_del(&buf_desc->list);
- mutex_unlock(lock);
+ up_write(lock);
smc_buf_free(lgr, is_rmb, buf_desc);
} else {
@@ -1480,15 +1480,16 @@ static void smcr_buf_unmap_lgr(struct smc_link *lnk)
int i;
for (i = 0; i < SMC_RMBE_SIZES; i++) {
- mutex_lock(&lgr->rmbs_lock);
+ down_write(&lgr->rmbs_lock);
list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list)
smcr_buf_unmap_link(buf_desc, true, lnk);
- mutex_unlock(&lgr->rmbs_lock);
- mutex_lock(&lgr->sndbufs_lock);
+ up_write(&lgr->rmbs_lock);
+
+ down_write(&lgr->sndbufs_lock);
list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i],
list)
smcr_buf_unmap_link(buf_desc, false, lnk);
- mutex_unlock(&lgr->sndbufs_lock);
+ up_write(&lgr->sndbufs_lock);
}
}
@@ -2310,19 +2311,19 @@ int smc_uncompress_bufsize(u8 compressed)
* buffer size; if not available, return NULL
*/
static struct smc_buf_desc *smc_buf_get_slot(int compressed_bufsize,
- struct mutex *lock,
+ struct rw_semaphore *lock,
struct list_head *buf_list)
{
struct smc_buf_desc *buf_slot;
- mutex_lock(lock);
+ down_read(lock);
list_for_each_entry(buf_slot, buf_list, list) {
if (cmpxchg(&buf_slot->used, 0, 1) == 0) {
- mutex_unlock(lock);
+ up_read(lock);
return buf_slot;
}
}
- mutex_unlock(lock);
+ up_read(lock);
return NULL;
}
@@ -2431,13 +2432,13 @@ int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
return 0;
}
-static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
+static int _smcr_buf_map_lgr(struct smc_link *lnk, struct rw_semaphore *lock,
struct list_head *lst, bool is_rmb)
{
struct smc_buf_desc *buf_desc, *bf;
int rc = 0;
- mutex_lock(lock);
+ down_write(lock);
list_for_each_entry_safe(buf_desc, bf, lst, list) {
if (!buf_desc->used)
continue;
@@ -2446,7 +2447,7 @@ static int _smcr_buf_map_lgr(struct smc_link *lnk, struct mutex *lock,
goto out;
}
out:
- mutex_unlock(lock);
+ up_write(lock);
return rc;
}
@@ -2479,37 +2480,37 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
int i, rc = 0;
/* reg all RMBs for a new link */
- mutex_lock(&lgr->rmbs_lock);
+ down_write(&lgr->rmbs_lock);
for (i = 0; i < SMC_RMBE_SIZES; i++) {
list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
if (!buf_desc->used)
continue;
rc = smcr_link_reg_buf(lnk, buf_desc);
if (rc) {
- mutex_unlock(&lgr->rmbs_lock);
+ up_write(&lgr->rmbs_lock);
return rc;
}
}
}
- mutex_unlock(&lgr->rmbs_lock);
+ up_write(&lgr->rmbs_lock);
if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
return rc;
/* reg all vzalloced sndbufs for a new link */
- mutex_lock(&lgr->sndbufs_lock);
+ down_write(&lgr->sndbufs_lock);
for (i = 0; i < SMC_RMBE_SIZES; i++) {
list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
if (!buf_desc->used || !buf_desc->is_vm)
continue;
rc = smcr_link_reg_buf(lnk, buf_desc);
if (rc) {
- mutex_unlock(&lgr->sndbufs_lock);
+ up_write(&lgr->sndbufs_lock);
return rc;
}
}
}
- mutex_unlock(&lgr->sndbufs_lock);
+ up_write(&lgr->sndbufs_lock);
return rc;
}
@@ -2627,7 +2628,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
struct list_head *buf_list;
int bufsize, bufsize_short;
bool is_dgraded = false;
- struct mutex *lock; /* lock buffer list */
+ struct rw_semaphore *lock; /* lock buffer list */
int sk_buf_size;
if (is_rmb)
@@ -2675,9 +2676,9 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
SMC_STAT_RMB_ALLOC(smc, is_smcd, is_rmb);
SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
buf_desc->used = 1;
- mutex_lock(lock);
+ down_write(lock);
list_add(&buf_desc->list, buf_list);
- mutex_unlock(lock);
+ up_write(lock);
break; /* found */
}
@@ -2751,9 +2752,9 @@ int smc_buf_create(struct smc_sock *smc, bool is_smcd)
/* create rmb */
rc = __smc_buf_create(smc, is_smcd, true);
if (rc) {
- mutex_lock(&smc->conn.lgr->sndbufs_lock);
+ down_write(&smc->conn.lgr->sndbufs_lock);
list_del(&smc->conn.sndbuf_desc->list);
- mutex_unlock(&smc->conn.lgr->sndbufs_lock);
+ up_write(&smc->conn.lgr->sndbufs_lock);
smc_buf_free(smc->conn.lgr, false, smc->conn.sndbuf_desc);
smc->conn.sndbuf_desc = NULL;
}
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 3035ddb..a501ed5 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -305,9 +305,9 @@ struct smc_link_group {
unsigned short vlan_id; /* vlan id of link group */
struct list_head sndbufs[SMC_RMBE_SIZES];/* tx buffers */
- struct mutex sndbufs_lock; /* protects tx buffers */
+ struct rw_semaphore sndbufs_lock; /* protects tx buffers */
struct list_head rmbs[SMC_RMBE_SIZES]; /* rx buffers */
- struct mutex rmbs_lock; /* protects rx buffers */
+ struct rw_semaphore rmbs_lock; /* protects rx buffers */
u8 id[SMC_LGR_ID_SIZE]; /* unique lgr id */
struct delayed_work free_work; /* delayed freeing of an lgr */
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f563f95..61a24e1 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -642,7 +642,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
prim_lnk_idx = link->link_idx;
lnk_idx = link_new->link_idx;
- mutex_lock(&lgr->rmbs_lock);
+ down_write(&lgr->rmbs_lock);
ext->num_rkeys = lgr->conns_num;
if (!ext->num_rkeys)
goto out;
@@ -662,7 +662,7 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
}
len += i * sizeof(ext->rt[0]);
out:
- mutex_unlock(&lgr->rmbs_lock);
+ up_write(&lgr->rmbs_lock);
return len;
}
@@ -923,7 +923,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
int rc = 0;
int i;
- mutex_lock(&lgr->rmbs_lock);
+ down_write(&lgr->rmbs_lock);
num_rkeys_send = lgr->conns_num;
buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
do {
@@ -950,7 +950,7 @@ static int smc_llc_cli_rkey_exchange(struct smc_link *link,
break;
} while (num_rkeys_send || num_rkeys_recv);
- mutex_unlock(&lgr->rmbs_lock);
+ up_write(&lgr->rmbs_lock);
return rc;
}
@@ -1033,14 +1033,14 @@ static void smc_llc_save_add_link_rkeys(struct smc_link *link,
ext = (struct smc_llc_msg_add_link_v2_ext *)((u8 *)lgr->wr_rx_buf_v2 +
SMC_WR_TX_SIZE);
max = min_t(u8, ext->num_rkeys, SMC_LLC_RKEYS_PER_MSG_V2);
- mutex_lock(&lgr->rmbs_lock);
+ down_write(&lgr->rmbs_lock);
for (i = 0; i < max; i++) {
smc_rtoken_set(lgr, link->link_idx, link_new->link_idx,
ext->rt[i].rmb_key,
ext->rt[i].rmb_vaddr_new,
ext->rt[i].rmb_key_new);
}
- mutex_unlock(&lgr->rmbs_lock);
+ up_write(&lgr->rmbs_lock);
}
static void smc_llc_save_add_link_info(struct smc_link *link,
@@ -1349,7 +1349,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
int rc = 0;
int i;
- mutex_lock(&lgr->rmbs_lock);
+ down_write(&lgr->rmbs_lock);
num_rkeys_send = lgr->conns_num;
buf_pos = smc_llc_get_first_rmb(lgr, &buf_lst);
do {
@@ -1374,7 +1374,7 @@ static int smc_llc_srv_rkey_exchange(struct smc_link *link,
smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
} while (num_rkeys_send || num_rkeys_recv);
out:
- mutex_unlock(&lgr->rmbs_lock);
+ up_write(&lgr->rmbs_lock);
return rc;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v2 09/10] net/smc: Fix potential panic dues to unprotected smc_llc_srv_add_link()
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (7 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 10/10] net/smc: fix application data exception D. Wythe
` (2 subsequent siblings)
11 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
After we optimize the parallelism of SMC-R connection establishment,
there is a certain chance of triggering the following panic:
PID: 5900 TASK: ffff88c1c8af4100 CPU: 1 COMMAND: "kworker/1:48"
#0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
#1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
#2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
#3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
#4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
#5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
#6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
[exception RIP: ib_alloc_mr+19]
RIP: ffffffffc0c9cce3 RSP: ffff9456c1cc7c38 RFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000004
RDX: 0000000000000010 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88c1ea281d00 R8: 000000020a34ffff R9: ffff88c1350bbb20
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000010 R14: ffff88c1ab040a50 R15: ffff88c1ea281d00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
#8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
#9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
The reason is that when the server tries to create a second link,
smc_llc_srv_add_link() runs without protection and may add a new link
to the link group. This breaks the consistency otherwise guaranteed by
llc_conf_mutex.
Fixes: 2d2209f20189 ("net/smc: first part of add link processing as SMC server")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/af_smc.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e865f5e..763601e 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1834,8 +1834,10 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc)
smc_llc_link_active(link);
smcr_lgr_set_type(link->lgr, SMC_LGR_SINGLE);
+ down_write(&link->lgr->llc_conf_mutex);
/* initial contact - try to establish second link */
smc_llc_srv_add_link(link, NULL);
+ up_write(&link->lgr->llc_conf_mutex);
return 0;
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH net-next v2 10/10] net/smc: fix application data exception
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (8 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 09/10] net/smc: Fix potential panic dues to unprotected smc_llc_srv_add_link() D. Wythe
@ 2022-08-26 9:51 ` D. Wythe
2022-09-08 9:37 ` Wen Gu
2022-08-27 1:32 ` [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections Jakub Kicinski
2022-09-09 6:59 ` Jan Karcher
11 siblings, 1 reply; 22+ messages in thread
From: D. Wythe @ 2022-08-26 9:51 UTC (permalink / raw)
To: kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma, D. Wythe
From: "D. Wythe" <alibuda@linux.alibaba.com>
After we optimize the parallelism of SMC-R connection
establishment, there is a certain probability that the following
exceptions occur in the wrk benchmark test:
Running 10s test @ http://11.213.45.6:80
8 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.72ms 13.94ms 245.33ms 94.17%
Req/Sec 1.96k 713.67 5.41k 75.16%
155262 requests in 10.10s, 23.10MB read
Non-2xx or 3xx responses: 3
The failing responses are HTTP 400 errors, which is a serious
exception in our test: it means the application data was corrupted.
Consider the following scenarios:
CPU0 CPU1
buf_desc->used = 0;
cmpxchg(buf_desc->used, 0, 1)
deal_with(buf_desc)
memset(buf_desc->cpu_addr,0);
This will cause the data received by a victim connection to be cleared,
thus triggering an HTTP 400 error in the server.
This patch swaps the order of clearing 'used' and zeroing the buffer,
and adds a barrier to ensure memory consistency.
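The ordering requirement can be modeled in userspace with C11 atomics. This is a simplified sketch under assumptions, not the kernel code: buf_unuse() and buf_claim() stand in for smcr_buf_unuse() and the cmpxchg() in smc_buf_get_slot(), and release/acquire ordering stands in for the kernel's barrier.

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical buffer modeling struct smc_buf_desc: 'used' is claimed
 * with a compare-exchange, 'data' models cpu_addr. */
struct buf {
    char data[16];
    atomic_int used;
};

/* Fixed order (this patch): zero the payload first, then release-store
 * used = 0.  Before the patch the order was reversed, so a connection
 * that won cmpxchg(used, 0, 1) could race with the memset and have its
 * freshly received data wiped. */
static void buf_unuse(struct buf *b)
{
    memset(b->data, 0, sizeof(b->data));          /* memzero first */
    atomic_store_explicit(&b->used, 0,
                          memory_order_release);  /* then publish */
}

/* The acquire here pairs with the release above: a successful claim is
 * guaranteed to observe the completed memset, never a half-cleared
 * buffer. */
static struct buf *buf_claim(struct buf *b)
{
    int expected = 0;

    if (atomic_compare_exchange_strong_explicit(&b->used, &expected, 1,
                                                memory_order_acquire,
                                                memory_order_relaxed))
        return b;
    return NULL;
}

/* Unuse then reclaim: the new owner must see a fully zeroed buffer. */
static int demo(void)
{
    struct buf b = { .data = "stale", .used = 1 };

    buf_unuse(&b);
    if (buf_claim(&b) != &b)
        return 1;
    return b.data[0] == '\0' ? 0 : 1;
}
```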
Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
net/smc/smc_core.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 84bf84c..fdad953 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1380,8 +1380,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
smc_buf_free(lgr, is_rmb, buf_desc);
} else {
- buf_desc->used = 0;
- memset(buf_desc->cpu_addr, 0, buf_desc->len);
+ /* memzero_explicit provides potential memory barrier semantics */
+ memzero_explicit(buf_desc->cpu_addr, buf_desc->len);
+ WRITE_ONCE(buf_desc->used, 0);
}
}
--
1.8.3.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (9 preceding siblings ...)
2022-08-26 9:51 ` [PATCH net-next v2 10/10] net/smc: fix application data exception D. Wythe
@ 2022-08-27 1:32 ` Jakub Kicinski
2022-08-29 3:25 ` Tony Lu
2022-08-29 3:28 ` D. Wythe
2022-09-09 6:59 ` Jan Karcher
11 siblings, 2 replies; 22+ messages in thread
From: Jakub Kicinski @ 2022-08-27 1:32 UTC (permalink / raw)
To: D. Wythe; +Cc: kgraul, wenjia, davem, netdev, linux-s390, linux-rdma
On Fri, 26 Aug 2022 17:51:27 +0800 D. Wythe wrote:
> This patch set attempts to optimize the parallelism of SMC-R connections,
> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> occur after thoses optimization.
>
> According to Off-CPU graph, SMC worker's off-CPU as that:
>
> smc_close_passive_work (1.09%)
> smcr_buf_unuse (1.08%)
> smc_llc_flow_initiate (1.02%)
>
> smc_listen_work (48.17%)
> __mutex_lock.isra.11 (47.96%)
The patches should be ordered so that the prerequisite changes are
first, then the removal of locks. Looks like there are 3 patches here
which carry a Fixes tag, for an old commit but in fact IIUC there is no
bug in those old commits, the problem only appears after the locking is
removed?
That said please wait for IBM folks to review first before reshuffling
the patches, I presume the code itself won't change.
Also I still haven't seen anyone reply to Al Viro; IIRC he was
complaining about changes someone from your team has made.
I consider this a blocker for applying new patches from your team :(
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections
2022-08-27 1:32 ` [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections Jakub Kicinski
@ 2022-08-29 3:25 ` Tony Lu
2022-08-29 3:28 ` D. Wythe
1 sibling, 0 replies; 22+ messages in thread
From: Tony Lu @ 2022-08-29 3:25 UTC (permalink / raw)
To: Jakub Kicinski
Cc: D. Wythe, kgraul, wenjia, davem, netdev, linux-s390, linux-rdma
On Fri, Aug 26, 2022 at 06:32:13PM -0700, Jakub Kicinski wrote:
> On Fri, 26 Aug 2022 17:51:27 +0800 D. Wythe wrote:
> > This patch set attempts to optimize the parallelism of SMC-R connections,
> > mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> > occur after thoses optimization.
> >
> > According to Off-CPU graph, SMC worker's off-CPU as that:
> >
> > smc_close_passive_work (1.09%)
> > smcr_buf_unuse (1.08%)
> > smc_llc_flow_initiate (1.02%)
> >
> > smc_listen_work (48.17%)
> > __mutex_lock.isra.11 (47.96%)
>
> The patches should be ordered so that the prerequisite changes are
> first, then the removal of locks. Looks like there are 3 patches here
> which carry a Fixes tag, for an old commit but in fact IIUC there is no
> bug in those old commits, the problem only appears after the locking is
> removed?
>
> That said please wait for IBM folks to review first before reshuffling
> the patches, I presume the code itself won't change.
>
> Also I still haven't see anyone reply to Al Viro, IIRC he was
> complaining about changes someone from your team has made.
> I consider this a blocker for applying new patches from your team :(
Yes, the approach of replacing the socket needs to be refactored, and I
have been working on the fixes. In case I missed something, you can
check this reply here [1].
[1] https://lore.kernel.org/all/YvTL%2Fsf6lrhuGDuy@TonyMac-Alibaba/
Thanks.
Tony Lu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections
2022-08-27 1:32 ` [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections Jakub Kicinski
2022-08-29 3:25 ` Tony Lu
@ 2022-08-29 3:28 ` D. Wythe
1 sibling, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-08-29 3:28 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: kgraul, wenjia, davem, netdev, linux-s390, linux-rdma
On 8/27/22 9:32 AM, Jakub Kicinski wrote:
> On Fri, 26 Aug 2022 17:51:27 +0800 D. Wythe wrote:
>> This patch set attempts to optimize the parallelism of SMC-R connections,
>> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
>> occur after thoses optimization.
>>
>> According to Off-CPU graph, SMC worker's off-CPU as that:
>>
>> smc_close_passive_work (1.09%)
>> smcr_buf_unuse (1.08%)
>> smc_llc_flow_initiate (1.02%)
>>
>> smc_listen_work (48.17%)
>> __mutex_lock.isra.11 (47.96%)
>
> The patches should be ordered so that the prerequisite changes are
> first, then the removal of locks. Looks like there are 3 patches here
> which carry a Fixes tag, for an old commit but in fact IIUC there is no
> bug in those old commits, the problem only appears after the locking is
> removed?
>
Thank you for your suggestion; this was indeed an oversight on my part.
The first patch with the Fixes tag is indeed a prerequisite for removing the lock,
and it should be placed before it. The issues fixed by the other two patches can
theoretically also occur before the lock removal, but their probability greatly
increases once the lock is removed; I agree they can be placed before as well.
> That said please wait for IBM folks to review first before reshuffling
> the patches, I presume the code itself won't change.
Thanks again for your suggestion. I will reshuffle the patch order after you
have reviewed the whole series.
> Also I still haven't see anyone reply to Al Viro, IIRC he was
> complaining about changes someone from your team has made.
> I consider this a blocker for applying new patches from your team :(
Sorry to bother you and your team; my colleague will explain soon.
Thanks.
D. Wythe
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
2022-08-26 9:51 ` [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
@ 2022-08-29 14:48 ` Jan Karcher
2022-08-31 15:04 ` Jan Karcher
1 sibling, 0 replies; 22+ messages in thread
From: Jan Karcher @ 2022-08-29 14:48 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
On 26.08.2022 11:51, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> This patch attempts to remove the locks named smc_client_lgr_pending and
> smc_server_lgr_pending, which aim to serialize the creation of link
> groups. However, once a link group already exists, those locks are
> meaningless; worse still, they force incoming connections to queue
> one after another.
>
> Now, the creation of link groups is no longer driven by competition,
> but allocated through the following strategy.
>
> 1. Try to find a suitable link group; if this succeeds, the current
> connection is considered a NON-first-contact connection. Done.
>
> 2. Check the number of connections currently waiting for a suitable
> link group to be created; if it is not less than the number of link
> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> increase the number of link groups to be created. The current
> connection is considered the first-contact connection. Done.
>
> 3. Increase the number of connections currently waiting, and wait
> to be woken up.
>
> 4. Decrease the number of connections currently waiting, goto 1.
>
> We wake up the connections that were put to sleep in stage 3 through
> the SMC link state change event. Once a link moves out of the
> SMC_LNK_ACTIVATING state, decrease the number of link groups to
> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> connections.
>
> In the implementation, we introduce the concept of a lnk cluster,
> which is a collection of links with the same characteristics (see
> smcr_lnk_cluster_cmpfn() for details), which makes it possible to
> wake up efficiently in the N vs. 1 scenario.
>
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
> net/smc/af_smc.c | 13 +-
> net/smc/smc_core.c | 352 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
> net/smc/smc_core.h | 53 ++++++++
> net/smc/smc_llc.c | 9 +-
> 4 files changed, 411 insertions(+), 16 deletions(-)
Thank you for the v2.
I'm going to start testing and give you feedback ASAP.
- Jan
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
2022-08-26 9:51 ` [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
2022-08-29 14:48 ` Jan Karcher
@ 2022-08-31 15:04 ` Jan Karcher
2022-09-02 11:25 ` D. Wythe
1 sibling, 1 reply; 22+ messages in thread
From: Jan Karcher @ 2022-08-31 15:04 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
On 26.08.2022 11:51, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> This patch attempts to remove the locks named smc_client_lgr_pending and
> smc_server_lgr_pending, which aim to serialize the creation of link
> groups. However, once a link group already exists, those locks are
> meaningless; worse still, they force incoming connections to queue
> one after another.
>
> Now, the creation of link groups is no longer driven by competition,
> but allocated through the following strategy.
>
> 1. Try to find a suitable link group; if this succeeds, the current
> connection is considered a NON-first-contact connection. Done.
>
> 2. Check the number of connections currently waiting for a suitable
> link group to be created; if it is not less than the number of link
> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
> increase the number of link groups to be created. The current
> connection is considered the first-contact connection. Done.
>
> 3. Increase the number of connections currently waiting, and wait
> to be woken up.
>
> 4. Decrease the number of connections currently waiting, goto 1.
>
> We wake up the connections that were put to sleep in stage 3 through
> the SMC link state change event. Once a link moves out of the
> SMC_LNK_ACTIVATING state, decrease the number of link groups to
> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
> connections.
>
> In the implementation, we introduce the concept of a lnk cluster,
> which is a collection of links with the same characteristics (see
> smcr_lnk_cluster_cmpfn() for details), which makes it possible to
> wake up efficiently in the N vs. 1 scenario.
>
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Hello D.,
thanks for the v2 and the patience.
I got to testing and as with v1 I want to share our findings with you.
If you need more information or want us to look deeper into the findings
please let us know.
Regarding SMC-R test-suite:
We see a refcount underflow during one of our stress tests. This leads
us to believe that the smc_link_cluster_put() / smc_link_cluster_hold()
calls are no longer balanced.
The patch provided by yacan does fix this issue, but we did not verify
whether it is the right way to balance the hold and put calls.
[root@t8345011 ~]# journalctl --dmesg | tail -100
Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown started
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
state changed: SINGLE, pnetid NET25
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
state changed: ASYMMETRIC_PEER, pnetid NET25
Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
state changed: SYMMETRIC, pnetid NET25
Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here
]------------
Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow;
use-after-free.
Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at
lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag
tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev
s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables x_tables
ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes
sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common
pkey zcrypt rng_core autofs4
Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm:
kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01
701 (z/VM 7.2.0)
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events
smc_llc_add_link_work [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000
000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
Aug 31 16:17:55 t8345011.lnxne.boe kernel: R:0 T:1 IO:1 EX:1
Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea
0000000000000027 0000000000000026 000000005c3151e0
Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000fee80000
0000038000000001 000000008e0e9a00 000000008de79c24
Aug 31 16:17:55 t8345011.lnxne.boe kernel: 0000038000000000
000003ff803f05ac 0000000095038360 000000008de79c00
Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000828ca100
0000000095038360 000000005b31f328 0000038000943b50
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 000000005b31f31c: c02000466122  larl  %r2,000000005bbeb560
           000000005b31f322: c0e500232e53  brasl %r14,000000005b784fc8
          #000000005b31f328: af000000      mc    0,0
          >000000005b31f32c: a7280001      lhi   %r2,1
           000000005b31f330: ebeff0a00004  lmg   %r14,%r15,160(%r15)
           000000005b31f336: ec223fbf0055  risbg %r2,%r2,63,191,0
           000000005b31f33c: 07fe          bcr   15,%r14
           000000005b31f33e: 47000700      bc    0,1792
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b31f32c>]
refcount_dec_not_one+0x8c/0xa8
Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>]
refcount_dec_not_one+0x88/0xa8)
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803ef16a>]
smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803f05ac>]
smcr_link_clear+0x5c/0x1b0 [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803fadf4>]
smc_llc_add_link_work+0x43c/0x470 [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f0e2>]
process_one_work+0x1fa/0x478
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f88c>]
worker_thread+0x64/0x468
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac28580>]
kthread+0x108/0x110
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005abaf2dc>]
__ret_from_fork+0x3c/0x58
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b7a4d6a>]
ret_from_fork+0xa/0x40
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b785028>]
__warn_printk+0x60/0x68
Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 0000000000000000 ]---
Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1
link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
[root@t8345011 ~]#
Regarding SMC-D test-suite:
For SMC-D we also see errors during another stress test. While we expect
connections to fall back to TCP due to the limit of parallel connections,
your patch introduces TCP fallbacks with a new reason.
[root@t8345011 ~]# journalctl --dmesg | tail -10
Aug 31 16:30:07 t8345011.lnxne.boe smc-tests:
test_oob7_send_multi_urg_at_start started
Aug 31 16:30:16 t8345011.lnxne.boe smc-tests:
test_oob8_ignore_some_urg_data started
Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second started
Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test
started
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1
link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1
state changed: SINGLE, pnetid NET25
Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP:
Possible SYN flooding on port 51897. Sending cookies. Check SNMP counters.
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1
link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1
state changed: SYMMETRIC, pnetid NET25
^
I am wondering why we see SMC-R dmesgs even if we communicate with
SMC-D. Gotta verify that. Can be an error on our side.
[root@t8345011 ~]#
[root@t8345011 ~]# smcss
ACTIVE 00000 0067005 10.25.45.10:48096 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0067001 10.25.45.10:48060 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0066999 10.25.45.10:48054 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0068762 10.25.45.10:48046 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0066997 10.25.45.10:48044 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0068760 10.25.45.10:48036 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0066995 10.25.45.10:48026 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0068758 10.25.45.10:48024 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0066993 10.25.45.10:48022 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0068756 10.25.45.10:48006 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0066991 10.25.45.10:47998 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0068754 10.25.45.10:47984 10.25.45.11:51897
0000 SMCD
ACTIVE 00000 0067124 10.25.45.11:51897 10.25.45.10:48314
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0067121 10.25.45.11:51897 10.25.45.10:48302
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0067120 10.25.45.11:51897 10.25.45.10:48284
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0067114 10.25.45.11:51897 10.25.45.10:48282
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0067115 10.25.45.11:51897 10.25.45.10:48254
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0067111 10.25.45.11:51897 10.25.45.10:48250
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066415 10.25.45.11:51897 10.25.45.10:48242
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0067113 10.25.45.11:51897 10.25.45.10:48230
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066409 10.25.45.11:51897 10.25.45.10:48202
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066413 10.25.45.11:51897 10.25.45.10:48214
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066414 10.25.45.11:51897 10.25.45.10:48204
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066397 10.25.45.11:51897 10.25.45.10:48120
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066399 10.25.45.11:51897 10.25.45.10:48084
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0066396 10.25.45.11:51897 10.25.45.10:48078
0000 TCP 0x05000000/0x030d0000
ACTIVE 00000 0062632 10.25.45.11:51897 10.25.45.10:43120
0000 TCP 0x03010000
ACTIVE 00000 0062631 10.25.45.11:51897 10.25.45.10:43134
0000 TCP 0x03010000
ACTIVE 00000 0062626 10.25.45.11:51897 10.25.45.10:43106
0000 TCP 0x03010000
ACTIVE 00000 0062625 10.25.45.11:51897 10.25.45.10:43138
0000 TCP 0x03010000
ACTIVE 00000 0062621 10.25.45.11:51897 10.25.45.10:43160
0000 TCP 0x03010000
ACTIVE 00000 0061580 10.25.45.11:51897 10.25.45.10:42820
0000 TCP 0x03010000
ACTIVE 00000 0061558 10.25.45.11:51897 10.25.45.10:42792
0000 TCP 0x03010000
ACTIVE 00000 0061549 10.25.45.11:51897 10.25.45.10:42816
0000 TCP 0x03010000
ACTIVE 00000 0061548 10.25.45.11:51897 10.25.45.10:42764
0000 TCP 0x03010000
ACTIVE 00000 0061544 10.25.45.11:51897 10.25.45.10:42804
0000 TCP 0x03010000
ACTIVE 00000 0061543 10.25.45.11:51897 10.25.45.10:42856
0000 TCP 0x03010000
ACTIVE 00000 0061542 10.25.45.11:51897 10.25.45.10:42756
0000 TCP 0x03010000
ACTIVE 00000 0062554 10.25.45.11:51897 10.25.45.10:42852
0000 TCP 0x03010000
ACTIVE 00000 0062553 10.25.45.11:51897 10.25.45.10:42844
0000 TCP 0x03010000
ACTIVE 00000 0062549 10.25.45.11:51897 10.25.45.10:42836
0000 TCP 0x03010000
^
Here SMCD and 0x05000000/0x030d0000 are expected. But:
[353] smcss confirmed connection of type SMCD
[353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
We also experience that the lsmod use count stays above 2 even after the
test case has finished, and it takes quite a while before it goes down
again (we send a kill signal at the end of our test case).
During test (which is fine)
[root@t8345011 ~]# lsmod | grep smc
smc_diag 16384 0
smc 225280 2981 ism,smc_diag
ib_core 413696 3 smc,ib_uverbs,mlx5_ib
Count > 2 even after tests finish!
[root@t8345011 ~]# lsmod | grep smc
smc_diag 16384 0
smc 225280 40 ism,smc_diag
ib_core 413696 3 smc,ib_uverbs,mlx5_ib
Let us know if you need any more information.
Thanks, Jan
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
2022-08-31 15:04 ` Jan Karcher
@ 2022-09-02 11:25 ` D. Wythe
2022-09-07 8:10 ` Jan Karcher
0 siblings, 1 reply; 22+ messages in thread
From: D. Wythe @ 2022-09-02 11:25 UTC (permalink / raw)
To: Jan Karcher, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
On 8/31/22 11:04 PM, Jan Karcher wrote:
>
>
> On 26.08.2022 11:51, D. Wythe wrote:
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> This patch attempts to remove locks named smc_client_lgr_pending and
>> smc_server_lgr_pending, which aim to serialize the creation of link
>> group. However, once a link group already exists, those locks are
>> meaningless; worse still, they force incoming connections to be
>> queued one after the other.
>>
>> Now, the creation of a link group is no longer driven by competition,
>> but is allocated through the following strategy.
>>
>> 1. Try to find a suitable link group; if successful, the current
>> connection is considered a NON-first-contact connection. Done.
>>
>> 2. Check the number of connections currently waiting for a suitable
>> link group to be created; if it is not less than the number of link
>> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
>> increase the number of link groups to be created; the current
>> connection is considered the first-contact connection. Done.
>>
>> 3. Increase the number of connections currently waiting, and wait
>> to be woken up.
>>
>> 4. Decrease the number of connections currently waiting, goto 1.
>>
>> We wake up the connection that was put to sleep in stage 3 through
>> the SMC link state change event. Once the link moves out of the
>> SMC_LNK_ACTIVATING state, decrease the number of link groups to
>> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
>> connections.
>>
>> In the implementation, we introduce the concept of a lnk cluster, which is
>> a collection of links with the same characteristics (see
>> smcr_lnk_cluster_cmpfn() for more details), which makes it possible to
>> wake up efficiently in the N vs. 1 scenario.
>>
>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>
> Hello D.,
>
> thanks for the v2 and the patience.
> I got to testing and as with v1 I want to share our findings with you. If you need more information or want us to look deeper into the findings please let us know.
>
> Regarding SMC-R test-suite:
> We see a refcount error during one of our stress tests. This lets us believe that the smc_link_cluster_put() to smc_link_cluster_hold() ratio is not right anymore.
> The patch provided by yacan does fix this issue but we did not verify if it is the right way to balance the hold and put calls.
>
> [root@t8345011 ~]# journalctl --dmesg | tail -100
> Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown started
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SINGLE, pnetid NET25
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: ASYMMETRIC_PEER, pnetid NET25
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SYMMETRIC, pnetid NET25
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here ]------------
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow; use-after-free.
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm: kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events smc_llc_add_link_work [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea 0000000000000027 0000000000000026 000000005c3151e0
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000fee80000 0000038000000001 000000008e0e9a00 000000008de79c24
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 0000038000000000 000003ff803f05ac 0000000095038360 000000008de79c00
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000828ca100 0000000095038360 000000005b31f328 0000038000943b50
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 000000005b31f31c: c02000466122 larl %r2,000000005bbeb560
> 000000005b31f322: c0e500232e53 brasl %r14,000000005b784fc8
> #000000005b31f328: af000000 mc 0,0
> >000000005b31f32c: a7280001 lhi %r2,1
> 000000005b31f330: ebeff0a00004 lmg %r14,%r15,160(%r15)
> 000000005b31f336: ec223fbf0055 risbg %r2,%r2,63,191,0
> 000000005b31f33c: 07fe bcr 15,%r14
> 000000005b31f33e: 47000700 bc 0,1792
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b31f32c>] refcount_dec_not_one+0x8c/0xa8
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>] refcount_dec_not_one+0x88/0xa8)
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803ef16a>] smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803f05ac>] smcr_link_clear+0x5c/0x1b0 [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803fadf4>] smc_llc_add_link_work+0x43c/0x470 [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f0e2>] process_one_work+0x1fa/0x478
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f88c>] worker_thread+0x64/0x468
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac28580>] kthread+0x108/0x110
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005abaf2dc>] __ret_from_fork+0x3c/0x58
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b7a4d6a>] ret_from_fork+0xa/0x40
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b785028>] __warn_printk+0x60/0x68
Thank you for your testing. I need to think about it; please give me some time.
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 0000000000000000 ]---
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
> [root@t8345011 ~]#
>
>
>
> Regarding SMC-D test-suite:
> For SMC-D we also see errors during another stress test. While we expect connections to fall back to TCP due to the limit of parallel connections your patch introduces TCP fallbacks with a new reason.
>
> [root@t8345011 ~]# journalctl --dmesg | tail -10
> Aug 31 16:30:07 t8345011.lnxne.boe smc-tests: test_oob7_send_multi_urg_at_start started
> Aug 31 16:30:16 t8345011.lnxne.boe smc-tests: test_oob8_ignore_some_urg_data started
> Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second started
> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test started
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SINGLE, pnetid NET25
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP: Possible SYN flooding on port 51897. Sending cookies. Check SNMP counters.
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SYMMETRIC, pnetid NET25
>
> ^
> I am wondering why we see SMC-R dmesgs even if we communicate with SMC-D. Gotta verify that. Can be an error on our side.
This is very weird. Were there no such SMC-R dmesgs before applying my PATCH?
I am not sure whether there is logic to downgrade SMC-D to SMC-R; maybe it is related to 0x03010000.
I need to check the code; I will send out the reason as soon as possible.
> [root@t8345011 ~]#
> [root@t8345011 ~]# smcss
> ACTIVE 00000 0067005 10.25.45.10:48096 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0067001 10.25.45.10:48060 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0066999 10.25.45.10:48054 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0068762 10.25.45.10:48046 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0066997 10.25.45.10:48044 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0068760 10.25.45.10:48036 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0066995 10.25.45.10:48026 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0068758 10.25.45.10:48024 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0066993 10.25.45.10:48022 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0068756 10.25.45.10:48006 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0066991 10.25.45.10:47998 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0068754 10.25.45.10:47984 10.25.45.11:51897 0000 SMCD
> ACTIVE 00000 0067124 10.25.45.11:51897 10.25.45.10:48314 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0067121 10.25.45.11:51897 10.25.45.10:48302 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0067120 10.25.45.11:51897 10.25.45.10:48284 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0067114 10.25.45.11:51897 10.25.45.10:48282 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0067115 10.25.45.11:51897 10.25.45.10:48254 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0067111 10.25.45.11:51897 10.25.45.10:48250 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066415 10.25.45.11:51897 10.25.45.10:48242 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0067113 10.25.45.11:51897 10.25.45.10:48230 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066409 10.25.45.11:51897 10.25.45.10:48202 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066413 10.25.45.11:51897 10.25.45.10:48214 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066414 10.25.45.11:51897 10.25.45.10:48204 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066397 10.25.45.11:51897 10.25.45.10:48120 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066399 10.25.45.11:51897 10.25.45.10:48084 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0066396 10.25.45.11:51897 10.25.45.10:48078 0000 TCP 0x05000000/0x030d0000
> ACTIVE 00000 0062632 10.25.45.11:51897 10.25.45.10:43120 0000 TCP 0x03010000
> ACTIVE 00000 0062631 10.25.45.11:51897 10.25.45.10:43134 0000 TCP 0x03010000
> ACTIVE 00000 0062626 10.25.45.11:51897 10.25.45.10:43106 0000 TCP 0x03010000
> ACTIVE 00000 0062625 10.25.45.11:51897 10.25.45.10:43138 0000 TCP 0x03010000
> ACTIVE 00000 0062621 10.25.45.11:51897 10.25.45.10:43160 0000 TCP 0x03010000
> ACTIVE 00000 0061580 10.25.45.11:51897 10.25.45.10:42820 0000 TCP 0x03010000
> ACTIVE 00000 0061558 10.25.45.11:51897 10.25.45.10:42792 0000 TCP 0x03010000
> ACTIVE 00000 0061549 10.25.45.11:51897 10.25.45.10:42816 0000 TCP 0x03010000
> ACTIVE 00000 0061548 10.25.45.11:51897 10.25.45.10:42764 0000 TCP 0x03010000
> ACTIVE 00000 0061544 10.25.45.11:51897 10.25.45.10:42804 0000 TCP 0x03010000
> ACTIVE 00000 0061543 10.25.45.11:51897 10.25.45.10:42856 0000 TCP 0x03010000
> ACTIVE 00000 0061542 10.25.45.11:51897 10.25.45.10:42756 0000 TCP 0x03010000
> ACTIVE 00000 0062554 10.25.45.11:51897 10.25.45.10:42852 0000 TCP 0x03010000
> ACTIVE 00000 0062553 10.25.45.11:51897 10.25.45.10:42844 0000 TCP 0x03010000
> ACTIVE 00000 0062549 10.25.45.11:51897 10.25.45.10:42836 0000 TCP 0x03010000
>
> ^
> Here SMCD and 0x05000000/0x030d0000 are expected. But:
> [353] smcss confirmed connection of type SMCD
> [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
sysctl -w net.ipv4.tcp_syncookies=0
Can you retry your test with the above setting? When TCP detects a potential flooding attack,
it starts sending SYN cookies to verify traffic. In that case, SMC can't work, which then
triggers a fallback with error code 0x03010000.
This doesn't seem to be a problem my PATCH itself can cause, but my PATCH removes the lock in
the handshake phase, which may increase the rate at which your test initiates connections.
But I can't be sure ...
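If it helps, the SYN-cookie theory can be checked from the kernel's SNMP counters before flipping the sysctl. A rough sketch, using only standard Linux procfs paths (the exact counter layout in /proc/net/netstat can vary a little between kernel versions):

```shell
# Current setting: 1 means the kernel may answer with SYN cookies under backlog pressure
cur=$(cat /proc/sys/net/ipv4/tcp_syncookies 2>/dev/null || echo unknown)
echo "tcp_syncookies=$cur"

# SyncookiesSent/SyncookiesRecv in /proc/net/netstat count cookies actually used;
# a nonzero delta across the test run would support the fallback theory.
grep '^TcpExt:' /proc/net/netstat 2>/dev/null | head -2

# For the retest only (needs root; remember to re-enable afterwards):
# sysctl -w net.ipv4.tcp_syncookies=0
```

tcp_syncookies, /proc/net/netstat, and the TcpExt counters are standard kernel interfaces; only the interpretation (tying a counter delta to the 0x03010000 fallbacks) is my assumption.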
> We also exeperience that the lsmod count stays above 2 even after the testcase finished and takes quite a while before it goes down again (we send a kill signal at the end of our testcase).
>
> During test (which is fine)
>
> [root@t8345011 ~]# lsmod | grep smc
> smc_diag 16384 0
> smc 225280 2981 ism,smc_diag
> ib_core 413696 3 smc,ib_uverbs,mlx5_ib
>
> Count > 2 even after tests finish!
>
> [root@t8345011 ~]# lsmod | grep smc
> smc_diag 16384 0
> smc 225280 40 ism,smc_diag
> ib_core 413696 3 smc,ib_uverbs,mlx5_ib
> Let us know if you need any more information.
> Thanks, Jan
This usually means that there are still connections that have not really been destroyed.
Can you try the following to see whether any connections remain?
smcd linkgroup; # or smcr linkgroup, as appropriate; if any remain, please show us the connection state (smcss -r or -d)
ps aux | grep D; # check whether any worker thread hangs; if so, please show us its /proc/$PID/stack.
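The checks above could be bundled into one small script to make the report easier to reproduce. A sketch, assuming the smc-tools binaries (smcd/smcr/smcss) are installed; /sys/module/smc/refcnt is the same counter that lsmod prints in its third column:

```shell
#!/bin/sh
# Module use count; mirrors `lsmod | grep smc`
echo "smc refcnt: $(cat /sys/module/smc/refcnt 2>/dev/null || echo n/a)"

# Remaining link groups and per-connection state (smc-tools)
command -v smcd  >/dev/null 2>&1 && smcd linkgroup
command -v smcr  >/dev/null 2>&1 && smcr linkgroup
command -v smcss >/dev/null 2>&1 && { smcss -d; smcss -r; }

# Kernel stacks of tasks stuck in uninterruptible sleep (D state); needs root
for pid in $(ps -eo pid=,stat= 2>/dev/null | awk '$2 ~ /^D/ {print $1}'); do
    echo "== PID $pid =="
    cat "/proc/$pid/stack" 2>/dev/null
done
```

Note that `ps aux | grep D` also matches processes whose command line merely contains a "D"; filtering on the stat column, as above, only reports tasks actually in D state.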
D. Wythe
Thanks.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
2022-09-02 11:25 ` D. Wythe
@ 2022-09-07 8:10 ` Jan Karcher
2022-09-16 5:16 ` D. Wythe
0 siblings, 1 reply; 22+ messages in thread
From: Jan Karcher @ 2022-09-07 8:10 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
On 02.09.2022 13:25, D. Wythe wrote:
>
>
> On 8/31/22 11:04 PM, Jan Karcher wrote:
>>
>>
>> On 26.08.2022 11:51, D. Wythe wrote:
>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>
>>> This patch attempts to remove locks named smc_client_lgr_pending and
>>> smc_server_lgr_pending, which aim to serialize the creation of link
>>> group. However, once a link group already exists, those locks are
>>> meaningless; worse still, they force incoming connections to be
>>> queued one after the other.
>>>
>>> Now, the creation of a link group is no longer driven by competition,
>>> but is allocated through the following strategy.
>>>
>>> 1. Try to find a suitable link group; if successful, the current
>>> connection is considered a NON-first-contact connection. Done.
>>>
>>> 2. Check the number of connections currently waiting for a suitable
>>> link group to be created; if it is not less than the number of link
>>> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
>>> increase the number of link groups to be created; the current
>>> connection is considered the first-contact connection. Done.
>>>
>>> 3. Increase the number of connections currently waiting, and wait
>>> to be woken up.
>>>
>>> 4. Decrease the number of connections currently waiting, goto 1.
>>>
>>> We wake up the connection that was put to sleep in stage 3 through
>>> the SMC link state change event. Once the link moves out of the
>>> SMC_LNK_ACTIVATING state, decrease the number of link groups to
>>> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
>>> connections.
>>>
>>> In the implementation, we introduce the concept of a lnk cluster, which is
>>> a collection of links with the same characteristics (see
>>> smcr_lnk_cluster_cmpfn() for more details), which makes it possible to
>>> wake up efficiently in the N vs. 1 scenario.
>>>
>>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>>
>> Hello D.,
>>
>> thanks for the v2 and the patience.
>> I got to testing and as with v1 I want to share our findings with you.
>> If you need more information or want us to look deeper into the
>> findings please let us know.
>>
>> Regarding SMC-R test-suite:
>> We see a refcount error during one of our stress tests. This lets us
>> believe that the smc_link_cluster_put() to smc_link_cluster_hold()
>> ratio is not right anymore.
>> The patch provided by yacan does fix this issue but we did not verify
>> if it is the right way to balance the hold and put calls.
>>
>> [root@t8345011 ~]# journalctl --dmesg | tail -100
>> Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown
>> started
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 state changed: SINGLE, pnetid NET25
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 state changed: ASYMMETRIC_PEER, pnetid NET25
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 state changed: SYMMETRIC, pnetid NET25
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here
>> ]------------
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow;
>> use-after-free.
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at
>> lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag
>> tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
>> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
>> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
>> nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev
>> s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables
>> x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core
>> des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390
>> sha1_s390 sha_common pkey zcrypt rng_core autofs4
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm:
>> kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01
>> 701 (z/VM 7.2.0)
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events
>> smc_llc_add_link_work [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000
>> 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: R:0 T:1 IO:1
>> EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea
>> 0000000000000027 0000000000000026 000000005c3151e0
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000fee80000
>> 0000038000000001 000000008e0e9a00 000000008de79c24
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 0000038000000000
>> 000003ff803f05ac 0000000095038360 000000008de79c00
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000828ca100
>> 0000000095038360 000000005b31f328 0000038000943b50
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code:
>> 000000005b31f31c: c02000466122 larl %r2,000000005bbeb560
>>
>> 000000005b31f322: c0e500232e53 brasl %r14,000000005b784fc8
>>
>> #000000005b31f328: af000000 mc 0,0
>>
>> >000000005b31f32c: a7280001 lhi %r2,1
>>
>> 000000005b31f330: ebeff0a00004 lmg %r14,%r15,160(%r15)
>>
>> 000000005b31f336: ec223fbf0055 risbg %r2,%r2,63,191,0
>>
>> 000000005b31f33c: 07fe bcr 15,%r14
>>
>> 000000005b31f33e: 47000700 bc 0,1792
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b31f32c>]
>> refcount_dec_not_one+0x8c/0xa8
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>]
>> refcount_dec_not_one+0x88/0xa8)
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803ef16a>]
>> smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803f05ac>]
>> smcr_link_clear+0x5c/0x1b0 [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803fadf4>]
>> smc_llc_add_link_work+0x43c/0x470 [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f0e2>]
>> process_one_work+0x1fa/0x478
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f88c>]
>> worker_thread+0x64/0x468
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac28580>]
>> kthread+0x108/0x110
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005abaf2dc>]
>> __ret_from_fork+0x3c/0x58
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b7a4d6a>]
>> ret_from_fork+0xa/0x40
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b785028>]
>> __warn_printk+0x60/0x68
>
> Thank you for your testing. I need to think about it; please give me
> some time.
>
>
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace
>> 0000000000000000 ]---
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net
>> 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>> [root@t8345011 ~]#
>>
>>
>>
>> Regarding SMC-D test-suite:
>> For SMC-D we also see errors during another stress test. While we
>> expect connections to fall back to TCP due to the limit of parallel
>> connections your patch introduces TCP fallbacks with a new reason.
>>
>> [root@t8345011 ~]# journalctl --dmesg | tail -10
>> Aug 31 16:30:07 t8345011.lnxne.boe smc-tests:
>> test_oob7_send_multi_urg_at_start started
>> Aug 31 16:30:16 t8345011.lnxne.boe smc-tests:
>> test_oob8_ignore_some_urg_data started
>> Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second
>> started
>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test
>> started
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net
>> 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net
>> 1 state changed: SINGLE, pnetid NET25
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP:
>> Possible SYN flooding on port 51897. Sending cookies. Check SNMP
>> counters.
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net
>> 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net
>> 1 state changed: SYMMETRIC, pnetid NET25
>>
>> ^
>> I am wondering why we see SMC-R dmesgs even if we communicate with
>> SMC-D. Gotta verify that. Can be an error on our side.
>
> This is very weird. Were there no such SMC-R dmesgs before applying my PATCH?
>
> I am not sure whether there is logic to downgrade SMC-D to SMC-R; maybe
> it is related to 0x03010000.
> I need to check the code; I will send out the reason as soon as possible
>
>
>> [root@t8345011 ~]#
>> [root@t8345011 ~]# smcss
>> ACTIVE 00000 0067005 10.25.45.10:48096 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0067001 10.25.45.10:48060 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0066999 10.25.45.10:48054 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0068762 10.25.45.10:48046 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0066997 10.25.45.10:48044 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0068760 10.25.45.10:48036 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0066995 10.25.45.10:48026 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0068758 10.25.45.10:48024 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0066993 10.25.45.10:48022 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0068756 10.25.45.10:48006 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0066991 10.25.45.10:47998 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0068754 10.25.45.10:47984 10.25.45.11:51897 0000 SMCD
>> ACTIVE 00000 0067124 10.25.45.11:51897 10.25.45.10:48314 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0067121 10.25.45.11:51897 10.25.45.10:48302 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0067120 10.25.45.11:51897 10.25.45.10:48284 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0067114 10.25.45.11:51897 10.25.45.10:48282 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0067115 10.25.45.11:51897 10.25.45.10:48254 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0067111 10.25.45.11:51897 10.25.45.10:48250 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066415 10.25.45.11:51897 10.25.45.10:48242 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0067113 10.25.45.11:51897 10.25.45.10:48230 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066409 10.25.45.11:51897 10.25.45.10:48202 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066413 10.25.45.11:51897 10.25.45.10:48214 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066414 10.25.45.11:51897 10.25.45.10:48204 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066397 10.25.45.11:51897 10.25.45.10:48120 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066399 10.25.45.11:51897 10.25.45.10:48084 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0066396 10.25.45.11:51897 10.25.45.10:48078 0000 TCP 0x05000000/0x030d0000
>> ACTIVE 00000 0062632 10.25.45.11:51897 10.25.45.10:43120 0000 TCP 0x03010000
>> ACTIVE 00000 0062631 10.25.45.11:51897 10.25.45.10:43134 0000 TCP 0x03010000
>> ACTIVE 00000 0062626 10.25.45.11:51897 10.25.45.10:43106 0000 TCP 0x03010000
>> ACTIVE 00000 0062625 10.25.45.11:51897 10.25.45.10:43138 0000 TCP 0x03010000
>> ACTIVE 00000 0062621 10.25.45.11:51897 10.25.45.10:43160 0000 TCP 0x03010000
>> ACTIVE 00000 0061580 10.25.45.11:51897 10.25.45.10:42820 0000 TCP 0x03010000
>> ACTIVE 00000 0061558 10.25.45.11:51897 10.25.45.10:42792 0000 TCP 0x03010000
>> ACTIVE 00000 0061549 10.25.45.11:51897 10.25.45.10:42816 0000 TCP 0x03010000
>> ACTIVE 00000 0061548 10.25.45.11:51897 10.25.45.10:42764 0000 TCP 0x03010000
>> ACTIVE 00000 0061544 10.25.45.11:51897 10.25.45.10:42804 0000 TCP 0x03010000
>> ACTIVE 00000 0061543 10.25.45.11:51897 10.25.45.10:42856 0000 TCP 0x03010000
>> ACTIVE 00000 0061542 10.25.45.11:51897 10.25.45.10:42756 0000 TCP 0x03010000
>> ACTIVE 00000 0062554 10.25.45.11:51897 10.25.45.10:42852 0000 TCP 0x03010000
>> ACTIVE 00000 0062553 10.25.45.11:51897 10.25.45.10:42844 0000 TCP 0x03010000
>> ACTIVE 00000 0062549 10.25.45.11:51897 10.25.45.10:42836 0000 TCP 0x03010000
>>
>> ^
>> Here SMCD and 0x05000000/0x030d0000 are expected. But:
>> [353] smcss confirmed connection of type SMCD
>> [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
> sysctl -w net.ipv4.tcp_syncookies=0
>
> Can you retry your test after setting the above configuration? When TCP
> detects a potential flooding attack, it starts sending SYN cookies to
> verify traffic. In this case, SMC can't work, which then triggers a
> fallback with error code 0x03010000.
>
> This doesn't seem to be a problem that my PATCH can cause, but my PATCH
> removes the lock in the handshake phase, which may increase the rate at
> which your test initiates connections.
> But I can't be sure ...
>
>
>> We also experience that the lsmod count stays above 2 even after the
>> testcase finished, and it takes quite a while before it goes down again
>> (we send a kill signal at the end of our testcase).
>
>>
>> During test (which is fine)
>>
>> [root@t8345011 ~]# lsmod | grep smc
>> smc_diag 16384 0
>> smc 225280 2981 ism,smc_diag
>> ib_core 413696 3 smc,ib_uverbs,mlx5_ib
>>
>> Count > 2 even after tests finish!
>>
>> [root@t8345011 ~]# lsmod | grep smc
>> smc_diag 16384 0
>> smc 225280 40 ism,smc_diag
>> ib_core 413696 3 smc,ib_uverbs,mlx5_ib
>
>> Let us know if you need any more information.
>> Thanks, Jan
>
>
> This usually means that there are still connections that were not really
> destroyed.
> Can you try the following to see if there are any remaining connections?
>
> smcd linkgroup; # or smcr, it depends; if any, can you show us the
> connection state (smcss -r or -d)
>
> ps aux | grep D; # check if any work threads are hung; if any, please
> show us the /proc/$PID/stack.
>
>
> D. Wythe
> Thanks.
>
Thank you for the tip with the syncookies. We disabled them on both
systems, and here is the new output, which should be less confusing.
For your understanding of the output:
t8345010 is the system driving the tests and is acting as the client in
this testcase.
t8345011 is the pair system and the server of this test.
What we are doing is spawning a lot of connections between the two
systems to see what happens when there is stress (in terms of
connection handling) on the system.
We see the following:
- The driver falls back to SMC-R on many occasions. This should not be.
Also note the mismatch in the number of connections handled. There were
no other connections besides the test.
T8345010
> SMC-D Connections Summary
> Total connections handled 1012
> SMC-R Connections Summary
> Total connections handled 1512
T8345011
> SMC-D Connections Summary
> Total connections handled 1190
> SMC-R Connections Summary
> Total connections handled 1513
- Linkgroups for the SMCD & SMCR connections are being built up.
T8345011
> [root@t8345011 ~]# smcd linkgroup
> LG-ID VLAN #Conns PNET-ID
> 00000300 0 37 NET25
> [root@t8345011 ~]# smcr linkgroup
> LG-ID LG-Role LG-Type VLAN #Conns PNET-ID
> 00000400 SERV SYM 0 0 NET25
> [ 5 more LG 0500-0900]
- Linkgroups for the SMCD & SMCR connections are being torn down once
the clients finish.
- ALL SMCR linkgroups are cleared completely as expected. They
still reside empty for a while, which is fine.
- The SMCD linkgroups are NOT cleared all the way. A few connections
stay in there (See output above).
- If we perform smcss on the server side, those connections are listed
there as ACTIVE, while the smcss list on the client side is empty.
T8345011
> [root@t8345011 ~]# smcss
> State UID Inode Local Address Peer Address Intf Mode
> ACTIVE 00000 0100758 10.25.45.11:40237 10.25.45.10:55790 0000 SMCD
> [ 36 more ACTIVE connections ]
- The remaining ACTIVE connections on the server are reflected in the
smcd linkgroup #Conns as well.
- On the client, the lsmod count for the smc module is 39, also
reflecting the leftover connections.
T8345010
> [root@t8345010 tela-kernel]# lsmod |grep smc
> smc 225280 39 ism,smc_diag
- On the server the lsmod count for the smc module is 79.
T8345011
> [root@t8345011 ~]# lsmod | grep smc
> smc 225280 79 ism,smc_diag
- The most important smc_dbg outputs are provided and show that
the client is pretty clean while the server is still handling ghost
connections.
T8345011
> [root@t8345011 ~]# smc_dbg
> State UID Inode Local Address Peer Address Intf Mode GID Token Peer-GID Peer-Token Linkid
> ACTIVE 00000 0100758 10.25.45.11:40237 10.25.45.10:55790 0000 SMCD 120014a12e488561 0000890fd0000000 3e0014a32e488561 00008a0bd0000000 00000300
> State UID Inode Local Address Peer Address Intf Mode Shutd Token Sndbuf Rcvbuf Peerbuf rxprod-Cursor rxcons-Cursor rxFlags txprod-Cursor txcons-Cursor txFlags txprep-Cursor txsent-Cursor txfin-Cursor
> ACTIVE 00000 0100758 10.25.45.11:40237 10.25.45.10:55790 0000 SMCD <-> 00001611 00004000 0000ffe0 0000ffe0 0000:00000000 0000:00000000 00:00 0000:00000000 0000:00000000 00:00 0000:00000000 0000:00000000 0000:00000000
- Via netstat we see that the server is in a CLOSE_WAIT state for the
connections and the client in a FIN_WAIT2
T8345010
> [root@t8345010 tela-kernel]# netstat -nta
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 0 10.25.45.10:55790 10.25.45.11:40237 FIN_WAIT2
T8345011
> [root@t8345011 ~]# netstat -nta | grep "40237"
> tcp 1 0 10.25.45.11:40237 10.25.45.10:55790 CLOSE_WAIT
While I'm pretty new to the mailing list, we had a discussion about how
to provide the log data in a reasonable way.
To prevent too much information, we decided to go for the short output
above. If that is not enough for you, shoot me a message and I can send
you the full output outside the mailing list.
If you have any ideas on how to provide larger output in a reasonable
way, feel free to share your opinion.
I hope the new output helps you locating the error.
Feel free to contact us in case you have questions.
- Jan
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 10/10] net/smc: fix application data exception
2022-08-26 9:51 ` [PATCH net-next v2 10/10] net/smc: fix application data exception D. Wythe
@ 2022-09-08 9:37 ` Wen Gu
2022-09-16 5:24 ` D. Wythe
0 siblings, 1 reply; 22+ messages in thread
From: Wen Gu @ 2022-09-08 9:37 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
On 2022/8/26 17:51, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> After we optimized the parallel capability of SMC-R connection
> establishment, there is a certain probability that the following
> exceptions will occur in the wrk benchmark test:
>
> Running 10s test @ http://11.213.45.6:80
> 8 threads and 64 connections
> Thread Stats Avg Stdev Max +/- Stdev
> Latency 3.72ms 13.94ms 245.33ms 94.17%
> Req/Sec 1.96k 713.67 5.41k 75.16%
> 155262 requests in 10.10s, 23.10MB read
> Non-2xx or 3xx responses: 3
>
> We found that the error is an HTTP 400 error, which is a serious
> exception in our test; it means the application data was
> corrupted.
>
> Consider the following scenarios:
>
> CPU0 CPU1
>
> buf_desc->used = 0;
> cmpxchg(buf_desc->used, 0, 1)
> deal_with(buf_desc)
>
> memset(buf_desc->cpu_addr,0);
>
> This will cause the data received by a victim connection to be cleared,
> thus triggering an HTTP 400 error in the server.
>
> This patch swaps the order of clearing 'used' and the memset, and adds
> a barrier to ensure memory consistency.
>
> Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer")
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
> net/smc/smc_core.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 84bf84c..fdad953 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1380,8 +1380,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>
> smc_buf_free(lgr, is_rmb, buf_desc);
> } else {
> - buf_desc->used = 0;
> - memset(buf_desc->cpu_addr, 0, buf_desc->len);
> + /* memzero_explicit provides potential memory barrier semantics */
> + memzero_explicit(buf_desc->cpu_addr, buf_desc->len);
> + WRITE_ONCE(buf_desc->used, 0);
> }
> }
>
It seems that the same issue exists in smc_buf_unuse(). Maybe it also needs to be fixed?
static void smc_buf_unuse(struct smc_connection *conn,
struct smc_link_group *lgr)
{
if (conn->sndbuf_desc) {
if (!lgr->is_smcd && conn->sndbuf_desc->is_vm) {
smcr_buf_unuse(conn->sndbuf_desc, false, lgr);
} else {
conn->sndbuf_desc->used = 0;
memset(conn->sndbuf_desc->cpu_addr, 0,
conn->sndbuf_desc->len);
^...................
}
}
if (conn->rmb_desc) {
if (!lgr->is_smcd) {
smcr_buf_unuse(conn->rmb_desc, true, lgr);
} else {
conn->rmb_desc->used = 0;
memset(conn->rmb_desc->cpu_addr, 0,
conn->rmb_desc->len +
sizeof(struct smcd_cdc_msg));
^...................
}
}
}
Thanks,
Wen Gu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
` (10 preceding siblings ...)
2022-08-27 1:32 ` [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections Jakub Kicinski
@ 2022-09-09 6:59 ` Jan Karcher
11 siblings, 0 replies; 22+ messages in thread
From: Jan Karcher @ 2022-09-09 6:59 UTC (permalink / raw)
To: D. Wythe, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
On 26.08.2022 11:51, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
>
> This patch set attempts to optimize the parallelism of SMC-R connections,
> mainly to reduce unnecessary blocking on locks, and to fix exceptions that
> occur after those optimizations.
>
> According to the Off-CPU graph, the SMC workers' off-CPU time breaks down as follows:
>
> smc_close_passive_work (1.09%)
> smcr_buf_unuse (1.08%)
> smc_llc_flow_initiate (1.02%)
>
> smc_listen_work (48.17%)
> __mutex_lock.isra.11 (47.96%)
>
>
> An ideal SMC-R connection process should only block on the IO events
> of the network, but it's quite clear that the SMC-R connection is now
> queued on locks most of the time.
>
> The goal of this patchset is to achieve our ideal situation, where
> connections block only on network IO events for the majority of their lifetime.
>
> There are three big locks here:
>
> 1. smc_client_lgr_pending & smc_server_lgr_pending
>
> 2. llc_conf_mutex
>
> 3. rmbs_lock & sndbufs_lock
>
> And an implementation issue:
>
> 1. confirm/delete rkey messages can't be sent concurrently, while
> the protocol indeed allows it.
>
> Unfortunately, the above problems together affect the parallelism of
> SMC-R connections. If any of them is left unsolved, our goal cannot
> be achieved.
>
> After this patch set, we can get a quite ideal off-CPU graph as
> following:
>
> smc_close_passive_work (41.58%)
> smcr_buf_unuse (41.57%)
> smc_llc_do_delete_rkey (41.57%)
>
> smc_listen_work (39.10%)
> smc_clc_wait_msg (13.18%)
> tcp_recvmsg_locked (13.18%)
> smc_listen_find_device (25.87%)
> smcr_lgr_reg_rmbs (25.87%)
> smc_llc_do_confirm_rkey (25.87%)
>
> We can see that most of the waiting time is spent waiting for network IO
> events. This also brings a certain performance improvement to our
> short-lived connection wrk/nginx benchmark test:
>
> +--------------+------+------+-------+--------+------+--------+
> |conns/qps |c4 | c8 | c16 | c32 | c64 | c200 |
> +--------------+------+------+-------+--------+------+--------+
> |SMC-R before |9.7k | 10k | 10k | 9.9k | 9.1k | 8.9k |
> +--------------+------+------+-------+--------+------+--------+
> |SMC-R now |13k | 19k | 18k | 16k | 15k | 12k |
> +--------------+------+------+-------+--------+------+--------+
> |TCP |15k | 35k | 51k | 80k | 100k | 162k |
> +--------------+------+------+-------+--------+------+--------+
>
> The reason the benefit is less obvious as the number of connections
> increases is due to the workqueue. If we change the workqueue to UNBOUND,
> we can obtain at least a 4-5x performance improvement, reaching up to half
> of TCP. However, this is not an elegant solution; optimizing it
> will be much more complicated. But in any case, we will submit the relevant
> optimization patches as soon as possible.
>
> Please note that the premise here is that the lock-related problems
> must be solved first; otherwise, no matter how we optimize the workqueue,
> there won't be much improvement.
>
> Because there are a lot of related changes to the code, if you have
> any questions or suggestions, please let me know.
>
> Thanks
> D. Wythe
>
> v1 -> v2:
>
> 1. Fix panic in SMC-D scenario
> 2. Fix lnkc related hashfn calculation exception, caused by operator
> priority.
> 3. Remove -EBUSY processing of rhashtable_insert_fast, see more details
> in comments around smcr_link_get_or_create_cluster().
> 4. Only wake up one connection if the link has not been active.
> 5. Delete obsolete unlock logic in smc_listen_work().
> 6. PATCH format, do Reverse Christmas tree.
> 7. PATCH format, change all xxx_lnk_xxx function to xxx_link_xxx.
> 8. PATCH format, add correct fix tag for the patches for fixes.
> 9. PATCH format, fix some spelling error.
> 10.PATCH format, rename slow to do_slow in smcr_lgr_reg_rmbs().
>
>
> D. Wythe (10):
> net/smc: remove locks smc_client_lgr_pending and
> smc_server_lgr_pending
> net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending
> net/smc: allow confirm/delete rkey response deliver multiplex
> net/smc: make SMC_LLC_FLOW_RKEY run concurrently
> net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
> net/smc: use read semaphores to reduce unnecessary blocking in
> smc_buf_create() & smcr_buf_unuse()
> net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs()
> net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
> net/smc: Fix potential panic dues to unprotected
> smc_llc_srv_add_link()
> net/smc: fix application data exception
>
> net/smc/af_smc.c | 42 +++--
> net/smc/smc_core.c | 443 +++++++++++++++++++++++++++++++++++++++++++++++------
> net/smc/smc_core.h | 78 +++++++++-
> net/smc/smc_llc.c | 286 +++++++++++++++++++++++++---------
> net/smc/smc_llc.h | 6 +
> net/smc/smc_wr.c | 10 --
> net/smc/smc_wr.h | 10 ++
> 7 files changed, 725 insertions(+), 150 deletions(-)
>
D.,
I'm sorry.
I replied to the patch 01/10 with the test results and not the cover
letter. I have a filter on my inbox separating everything for "net/smc:"
and the keywords are missing on this cover letter.
Mea culpa.
https://lore.kernel.org/netdev/1767b6e4-0053-728b-9722-add68da13781@linux.ibm.com/
- Jan
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending
2022-09-07 8:10 ` Jan Karcher
@ 2022-09-16 5:16 ` D. Wythe
0 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-09-16 5:16 UTC (permalink / raw)
To: Jan Karcher, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
Hi, Jan
Thanks a lot for your test; your information is very important!
And sorry for the late reply; this issue is quite hidden but stupid.
Here's the problem:
We use ini->is_smcd to determine whether the lock needs to be removed in our patch;
however, this value is only available after smc_listen_find_device() is invoked.
Before that, the value is always false (kzalloc). Unfortunately, we had used it before then.
In other words, it is equivalent to removing the SMC-D lock without implementing the SMC-D link cluster,
which leads to a large number of critical section issues, including abnormal downgrades to SMC-R,
residual connection state, etc.
Considering that, it seems that we have to implement the link cluster for SMC-D too. It won't take much time;
I expect to send a new version next week.
Thanks
D. Wythe
On 9/7/22 4:10 PM, Jan Karcher wrote:
>
>
> On 02.09.2022 13:25, D. Wythe wrote:
>>
>>
>> On 8/31/22 11:04 PM, Jan Karcher wrote:
>>>
>>>
>>> On 26.08.2022 11:51, D. Wythe wrote:
>>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>>
>>>> This patch attempts to remove the locks named smc_client_lgr_pending and
>>>> smc_server_lgr_pending, which aim to serialize the creation of link
>>>> groups. However, once a link group already exists, those locks are
>>>> meaningless; worse still, they make incoming connections have to be
>>>> queued one after the other.
>>>>
>>>> Now, the creation of link groups is no longer driven by competition,
>>>> but allocated through the following strategy.
>>>>
>>>> 1. Try to find a suitable link group; if successful, the current connection
>>>> is considered a non-first-contact connection. Ends.
>>>>
>>>> 2. Check the number of connections currently waiting for a suitable
>>>> link group to be created; if it is not less than the number of link
>>>> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
>>>> increase the number of link groups to be created; the current connection
>>>> is considered the first-contact connection. Ends.
>>>>
>>>> 3. Increase the number of connections currently waiting, and wait
>>>> to be woken up.
>>>>
>>>> 4. Decrease the number of connections currently waiting; go to 1.
>>>>
>>>> We wake up the connection that was put to sleep in stage 3 through
>>>> the SMC link state change event. Once the link moves out of the
>>>> SMC_LNK_ACTIVATING state, decrease the number of link groups to
>>>> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
>>>> connections.
>>>>
>>>> In the implementation, we introduce the concept of a link cluster, which is
>>>> a collection of links with the same characteristics (see
>>>> smcr_lnk_cluster_cmpfn() for more details), which makes it possible to
>>>> wake up efficiently in the N vs. 1 scenario.
>>>>
>>>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>>>
>>> Hello D.,
>>>
>>> thanks for the v2 and the patience.
>>> I got to testing and as with v1 I want to share our findings with you. If you need more information or want us to look deeper into the findings please let us know.
>>>
>>> Regarding SMC-R test-suite:
>>> We see a refcount error during one of our stress tests. This leads us to believe that the smc_link_cluster_put() to smc_link_cluster_hold() ratio is not right anymore.
>>> The patch provided by yacan does fix this issue, but we did not verify whether it is the right way to balance the hold and put calls.
>>>
>>> [root@t8345011 ~]# journalctl --dmesg | tail -100
>>> Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown started
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SINGLE, pnetid NET25
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: ASYMMETRIC_PEER, pnetid NET25
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SYMMETRIC, pnetid NET25
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here ]------------
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow; use-after-free.
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm: kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events smc_llc_add_link_work [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea 0000000000000027 0000000000000026 000000005c3151e0
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000fee80000 0000038000000001 000000008e0e9a00 000000008de79c24
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 0000038000000000 000003ff803f05ac 0000000095038360 000000008de79c00
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: 00000000828ca100 0000000095038360 000000005b31f328 0000038000943b50
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 000000005b31f31c: c02000466122 larl %r2,000000005bbeb560
>>> 000000005b31f322: c0e500232e53 brasl %r14,000000005b784fc8
>>> #000000005b31f328: af000000 mc 0,0
>>> >000000005b31f32c: a7280001 lhi %r2,1
>>> 000000005b31f330: ebeff0a00004 lmg %r14,%r15,160(%r15)
>>> 000000005b31f336: ec223fbf0055 risbg %r2,%r2,63,191,0
>>> 000000005b31f33c: 07fe bcr 15,%r14
>>> 000000005b31f33e: 47000700 bc 0,1792
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b31f32c>] refcount_dec_not_one+0x8c/0xa8
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>] refcount_dec_not_one+0x88/0xa8)
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803ef16a>] smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803f05ac>] smcr_link_clear+0x5c/0x1b0 [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000003ff803fadf4>] smc_llc_add_link_work+0x43c/0x470 [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f0e2>] process_one_work+0x1fa/0x478
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac1f88c>] worker_thread+0x64/0x468
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005ac28580>] kthread+0x108/0x110
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005abaf2dc>] __ret_from_fork+0x3c/0x58
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b7a4d6a>] ret_from_fork+0xa/0x40
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: [<000000005b785028>] __warn_printk+0x60/0x68
>>
>> Thank you for your test, I need to think about it, please give me some time.
>>
>>
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 0000000000000000 ]---
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>>> [root@t8345011 ~]#
>>>
>>>
>>>
>>> Regarding SMC-D test-suite:
>>> For SMC-D we also see errors during another stress test. While we expect connections to fall back to TCP due to the limit of parallel connections, your patch introduces TCP fallbacks with a new reason.
>>>
>>> [root@t8345011 ~]# journalctl --dmesg | tail -10
>>> Aug 31 16:30:07 t8345011.lnxne.boe smc-tests: test_oob7_send_multi_urg_at_start started
>>> Aug 31 16:30:16 t8345011.lnxne.boe smc-tests: test_oob8_ignore_some_urg_data started
>>> Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second started
>>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
>>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test started
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SINGLE, pnetid NET25
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP: Possible SYN flooding on port 51897. Sending cookies. Check SNMP counters.
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SYMMETRIC, pnetid NET25
>>>
>>> ^
>>> I am wondering why we see SMC-R dmesgs even if we communicate with SMC-D. Gotta verify that. Can be an error on our side.
>>
>> This is very weird. Were there no such SMC-R dmesgs before applying my PATCH?
>>
>> I am not sure if there is logic to downgrade SMC-D to SMC-R; maybe it is related to 0x03010000.
>> I need to check the code; I will send out the reason as soon as possible.
>>
>>
>>> [root@t8345011 ~]#
>>> [root@t8345011 ~]# smcss
>>> ACTIVE 00000 0067005 10.25.45.10:48096 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0067001 10.25.45.10:48060 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0066999 10.25.45.10:48054 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0068762 10.25.45.10:48046 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0066997 10.25.45.10:48044 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0068760 10.25.45.10:48036 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0066995 10.25.45.10:48026 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0068758 10.25.45.10:48024 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0066993 10.25.45.10:48022 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0068756 10.25.45.10:48006 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0066991 10.25.45.10:47998 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0068754 10.25.45.10:47984 10.25.45.11:51897 0000 SMCD
>>> ACTIVE 00000 0067124 10.25.45.11:51897 10.25.45.10:48314 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0067121 10.25.45.11:51897 10.25.45.10:48302 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0067120 10.25.45.11:51897 10.25.45.10:48284 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0067114 10.25.45.11:51897 10.25.45.10:48282 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0067115 10.25.45.11:51897 10.25.45.10:48254 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0067111 10.25.45.11:51897 10.25.45.10:48250 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066415 10.25.45.11:51897 10.25.45.10:48242 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0067113 10.25.45.11:51897 10.25.45.10:48230 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066409 10.25.45.11:51897 10.25.45.10:48202 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066413 10.25.45.11:51897 10.25.45.10:48214 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066414 10.25.45.11:51897 10.25.45.10:48204 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066397 10.25.45.11:51897 10.25.45.10:48120 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066399 10.25.45.11:51897 10.25.45.10:48084 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0066396 10.25.45.11:51897 10.25.45.10:48078 0000 TCP 0x05000000/0x030d0000
>>> ACTIVE 00000 0062632 10.25.45.11:51897 10.25.45.10:43120 0000 TCP 0x03010000
>>> ACTIVE 00000 0062631 10.25.45.11:51897 10.25.45.10:43134 0000 TCP 0x03010000
>>> ACTIVE 00000 0062626 10.25.45.11:51897 10.25.45.10:43106 0000 TCP 0x03010000
>>> ACTIVE 00000 0062625 10.25.45.11:51897 10.25.45.10:43138 0000 TCP 0x03010000
>>> ACTIVE 00000 0062621 10.25.45.11:51897 10.25.45.10:43160 0000 TCP 0x03010000
>>> ACTIVE 00000 0061580 10.25.45.11:51897 10.25.45.10:42820 0000 TCP 0x03010000
>>> ACTIVE 00000 0061558 10.25.45.11:51897 10.25.45.10:42792 0000 TCP 0x03010000
>>> ACTIVE 00000 0061549 10.25.45.11:51897 10.25.45.10:42816 0000 TCP 0x03010000
>>> ACTIVE 00000 0061548 10.25.45.11:51897 10.25.45.10:42764 0000 TCP 0x03010000
>>> ACTIVE 00000 0061544 10.25.45.11:51897 10.25.45.10:42804 0000 TCP 0x03010000
>>> ACTIVE 00000 0061543 10.25.45.11:51897 10.25.45.10:42856 0000 TCP 0x03010000
>>> ACTIVE 00000 0061542 10.25.45.11:51897 10.25.45.10:42756 0000 TCP 0x03010000
>>> ACTIVE 00000 0062554 10.25.45.11:51897 10.25.45.10:42852 0000 TCP 0x03010000
>>> ACTIVE 00000 0062553 10.25.45.11:51897 10.25.45.10:42844 0000 TCP 0x03010000
>>> ACTIVE 00000 0062549 10.25.45.11:51897 10.25.45.10:42836 0000 TCP 0x03010000
>>>
>>> ^
>>> Here SMCD and 0x05000000/0x030d0000 are expected. But:
>>> [353] smcss confirmed connection of type SMCD
>>> [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
sysctl -w net.ipv4.tcp_syncookies=0
>>
>> Can you retry your test after setting the above configuration? When TCP detects a potential flooding attack,
>> it starts sending SYN cookies to verify traffic. In this case, SMC can't work, which then triggers a fallback with
>> error code 0x03010000.
>>
>> This doesn't seem to be a problem that my PATCH can cause, but my PATCH removes the lock in
>> the handshake phase, which may increase the rate at which your test initiates connections.
>> But I can't be sure ...
>>
>>
>>> We also experience that the lsmod count stays above 2 even after the testcase finished, and it takes quite a while before it goes down again (we send a kill signal at the end of our testcase).
>>
>>>
>>> During test (which is fine)
>>>
>>> [root@t8345011 ~]# lsmod | grep smc
>>> smc_diag 16384 0
>>> smc 225280 2981 ism,smc_diag
>>> ib_core 413696 3 smc,ib_uverbs,mlx5_ib
>>>
>>> Count > 2 even after tests finish!
>>>
>>> [root@t8345011 ~]# lsmod | grep smc
>>> smc_diag 16384 0
>>> smc 225280 40 ism,smc_diag
>>> ib_core 413696 3 smc,ib_uverbs,mlx5_ib
>>
>>> Let us know if you need any more information.
>>> Thanks, Jan
>>
>>
>> This usually means that there are still connections that were never really destroyed.
>> Can you try the following to see if there are any remaining connections?
>>
>> smcd linkgroup; # or smcr, depending on the mode; if any remain, can you show us the connection state (smcss -r or -d)?
>>
>> ps aux | grep D; # check whether any worker threads are hung; if any, please show us /proc/$PID/stack.
>>
>>
>> D. Wythe
>> Thanks.
>>
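The diagnostics quoted above can be wrapped in a small helper. A minimal sketch, assuming "check for hung work threads" means scanning for tasks in uninterruptible sleep (state D); the /proc walk is an assumption of mine, and reading /proc/<pid>/stack typically requires root:

```shell
#!/bin/sh
# List tasks stuck in uninterruptible sleep (state D) and dump their
# kernel stacks -- a hung smc worker would show up here.
for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    comm=$(cat "/proc/$pid/comm" 2>/dev/null)
    echo "=== pid $pid ($comm) ==="
    cat "/proc/$pid/stack" 2>/dev/null
done
```

On a healthy system this prints nothing; any smc worker listed here is a candidate for the leftover-connection problem described below.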
>
> Thank you for the tip with the syncookies. We disabled them on both systems, and here is the new output, which should be less confusing.
>
> For your understanding of the output:
> t8345010 is the system driving the tests and is acting as the client in this testcase.
> t8345011 is the pair system and the server of this test.
> What we are doing is spawning a lot of connections between the two systems to see what happens when there is stress (in terms of connection handling) on the system.
>
> We see the following:
> - The driver falls back to SMCR on many occasions. This should not happen. Also note the mismatch in the number of connections handled. There were no other connections besides the test.
>
> T8345010
> > SMC-D Connections Summary
> > Total connections handled 1012
> > SMC-R Connections Summary
> > Total connections handled 1512
>
> T8345011
> > SMC-D Connections Summary
> > Total connections handled 1190
> > SMC-R Connections Summary
> > Total connections handled 1513
>
>
> - Linkgroups for the SMCD & SMCR connections are being built up.
>
> T8345011
> > [root@t8345011 ~]# smcd linkgroup
> > LG-ID VLAN #Conns PNET-ID
> > 00000300 0 37 NET25
> > [root@t8345011 ~]# smcr linkgroup
> > LG-ID LG-Role LG-Type VLAN #Conns PNET-ID
> > 00000400 SERV SYM 0 0 NET25
> > [ 5 more LG 0500-0900]
>
>
> - Linkgroups for the SMCD & SMCR connections are torn down once the clients finish.
> - ALL SMCR linkgroups are cleared completely as expected. They still reside empty for a while, which is fine.
> - The SMCD linkgroups are NOT cleared all the way. A few connections stay in there (see output above).
> - If we perform smcss on the server side, those connections are listed there as ACTIVE, while the smcss list on the client side is empty.
>
> T8345011
> > [root@t8345011 ~]# smcss
> > State UID Inode Local Address Peer Address Intf Mode
> > ACTIVE 00000 0100758 10.25.45.11:40237 10.25.45.10:55790 0000 SMCD
> > [ 36 more ACTIVE connections ]
>
>
> - The remaining ACTIVE connections on the server are reflected in the smcd linkgroup #Conns as well.
> - On the client, the lsmod count for the smc module is 39, also reflecting the leftover connections.
>
> T8345010
> > [root@t8345010 tela-kernel]# lsmod |grep smc
> > smc 225280 39 ism,smc_diag
>
>
> - On the server the lsmod count for the smc module is 79.
>
> T8345011
> > [root@t8345011 ~]# lsmod | grep smc
> > smc 225280 79 ism,smc_diag
>
>
> - The most important smc_dbg outputs are provided, showing that the client is pretty clean while the server is still handling ghost connections.
>
> T8345011
> > [root@t8345011 ~]# smc_dbg
> > State UID Inode Local Address Peer Address Intf Mode GID Token Peer-GID Peer-Token Linkid
> > ACTIVE 00000 0100758 10.25.45.11:40237 10.25.45.10:55790 0000 SMCD 120014a12e488561 0000890fd0000000 3e0014a32e488561 00008a0bd0000000 00000300
> > State UID Inode Local Address Peer Address Intf Mode Shutd Token Sndbuf Rcvbuf Peerbuf rxprod-Cursor rxcons-Cursor rxFlags txprod-Cursor txcons-Cursor txFlags txprep-Cursor txsent-Cursor txfin-Cursor
> > ACTIVE 00000 0100758 10.25.45.11:40237 10.25.45.10:55790 0000 SMCD <-> 00001611 00004000 0000ffe0 0000ffe0 0000:00000000 0000:00000000 00:00 0000:00000000 0000:00000000 00:00 0000:00000000 0000:00000000 0000:00000000
>
>
> - Via netstat we see that the server is in CLOSE_WAIT for these connections and the client in FIN_WAIT2.
>
> T8345010
> > [root@t8345010 tela-kernel]# netstat -nta
> > Proto Recv-Q Send-Q Local Address Foreign Address State
> > tcp 0 0 10.25.45.10:55790 10.25.45.11:40237 FIN_WAIT2
> T8345011
> > [root@t8345011 ~]# netstat -nta | grep "40237"
> > tcp 1 0 10.25.45.11:40237 10.25.45.10:55790 CLOSE_WAIT
>
>
> While I'm pretty new to the mailing list, we had a discussion about how to provide the log data in a reasonable way.
> To avoid flooding the list, we decided to go with the short output above. If that is not enough for you, shoot me a message and I can send you the full output outside the mailing list.
> If you have any ideas on how to provide larger output in a reasonable way, feel free to share your opinion.
>
> I hope the new output helps you locate the error.
> Feel free to contact us in case you have questions.
> - Jan
* Re: [PATCH net-next v2 10/10] net/smc: fix application data exception
2022-09-08 9:37 ` Wen Gu
@ 2022-09-16 5:24 ` D. Wythe
0 siblings, 0 replies; 22+ messages in thread
From: D. Wythe @ 2022-09-16 5:24 UTC (permalink / raw)
To: Wen Gu, kgraul, wenjia; +Cc: kuba, davem, netdev, linux-s390, linux-rdma
Hi, Wen Gu,
This is indeed the same issue; I will fix it in the next version.
Thanks
D. Wythe
On 9/8/22 5:37 PM, Wen Gu wrote:
>
>
> On 2022/8/26 17:51, D. Wythe wrote:
>
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> After we optimized the parallelism of SMC-R connection
>> establishment, there is a certain probability that the following
>> exception occurs in the wrk benchmark test:
>>
>> Running 10s test @ http://11.213.45.6:80
>> 8 threads and 64 connections
>> Thread Stats Avg Stdev Max +/- Stdev
>> Latency 3.72ms 13.94ms 245.33ms 94.17%
>> Req/Sec 1.96k 713.67 5.41k 75.16%
>> 155262 requests in 10.10s, 23.10MB read
>> Non-2xx or 3xx responses: 3
>>
>> The errors are HTTP 400 responses, which is a serious
>> exception in our test and means the application data was
>> corrupted.
>>
>> Consider the following scenarios:
>>
>> CPU0 CPU1
>>
>> buf_desc->used = 0;
>> cmpxchg(buf_desc->used, 0, 1)
>> deal_with(buf_desc)
>>
>> memset(buf_desc->cpu_addr,0);
>>
>> This will cause the data received by a victim connection to be cleared,
>> thus triggering an HTTP 400 error in the server.
>>
>> This patch swaps the order of clearing `used` and the memset, and
>> adds a barrier to ensure memory consistency.
>>
>> Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer")
>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>> ---
>> net/smc/smc_core.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
>> index 84bf84c..fdad953 100644
>> --- a/net/smc/smc_core.c
>> +++ b/net/smc/smc_core.c
>> @@ -1380,8 +1380,9 @@ static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
>> smc_buf_free(lgr, is_rmb, buf_desc);
>> } else {
>> - buf_desc->used = 0;
>> - memset(buf_desc->cpu_addr, 0, buf_desc->len);
>> + /* memzero_explicit provides potential memory barrier semantics */
>> + memzero_explicit(buf_desc->cpu_addr, buf_desc->len);
>> + WRITE_ONCE(buf_desc->used, 0);
>> }
>> }
>
> It seems that the same issue exists in smc_buf_unuse(). Maybe it also needs to be fixed?
>
>
> static void smc_buf_unuse(struct smc_connection *conn,
> struct smc_link_group *lgr)
> {
> if (conn->sndbuf_desc) {
> if (!lgr->is_smcd && conn->sndbuf_desc->is_vm) {
> smcr_buf_unuse(conn->sndbuf_desc, false, lgr);
> } else {
> conn->sndbuf_desc->used = 0;
> memset(conn->sndbuf_desc->cpu_addr, 0,
> conn->sndbuf_desc->len);
> ^...................
> }
> }
> if (conn->rmb_desc) {
> if (!lgr->is_smcd) {
> smcr_buf_unuse(conn->rmb_desc, true, lgr);
> } else {
> conn->rmb_desc->used = 0;
> memset(conn->rmb_desc->cpu_addr, 0,
> conn->rmb_desc->len +
> sizeof(struct smcd_cdc_msg));
> ^...................
> }
> }
> }
>
> Thanks,
> Wen Gu
end of thread, other threads:[~2022-09-16 5:24 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-26 9:51 [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending D. Wythe
2022-08-29 14:48 ` Jan Karcher
2022-08-31 15:04 ` Jan Karcher
2022-09-02 11:25 ` D. Wythe
2022-09-07 8:10 ` Jan Karcher
2022-09-16 5:16 ` D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 02/10] net/smc: fix SMC_CLC_DECL_ERR_REGRMB without smc_server_lgr_pending D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 03/10] net/smc: allow confirm/delete rkey response deliver multiplex D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 04/10] net/smc: make SMC_LLC_FLOW_RKEY run concurrently D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 05/10] net/smc: llc_conf_mutex refactor, replace it with rw_semaphore D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 06/10] net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse() D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 07/10] net/smc: reduce unnecessary blocking in smcr_lgr_reg_rmbs() D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 08/10] net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 09/10] net/smc: Fix potential panic dues to unprotected smc_llc_srv_add_link() D. Wythe
2022-08-26 9:51 ` [PATCH net-next v2 10/10] net/smc: fix application data exception D. Wythe
2022-09-08 9:37 ` Wen Gu
2022-09-16 5:24 ` D. Wythe
2022-08-27 1:32 ` [PATCH net-next v2 00/10] optimize the parallelism of SMC-R connections Jakub Kicinski
2022-08-29 3:25 ` Tony Lu
2022-08-29 3:28 ` D. Wythe
2022-09-09 6:59 ` Jan Karcher