All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net 0/2] net/smc: fix kernel panic caused by race of smc_sock
@ 2021-12-28  9:03 Dust Li
  2021-12-28  9:03 ` [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready Dust Li
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Dust Li @ 2021-12-28  9:03 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

This patchset fixes the race between smc_release triggered by
close(2) and cdc_handle triggered by underlaying RDMA device.

The race is caused because the smc_connection may been released
before the pending tx CDC messages got its CQEs. In order to fix
this, I add a counter to track how many pending WRs we have posted
through the smc_connection, and only release the smc_connection
after there is no pending WRs on the connection.

The first patch prevents posting WR on a QP that is not in RTS
state. This patch is needed because if we post WR on a QP that
is not in RTS state, ib_post_send() may success but no CQE will
return, and that will confuse the counter tracking the pending
WRs.

The second patch add a counter to track how many WRs were posted
through the smc_connection, and don't reset the QP on link destroying
to prevent leak of the counter.

Dust Li (2):
  net/smc: don't send CDC/LLC message if link not ready
  net/smc: fix kernel panic caused by race of smc_sock

 net/smc/smc.h      |  5 +++++
 net/smc/smc_cdc.c  | 52 +++++++++++++++++++++-------------------------
 net/smc/smc_cdc.h  |  2 +-
 net/smc/smc_core.c | 27 ++++++++++++++++++------
 net/smc/smc_core.h |  6 ++++++
 net/smc/smc_ib.c   |  4 ++--
 net/smc/smc_ib.h   |  1 +
 net/smc/smc_llc.c  |  2 +-
 net/smc/smc_wr.c   | 45 +++++----------------------------------
 net/smc/smc_wr.h   |  5 ++---
 10 files changed, 68 insertions(+), 81 deletions(-)

-- 
2.19.1.3.ge56e4f7


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready
  2021-12-28  9:03 [PATCH net 0/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
@ 2021-12-28  9:03 ` Dust Li
  2021-12-29 12:36   ` Karsten Graul
  2021-12-31  6:08   ` [PATCH net] net/smc: add comments for smc_link_{usable|sendable} Dust Li
  2021-12-28  9:03 ` [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
  2021-12-28 12:50 ` [PATCH net 0/2] " patchwork-bot+netdevbpf
  2 siblings, 2 replies; 13+ messages in thread
From: Dust Li @ 2021-12-28  9:03 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

We found smc_llc_send_link_delete_all() sometimes wait
for 2s timeout when testing with RDMA link up/down.
It is possible when a smc_link is in ACTIVATING state,
the underlaying QP is still in RESET or RTR state, which
cannot send any messages out.

smc_llc_send_link_delete_all() use smc_link_usable() to
checks whether the link is usable, if the QP is still in
RESET or RTR state, but the smc_link is in ACTIVATING, this
LLC message will always fail without any CQE entering the
CQ, and we will always wait 2s before timeout.

Since we cannot send any messages through the QP before
the QP enter RTS. I add a wrapper smc_link_sendable()
which checks the state of QP along with the link state.
And replace smc_link_usable() with smc_link_sendable()
in all LLC & CDC message sending routine.

Fixes: 5f08318f617b ("smc: connection data control (CDC)")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc_core.c | 2 +-
 net/smc/smc_core.h | 6 ++++++
 net/smc/smc_llc.c  | 2 +-
 net/smc/smc_wr.c   | 4 ++--
 net/smc/smc_wr.h   | 2 +-
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 387d28b2f8dd..55ca175e8d57 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -647,7 +647,7 @@ static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		struct smc_link *lnk = &lgr->lnk[i];
 
-		if (smc_link_usable(lnk))
+		if (smc_link_sendable(lnk))
 			lnk->state = SMC_LNK_INACTIVE;
 	}
 	wake_up_all(&lgr->llc_msg_waiter);
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 59cef3b830d8..d63b08274197 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -415,6 +415,12 @@ static inline bool smc_link_usable(struct smc_link *lnk)
 	return true;
 }
 
+static inline bool smc_link_sendable(struct smc_link *lnk)
+{
+	return smc_link_usable(lnk) &&
+		lnk->qp_attr.cur_qp_state == IB_QPS_RTS;
+}
+
 static inline bool smc_link_active(struct smc_link *lnk)
 {
 	return lnk->state == SMC_LNK_ACTIVE;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index b102680296b8..3e9fd8a3124c 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1630,7 +1630,7 @@ void smc_llc_send_link_delete_all(struct smc_link_group *lgr, bool ord, u32 rsn)
 	delllc.reason = htonl(rsn);
 
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
-		if (!smc_link_usable(&lgr->lnk[i]))
+		if (!smc_link_sendable(&lgr->lnk[i]))
 			continue;
 		if (!smc_llc_send_message_wait(&lgr->lnk[i], &delllc))
 			break;
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 600ab5889227..aaf3028a6fe9 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -188,7 +188,7 @@ void smc_wr_tx_cq_handler(struct ib_cq *ib_cq, void *cq_context)
 static inline int smc_wr_tx_get_free_slot_index(struct smc_link *link, u32 *idx)
 {
 	*idx = link->wr_tx_cnt;
-	if (!smc_link_usable(link))
+	if (!smc_link_sendable(link))
 		return -ENOLINK;
 	for_each_clear_bit(*idx, link->wr_tx_mask, link->wr_tx_cnt) {
 		if (!test_and_set_bit(*idx, link->wr_tx_mask))
@@ -231,7 +231,7 @@ int smc_wr_tx_get_free_slot(struct smc_link *link,
 	} else {
 		rc = wait_event_interruptible_timeout(
 			link->wr_tx_wait,
-			!smc_link_usable(link) ||
+			!smc_link_sendable(link) ||
 			lgr->terminating ||
 			(smc_wr_tx_get_free_slot_index(link, &idx) != -EBUSY),
 			SMC_WR_TX_WAIT_FREE_SLOT_TIME);
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f353311e6f84..48ed9b08ac7a 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -62,7 +62,7 @@ static inline void smc_wr_tx_set_wr_id(atomic_long_t *wr_tx_id, long val)
 
 static inline bool smc_wr_tx_link_hold(struct smc_link *link)
 {
-	if (!smc_link_usable(link))
+	if (!smc_link_sendable(link))
 		return false;
 	atomic_inc(&link->wr_tx_refcnt);
 	return true;
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock
  2021-12-28  9:03 [PATCH net 0/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
  2021-12-28  9:03 ` [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready Dust Li
@ 2021-12-28  9:03 ` Dust Li
  2021-12-29 12:33   ` Karsten Graul
  2021-12-28 12:50 ` [PATCH net 0/2] " patchwork-bot+netdevbpf
  2 siblings, 1 reply; 13+ messages in thread
From: Dust Li @ 2021-12-28  9:03 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
but smc_release() has already freed it.

[ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88
[ 4570.696048] #PF: supervisor write access in kernel mode
[ 4570.696728] #PF: error_code(0x0002) - not-present page
[ 4570.697401] PGD 0 P4D 0
[ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111
[ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0
[ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30
<...>
[ 4570.711446] Call Trace:
[ 4570.711746]  <IRQ>
[ 4570.711992]  smc_cdc_tx_handler+0x41/0xc0
[ 4570.712470]  smc_wr_tx_tasklet_fn+0x213/0x560
[ 4570.712981]  ? smc_cdc_tx_dismisser+0x10/0x10
[ 4570.713489]  tasklet_action_common.isra.17+0x66/0x140
[ 4570.714083]  __do_softirq+0x123/0x2f4
[ 4570.714521]  irq_exit_rcu+0xc4/0xf0
[ 4570.714934]  common_interrupt+0xba/0xe0

Though smc_cdc_tx_handler() checked the existence of smc connection,
smc_release() may have already dismissed and released the smc socket
before smc_cdc_tx_handler() further visits it.

smc_cdc_tx_handler()           |smc_release()
if (!conn)                     |
                               |
                               |smc_cdc_tx_dismiss_slots()
                               |      smc_cdc_tx_dismisser()
                               |
                               |sock_put(&smc->sk) <- last sock_put,
                               |                      smc_sock freed
bh_lock_sock(&smc->sk) (panic) |

To make sure we won't receive any CDC messages after we free the
smc_sock, add a refcount on the smc_connection for inflight CDC
message(posted to the QP but haven't received related CQE), and
don't release the smc_connection until all the inflight CDC messages
haven been done, for both success or failed ones.

Using refcount on CDC messages brings another problem: when the link
is going to be destroyed, smcr_link_clear() will reset the QP, which
then remove all the pending CQEs related to the QP in the CQ. To make
sure all the CQEs will always come back so the refcount on the
smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced
by smc_ib_modify_qp_error().
And remove the timeout in smc_wr_tx_wait_no_pending_sends() since we
need to wait for all pending WQEs done, or we may encounter use-after-
free when handling CQEs.

For IB device removal routine, we need to wait for all the QPs on that
device been destroyed before we can destroy CQs on the device, or
the refcount on smc_connection won't reach 0 and smc_sock cannot be
released.

Fixes: 5f08318f617b ("smc: connection data control (CDC)")
Reported-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc.h      |  5 +++++
 net/smc/smc_cdc.c  | 52 +++++++++++++++++++++-------------------------
 net/smc/smc_cdc.h  |  2 +-
 net/smc/smc_core.c | 25 +++++++++++++++++-----
 net/smc/smc_ib.c   |  4 ++--
 net/smc/smc_ib.h   |  1 +
 net/smc/smc_wr.c   | 41 +++---------------------------------
 net/smc/smc_wr.h   |  3 +--
 8 files changed, 57 insertions(+), 76 deletions(-)

diff --git a/net/smc/smc.h b/net/smc/smc.h
index 2b663819fd53..463e40eb6c2f 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -185,6 +185,11 @@ struct smc_connection {
 	u16			tx_cdc_seq;	/* sequence # for CDC send */
 	u16			tx_cdc_seq_fin;	/* sequence # - tx completed */
 	spinlock_t		send_lock;	/* protect wr_sends */
+	atomic_t		cdc_pend_tx_wr; /* number of pending tx CDC wqe
+						 * - inc when post wqe,
+						 * - dec on polled tx cqe
+						 */
+	wait_queue_head_t	cdc_pend_tx_wq; /* wakeup on no cdc_pend_tx_wr*/
 	struct delayed_work	tx_work;	/* retry of smc_cdc_msg_send */
 	u32			tx_off;		/* base offset in peer rmb */
 
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 99acd337ba90..84c8a4374fdd 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -31,10 +31,6 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
 	struct smc_sock *smc;
 	int diff;
 
-	if (!conn)
-		/* already dismissed */
-		return;
-
 	smc = container_of(conn, struct smc_sock, conn);
 	bh_lock_sock(&smc->sk);
 	if (!wc_status) {
@@ -51,6 +47,12 @@ static void smc_cdc_tx_handler(struct smc_wr_tx_pend_priv *pnd_snd,
 			      conn);
 		conn->tx_cdc_seq_fin = cdcpend->ctrl_seq;
 	}
+
+	if (atomic_dec_and_test(&conn->cdc_pend_tx_wr) &&
+	    unlikely(wq_has_sleeper(&conn->cdc_pend_tx_wq)))
+		wake_up(&conn->cdc_pend_tx_wq);
+	WARN_ON(atomic_read(&conn->cdc_pend_tx_wr) < 0);
+
 	smc_tx_sndbuf_nonfull(smc);
 	bh_unlock_sock(&smc->sk);
 }
@@ -107,6 +109,10 @@ int smc_cdc_msg_send(struct smc_connection *conn,
 	conn->tx_cdc_seq++;
 	conn->local_tx_ctrl.seqno = conn->tx_cdc_seq;
 	smc_host_msg_to_cdc((struct smc_cdc_msg *)wr_buf, conn, &cfed);
+
+	atomic_inc(&conn->cdc_pend_tx_wr);
+	smp_mb__after_atomic(); /* Make sure cdc_pend_tx_wr added before post */
+
 	rc = smc_wr_tx_send(link, (struct smc_wr_tx_pend_priv *)pend);
 	if (!rc) {
 		smc_curs_copy(&conn->rx_curs_confirmed, &cfed, conn);
@@ -114,6 +120,7 @@ int smc_cdc_msg_send(struct smc_connection *conn,
 	} else {
 		conn->tx_cdc_seq--;
 		conn->local_tx_ctrl.seqno = conn->tx_cdc_seq;
+		atomic_dec(&conn->cdc_pend_tx_wr);
 	}
 
 	return rc;
@@ -136,7 +143,18 @@ int smcr_cdc_msg_send_validation(struct smc_connection *conn,
 	peer->token = htonl(local->token);
 	peer->prod_flags.failover_validation = 1;
 
+	/* We need to set pend->conn here to make sure smc_cdc_tx_handler()
+	 * can handle properly
+	 */
+	smc_cdc_add_pending_send(conn, pend);
+
+	atomic_inc(&conn->cdc_pend_tx_wr);
+	smp_mb__after_atomic(); /* Make sure cdc_pend_tx_wr added before post */
+
 	rc = smc_wr_tx_send(link, (struct smc_wr_tx_pend_priv *)pend);
+	if (unlikely(rc))
+		atomic_dec(&conn->cdc_pend_tx_wr);
+
 	return rc;
 }
 
@@ -193,31 +211,9 @@ int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn)
 	return rc;
 }
 
-static bool smc_cdc_tx_filter(struct smc_wr_tx_pend_priv *tx_pend,
-			      unsigned long data)
+void smc_cdc_wait_pend_tx_wr(struct smc_connection *conn)
 {
-	struct smc_connection *conn = (struct smc_connection *)data;
-	struct smc_cdc_tx_pend *cdc_pend =
-		(struct smc_cdc_tx_pend *)tx_pend;
-
-	return cdc_pend->conn == conn;
-}
-
-static void smc_cdc_tx_dismisser(struct smc_wr_tx_pend_priv *tx_pend)
-{
-	struct smc_cdc_tx_pend *cdc_pend =
-		(struct smc_cdc_tx_pend *)tx_pend;
-
-	cdc_pend->conn = NULL;
-}
-
-void smc_cdc_tx_dismiss_slots(struct smc_connection *conn)
-{
-	struct smc_link *link = conn->lnk;
-
-	smc_wr_tx_dismiss_slots(link, SMC_CDC_MSG_TYPE,
-				smc_cdc_tx_filter, smc_cdc_tx_dismisser,
-				(unsigned long)conn);
+	wait_event(conn->cdc_pend_tx_wq, !atomic_read(&conn->cdc_pend_tx_wr));
 }
 
 /* Send a SMC-D CDC header.
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 0a0a89abd38b..696cc11f2303 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -291,7 +291,7 @@ int smc_cdc_get_free_slot(struct smc_connection *conn,
 			  struct smc_wr_buf **wr_buf,
 			  struct smc_rdma_wr **wr_rdma_buf,
 			  struct smc_cdc_tx_pend **pend);
-void smc_cdc_tx_dismiss_slots(struct smc_connection *conn);
+void smc_cdc_wait_pend_tx_wr(struct smc_connection *conn);
 int smc_cdc_msg_send(struct smc_connection *conn, struct smc_wr_buf *wr_buf,
 		     struct smc_cdc_tx_pend *pend);
 int smc_cdc_get_slot_and_msg_send(struct smc_connection *conn);
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 55ca175e8d57..a6849362f4dd 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1127,7 +1127,7 @@ void smc_conn_free(struct smc_connection *conn)
 			smc_ism_unset_conn(conn);
 		tasklet_kill(&conn->rx_tsklet);
 	} else {
-		smc_cdc_tx_dismiss_slots(conn);
+		smc_cdc_wait_pend_tx_wr(conn);
 		if (current_work() != &conn->abort_work)
 			cancel_work_sync(&conn->abort_work);
 	}
@@ -1204,7 +1204,7 @@ void smcr_link_clear(struct smc_link *lnk, bool log)
 	smc_llc_link_clear(lnk, log);
 	smcr_buf_unmap_lgr(lnk);
 	smcr_rtoken_clear_link(lnk);
-	smc_ib_modify_qp_reset(lnk);
+	smc_ib_modify_qp_error(lnk);
 	smc_wr_free_link(lnk);
 	smc_ib_destroy_queue_pair(lnk);
 	smc_ib_dealloc_protection_domain(lnk);
@@ -1336,7 +1336,7 @@ static void smc_conn_kill(struct smc_connection *conn, bool soft)
 		else
 			tasklet_unlock_wait(&conn->rx_tsklet);
 	} else {
-		smc_cdc_tx_dismiss_slots(conn);
+		smc_cdc_wait_pend_tx_wr(conn);
 	}
 	smc_lgr_unregister_conn(conn);
 	smc_close_active_abort(smc);
@@ -1459,11 +1459,16 @@ void smc_smcd_terminate_all(struct smcd_dev *smcd)
 /* Called when an SMCR device is removed or the smc module is unloaded.
  * If smcibdev is given, all SMCR link groups using this device are terminated.
  * If smcibdev is NULL, all SMCR link groups are terminated.
+ *
+ * We must wait here for QPs been destroyed before we destroy the CQs,
+ * or we won't received any CQEs and cdc_pend_tx_wr cannot reach 0 thus
+ * smc_sock cannot be released.
  */
 void smc_smcr_terminate_all(struct smc_ib_device *smcibdev)
 {
 	struct smc_link_group *lgr, *lg;
 	LIST_HEAD(lgr_free_list);
+	LIST_HEAD(lgr_linkdown_list);
 	int i;
 
 	spin_lock_bh(&smc_lgr_list.lock);
@@ -1475,7 +1480,7 @@ void smc_smcr_terminate_all(struct smc_ib_device *smcibdev)
 		list_for_each_entry_safe(lgr, lg, &smc_lgr_list.list, list) {
 			for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 				if (lgr->lnk[i].smcibdev == smcibdev)
-					smcr_link_down_cond_sched(&lgr->lnk[i]);
+					list_move_tail(&lgr->list, &lgr_linkdown_list);
 			}
 		}
 	}
@@ -1487,6 +1492,16 @@ void smc_smcr_terminate_all(struct smc_ib_device *smcibdev)
 		__smc_lgr_terminate(lgr, false);
 	}
 
+	list_for_each_entry_safe(lgr, lg, &lgr_linkdown_list, list) {
+		for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+			if (lgr->lnk[i].smcibdev == smcibdev) {
+				mutex_lock(&lgr->llc_conf_mutex);
+				smcr_link_down_cond(&lgr->lnk[i]);
+				mutex_unlock(&lgr->llc_conf_mutex);
+			}
+		}
+	}
+
 	if (smcibdev) {
 		if (atomic_read(&smcibdev->lnk_cnt))
 			wait_event(smcibdev->lnks_deleted,
@@ -1586,7 +1601,6 @@ static void smcr_link_down(struct smc_link *lnk)
 	if (!lgr || lnk->state == SMC_LNK_UNUSED || list_empty(&lgr->list))
 		return;
 
-	smc_ib_modify_qp_reset(lnk);
 	to_lnk = smc_switch_conns(lgr, lnk, true);
 	if (!to_lnk) { /* no backup link available */
 		smcr_link_clear(lnk, true);
@@ -1824,6 +1838,7 @@ int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 	conn->local_tx_ctrl.common.type = SMC_CDC_MSG_TYPE;
 	conn->local_tx_ctrl.len = SMC_WR_TX_SIZE;
 	conn->urg_state = SMC_URG_READ;
+	init_waitqueue_head(&conn->cdc_pend_tx_wq);
 	INIT_WORK(&smc->conn.abort_work, smc_conn_abort_work);
 	if (ini->is_smcd) {
 		conn->rx_off = sizeof(struct smcd_cdc_msg);
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index d93055ec17ae..fe5d5399c4e8 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -109,12 +109,12 @@ int smc_ib_modify_qp_rts(struct smc_link *lnk)
 			    IB_QP_MAX_QP_RD_ATOMIC);
 }
 
-int smc_ib_modify_qp_reset(struct smc_link *lnk)
+int smc_ib_modify_qp_error(struct smc_link *lnk)
 {
 	struct ib_qp_attr qp_attr;
 
 	memset(&qp_attr, 0, sizeof(qp_attr));
-	qp_attr.qp_state = IB_QPS_RESET;
+	qp_attr.qp_state = IB_QPS_ERR;
 	return ib_modify_qp(lnk->roce_qp, &qp_attr, IB_QP_STATE);
 }
 
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 07585937370e..bfa1c6bf6313 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -90,6 +90,7 @@ int smc_ib_create_queue_pair(struct smc_link *lnk);
 int smc_ib_ready_link(struct smc_link *lnk);
 int smc_ib_modify_qp_rts(struct smc_link *lnk);
 int smc_ib_modify_qp_reset(struct smc_link *lnk);
+int smc_ib_modify_qp_error(struct smc_link *lnk);
 long smc_ib_setup_per_ibdev(struct smc_ib_device *smcibdev);
 int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
 			     struct smc_buf_desc *buf_slot, u8 link_idx);
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index aaf3028a6fe9..bb0946b32479 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -62,13 +62,9 @@ static inline bool smc_wr_is_tx_pend(struct smc_link *link)
 }
 
 /* wait till all pending tx work requests on the given link are completed */
-int smc_wr_tx_wait_no_pending_sends(struct smc_link *link)
+void smc_wr_tx_wait_no_pending_sends(struct smc_link *link)
 {
-	if (wait_event_timeout(link->wr_tx_wait, !smc_wr_is_tx_pend(link),
-			       SMC_WR_TX_WAIT_PENDING_TIME))
-		return 0;
-	else /* timeout */
-		return -EPIPE;
+	wait_event(link->wr_tx_wait, !smc_wr_is_tx_pend(link));
 }
 
 static inline int smc_wr_tx_find_pending_index(struct smc_link *link, u64 wr_id)
@@ -87,7 +83,6 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
 	struct smc_wr_tx_pend pnd_snd;
 	struct smc_link *link;
 	u32 pnd_snd_idx;
-	int i;
 
 	link = wc->qp->qp_context;
 
@@ -128,14 +123,6 @@ static inline void smc_wr_tx_process_cqe(struct ib_wc *wc)
 	}
 
 	if (wc->status) {
-		for_each_set_bit(i, link->wr_tx_mask, link->wr_tx_cnt) {
-			/* clear full struct smc_wr_tx_pend including .priv */
-			memset(&link->wr_tx_pends[i], 0,
-			       sizeof(link->wr_tx_pends[i]));
-			memset(&link->wr_tx_bufs[i], 0,
-			       sizeof(link->wr_tx_bufs[i]));
-			clear_bit(i, link->wr_tx_mask);
-		}
 		if (link->lgr->smc_version == SMC_V2) {
 			memset(link->wr_tx_v2_pend, 0,
 			       sizeof(*link->wr_tx_v2_pend));
@@ -419,25 +406,6 @@ int smc_wr_reg_send(struct smc_link *link, struct ib_mr *mr)
 	return rc;
 }
 
-void smc_wr_tx_dismiss_slots(struct smc_link *link, u8 wr_tx_hdr_type,
-			     smc_wr_tx_filter filter,
-			     smc_wr_tx_dismisser dismisser,
-			     unsigned long data)
-{
-	struct smc_wr_tx_pend_priv *tx_pend;
-	struct smc_wr_rx_hdr *wr_tx;
-	int i;
-
-	for_each_set_bit(i, link->wr_tx_mask, link->wr_tx_cnt) {
-		wr_tx = (struct smc_wr_rx_hdr *)&link->wr_tx_bufs[i];
-		if (wr_tx->type != wr_tx_hdr_type)
-			continue;
-		tx_pend = &link->wr_tx_pends[i].priv;
-		if (filter(tx_pend, data))
-			dismisser(tx_pend);
-	}
-}
-
 /****************************** receive queue ********************************/
 
 int smc_wr_rx_register_handler(struct smc_wr_rx_handler *handler)
@@ -673,10 +641,7 @@ void smc_wr_free_link(struct smc_link *lnk)
 	smc_wr_wakeup_reg_wait(lnk);
 	smc_wr_wakeup_tx_wait(lnk);
 
-	if (smc_wr_tx_wait_no_pending_sends(lnk))
-		memset(lnk->wr_tx_mask, 0,
-		       BITS_TO_LONGS(SMC_WR_BUF_CNT) *
-						sizeof(*lnk->wr_tx_mask));
+	smc_wr_tx_wait_no_pending_sends(lnk);
 	wait_event(lnk->wr_reg_wait, (!atomic_read(&lnk->wr_reg_refcnt)));
 	wait_event(lnk->wr_tx_wait, (!atomic_read(&lnk->wr_tx_refcnt)));
 
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index 48ed9b08ac7a..47512ccce5ef 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -22,7 +22,6 @@
 #define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
 
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
-#define SMC_WR_TX_WAIT_PENDING_TIME	(5 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
 
@@ -130,7 +129,7 @@ void smc_wr_tx_dismiss_slots(struct smc_link *lnk, u8 wr_rx_hdr_type,
 			     smc_wr_tx_filter filter,
 			     smc_wr_tx_dismisser dismisser,
 			     unsigned long data);
-int smc_wr_tx_wait_no_pending_sends(struct smc_link *link);
+void smc_wr_tx_wait_no_pending_sends(struct smc_link *link);
 
 int smc_wr_rx_register_handler(struct smc_wr_rx_handler *handler);
 int smc_wr_rx_post_init(struct smc_link *link);
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net 0/2] net/smc: fix kernel panic caused by race of smc_sock
  2021-12-28  9:03 [PATCH net 0/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
  2021-12-28  9:03 ` [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready Dust Li
  2021-12-28  9:03 ` [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
@ 2021-12-28 12:50 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 13+ messages in thread
From: patchwork-bot+netdevbpf @ 2021-12-28 12:50 UTC (permalink / raw)
  To: Dust Li; +Cc: kgraul, davem, kuba, linux-s390, netdev, guwen, tonylu

Hello:

This series was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue, 28 Dec 2021 17:03:23 +0800 you wrote:
> This patchset fixes the race between smc_release triggered by
> close(2) and cdc_handle triggered by underlaying RDMA device.
> 
> The race is caused because the smc_connection may been released
> before the pending tx CDC messages got its CQEs. In order to fix
> this, I add a counter to track how many pending WRs we have posted
> through the smc_connection, and only release the smc_connection
> after there is no pending WRs on the connection.
> 
> [...]

Here is the summary with links:
  - [net,1/2] net/smc: don't send CDC/LLC message if link not ready
    https://git.kernel.org/netdev/net/c/90cee52f2e78
  - [net,2/2] net/smc: fix kernel panic caused by race of smc_sock
    https://git.kernel.org/netdev/net/c/349d43127dac

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock
  2021-12-28  9:03 ` [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
@ 2021-12-29 12:33   ` Karsten Graul
  2021-12-30  3:46     ` dust.li
  0 siblings, 1 reply; 13+ messages in thread
From: Karsten Graul @ 2021-12-29 12:33 UTC (permalink / raw)
  To: Dust Li, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On 28/12/2021 10:03, Dust Li wrote:
> A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
> but smc_release() has already freed it.

I am not sure about what happened here. 
Your patch removes the whole dismisser concept that was introduced to
solve exactly the problem you describe. And you implemented a different approach.

In theory, when smc_cdc_tx_handler() is called but the connection is already
freed than the connection should have gone through smc_cdc_tx_dismiss_slots(),
called by smc_conn_kill() or smc_conn_free(). If that happened there would be no
access to an already freed address in smc_cdc_tx_handler().

Can you explain why the code reached smc_cdc_tx_handler() with cdcpend->conn
pointing to a connection that is already freed? I think if there is a bug it should
be fixed instead of replacing the code by a new construct.

Thoughts?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready
  2021-12-28  9:03 ` [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready Dust Li
@ 2021-12-29 12:36   ` Karsten Graul
  2021-12-30  3:02     ` dust.li
  2021-12-31  6:08   ` [PATCH net] net/smc: add comments for smc_link_{usable|sendable} Dust Li
  1 sibling, 1 reply; 13+ messages in thread
From: Karsten Graul @ 2021-12-29 12:36 UTC (permalink / raw)
  To: Dust Li, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On 28/12/2021 10:03, Dust Li wrote:
> We found smc_llc_send_link_delete_all() sometimes wait
> for 2s timeout when testing with RDMA link up/down.
> It is possible when a smc_link is in ACTIVATING state,
> the underlaying QP is still in RESET or RTR state, which
> cannot send any messages out.

I see your point, but why do you needed to introduce a new wrapper instead of
extending the existing smc_link_usable() wrapper?
With that and without any comments the reader of the code does not know why there are
2 different functions and what is the reason for having two of them.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready
  2021-12-29 12:36   ` Karsten Graul
@ 2021-12-30  3:02     ` dust.li
  2021-12-30 18:55       ` Karsten Graul
  0 siblings, 1 reply; 13+ messages in thread
From: dust.li @ 2021-12-30  3:02 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On Wed, Dec 29, 2021 at 01:36:06PM +0100, Karsten Graul wrote:
>On 28/12/2021 10:03, Dust Li wrote:
>> We found smc_llc_send_link_delete_all() sometimes wait
>> for 2s timeout when testing with RDMA link up/down.
>> It is possible when a smc_link is in ACTIVATING state,
>> the underlaying QP is still in RESET or RTR state, which
>> cannot send any messages out.
>
>I see your point, but why do you needed to introduce a new wrapper instead of
>extending the existing smc_link_usable() wrapper?
>With that and without any comments the reader of the code does not know why there are
>2 different functions and what is the reason for having two of them.

Sorry for that, I should add some comments for those wrappers to
better explain that.

The reason we need two wrappers here is because the QP state for the
client side is not turned into RTS directly, but seperated into two
stages. In the middle on RTR to RTS, we still need smc_link_usable().

For example:
For the client side in first contact, the QP is still in RTR before it
receives the LLC_CONFIRM message. So for functions like smc_llc_wait()
or smc_llc_event_handler(), we can't use smc_link_sendable(), or we
can't even receive LLC_CONFIRM message anymore.

So the meaning for those 2 wrappers should be:
smc_link_usable():   the link is ready to receive RDMA messages
smc_link_sendable(): the link is ready to send RDMA messages

For those places that need to send any CDC or LLC message,
should go for smc_link_sendable(), otherwise, use smc_link_usable()
instead.

I saw David has already applied this to net, should I send another
patch to add some comments ?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock
  2021-12-29 12:33   ` Karsten Graul
@ 2021-12-30  3:46     ` dust.li
  0 siblings, 0 replies; 13+ messages in thread
From: dust.li @ 2021-12-30  3:46 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On Wed, Dec 29, 2021 at 01:33:20PM +0100, Karsten Graul wrote:
>On 28/12/2021 10:03, Dust Li wrote:
>> A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
>> but smc_release() has already freed it.
>
>I am not sure about what happened here. 
>Your patch removes the whole dismisser concept that was introduced to
>solve exactly the problem you describe. And you implemented a different approach.
>
>In theory, when smc_cdc_tx_handler() is called but the connection is already
>freed than the connection should have gone through smc_cdc_tx_dismiss_slots(),
>called by smc_conn_kill() or smc_conn_free(). If that happened there would be no
>access to an already freed address in smc_cdc_tx_handler().
>
>Can you explain why the code reached smc_cdc_tx_handler() with cdcpend->conn
>pointing to a connection that is already freed? I think if there is a bug it should
>be fixed instead of replacing the code by a new construct.
>
>Thoughts?

Yes, at first we do try to fix this on the original path, but finally failed.
that's why we turned into this way.

This bug can be reproduced in our environment pretty fast running the
following test:

server:
  smc_run nginx
client:
  while true; do smc_run wrk -c 1000 -t 4 -d 20 http://smc-server; done

The reason is smc_cdc_tx_handler() checks whether cdcpend->conn == NULL
or not, and will access to the connection if it's not NULL.
But for short TCP flows(transfered to SMC flow), it is likely to close()
the connection very soon. Since smc_cdc_tx_handler() is running in the
soft IRQ context, and close(2) is running in the process context.
There is a chance of race as describe below:

    smc_cdc_tx_handler()           |smc_release()
    if (!conn)                     |
                                   |
                                   |smc_cdc_tx_dismiss_slots()
                                   |      smc_cdc_tx_dismisser()
                                   |
                                   |sock_put(&smc->sk) <- last sock_put,
                                   |                      smc_sock freed
    bh_lock_sock(&smc->sk) (panic) |

If the check at the left passed, and then the application closed the
smc_socket and dismissed the cdcpend with smc_connection.
smc_cdc_tx_handler() still think the smc_sock is alive but actually
it's already released, and further bh_lock_sock(&smc->sk) will cause
the kernel panic.

As long as the check in smc_cdc_tx_handler() is not synchronized with
the smc_release(), this is inevitable. And unfortunately, I haven't
found a good way to synchronized them with the dismisser.

So my final solution is to remove the dissmiser and introduce the
refcount for tx cdc messages to protect the visit from
smc_cdc_tx_handler() from smc_release().
With the refcount, we can make sure the smc_conn_free() will wait for
all pending tx cdc messages done by the underlying RDMA device and
smc_cdc_tx_handler() is always safe to visit the smc_sock.


With this patchset, I run the wrk/nginx test cases for 2 days without
any panic(No link down/up is performed) any more.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready
  2021-12-30  3:02     ` dust.li
@ 2021-12-30 18:55       ` Karsten Graul
  2021-12-31  3:15         ` dust.li
  0 siblings, 1 reply; 13+ messages in thread
From: Karsten Graul @ 2021-12-30 18:55 UTC (permalink / raw)
  To: dust.li, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On 30/12/2021 04:02, dust.li wrote:
> On Wed, Dec 29, 2021 at 01:36:06PM +0100, Karsten Graul wrote:
>> On 28/12/2021 10:03, Dust Li wrote:
> I saw David has already applied this to net, should I send another
> patch to add some comments ?

You could send a follow-on patch with your additional information, which
I find is very helpful! Thanks.

-- 
Karsten

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready
  2021-12-30 18:55       ` Karsten Graul
@ 2021-12-31  3:15         ` dust.li
  2022-01-03 10:40           ` Karsten Graul
  0 siblings, 1 reply; 13+ messages in thread
From: dust.li @ 2021-12-31  3:15 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On Thu, Dec 30, 2021 at 07:55:01PM +0100, Karsten Graul wrote:
>On 30/12/2021 04:02, dust.li wrote:
>> On Wed, Dec 29, 2021 at 01:36:06PM +0100, Karsten Graul wrote:
>>> On 28/12/2021 10:03, Dust Li wrote:
>> I saw David has already applied this to net, should I send another
>> patch to add some comments ?
>
>You could send a follow-on patch with your additional information, which
>I find is very helpful! Thanks.

Sure, will do

>
>-- 
>Karsten

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH net] net/smc: add comments for smc_link_{usable|sendable}
  2021-12-28  9:03 ` [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready Dust Li
  2021-12-29 12:36   ` Karsten Graul
@ 2021-12-31  6:08   ` Dust Li
  2022-01-02 16:20     ` patchwork-bot+netdevbpf
  1 sibling, 1 reply; 13+ messages in thread
From: Dust Li @ 2021-12-31  6:08 UTC (permalink / raw)
  To: Karsten Graul, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

Add comments for both smc_link_sendable() and smc_link_usable()
to help better distinguish and use them.

No function changes.

Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc_core.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index d63b08274197..1e2926760eb0 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -407,7 +407,13 @@ static inline struct smc_connection *smc_lgr_find_conn(
 	return res;
 }
 
-/* returns true if the specified link is usable */
+/*
+ * Returns true if the specified link is usable.
+ *
+ * usable means the link is ready to receive RDMA messages, map memory
+ * on the link, etc. This doesn't ensure we are able to send RDMA messages
+ * on this link, if sending RDMA messages is needed, use smc_link_sendable()
+ */
 static inline bool smc_link_usable(struct smc_link *lnk)
 {
 	if (lnk->state == SMC_LNK_UNUSED || lnk->state == SMC_LNK_INACTIVE)
@@ -415,6 +421,15 @@ static inline bool smc_link_usable(struct smc_link *lnk)
 	return true;
 }
 
+/*
+ * Returns true if the specified link is ready to receive AND send RDMA
+ * messages.
+ *
+ * For the client side in first contact, the underlying QP may still in
+ * RESET or RTR when the link state is ACTIVATING, checks in smc_link_usable()
+ * is not strong enough. For those places that need to send any CDC or LLC
+ * messages, use smc_link_sendable(), otherwise, use smc_link_usable() instead
+ */
 static inline bool smc_link_sendable(struct smc_link *lnk)
 {
 	return smc_link_usable(lnk) &&
-- 
2.19.1.3.ge56e4f7


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH net] net/smc: add comments for smc_link_{usable|sendable}
  2021-12-31  6:08   ` [PATCH net] net/smc: add comments for smc_link_{usable|sendable} Dust Li
@ 2022-01-02 16:20     ` patchwork-bot+netdevbpf
  0 siblings, 0 replies; 13+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-01-02 16:20 UTC (permalink / raw)
  To: Dust Li; +Cc: kgraul, davem, kuba, linux-s390, netdev, guwen, tonylu

Hello:

This patch was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Fri, 31 Dec 2021 14:08:53 +0800 you wrote:
> Add comments for both smc_link_sendable() and smc_link_usable()
> to help better distinguish and use them.
> 
> No function changes.
> 
> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
> 
> [...]

Here is the summary with links:
  - [net] net/smc: add comments for smc_link_{usable|sendable}
    https://git.kernel.org/netdev/net-next/c/1f52a9380ff1

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready
  2021-12-31  3:15         ` dust.li
@ 2022-01-03 10:40           ` Karsten Graul
  0 siblings, 0 replies; 13+ messages in thread
From: Karsten Graul @ 2022-01-03 10:40 UTC (permalink / raw)
  To: dust.li, David S . Miller, Jakub Kicinski
  Cc: linux-s390, netdev, Wen Gu, Tony Lu

On 31/12/2021 04:15, dust.li wrote:
> On Thu, Dec 30, 2021 at 07:55:01PM +0100, Karsten Graul wrote:
>> On 30/12/2021 04:02, dust.li wrote:
>>> On Wed, Dec 29, 2021 at 01:36:06PM +0100, Karsten Graul wrote:
>>>> On 28/12/2021 10:03, Dust Li wrote:
>>> I saw David has already applied this to net, should I send another
>>> patch to add some comments ?
>>
>> You could send a follow-on patch with your additional information, which
>> I find is very helpful! Thanks.
> 
> Sure, will do

Thank you!

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-01-03 10:40 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-28  9:03 [PATCH net 0/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
2021-12-28  9:03 ` [PATCH net 1/2] net/smc: don't send CDC/LLC message if link not ready Dust Li
2021-12-29 12:36   ` Karsten Graul
2021-12-30  3:02     ` dust.li
2021-12-30 18:55       ` Karsten Graul
2021-12-31  3:15         ` dust.li
2022-01-03 10:40           ` Karsten Graul
2021-12-31  6:08   ` [PATCH net] net/smc: add comments for smc_link_{usable|sendable} Dust Li
2022-01-02 16:20     ` patchwork-bot+netdevbpf
2021-12-28  9:03 ` [PATCH net 2/2] net/smc: fix kernel panic caused by race of smc_sock Dust Li
2021-12-29 12:33   ` Karsten Graul
2021-12-30  3:46     ` dust.li
2021-12-28 12:50 ` [PATCH net 0/2] " patchwork-bot+netdevbpf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.