From: Manjunath Patil <manjunath.b.patil@oracle.com>
To: dledford@redhat.com, jgg@ziepe.ca
Cc: manjunath.b.patil@oracle.com, linux-rdma@vger.kernel.org,
linux-kernel@vger.kernel.org, rama.nichanamatlu@oracle.com
Subject: [PATCH v3] RDMA/cm: add timeout to cm_destroy_id wait
Date: Fri, 8 Mar 2024 22:33:23 -0800 [thread overview]
Message-ID: <20240309063323.458102-1-manjunath.b.patil@oracle.com> (raw)
Add timeout to cm_destroy_id, so that userspace can trigger any data
collection that would help in analyzing the cause of delay in destroying
the cm_id.
New noinline function helps dtrace/ebpf programs to hook on to it.
Existing functionality isn't changed except triggering a probe-able new
function at every timeout interval.
We have seen cases where CM messages stuck with MAD layer (either due to
software bug or faulty HCA), leading to cm_id getting stuck in the
following call stack. This patch helps in resolving such issues faster.
kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
...
Call Trace:
__schedule+0x2bc/0x895
schedule+0x36/0x7c
schedule_timeout+0x1f6/0x31f
? __slab_free+0x19c/0x2ba
wait_for_completion+0x12b/0x18a
? wake_up_q+0x80/0x73
cm_destroy_id+0x345/0x610 [ib_cm]
ib_destroy_cm_id+0x10/0x20 [ib_cm]
rdma_destroy_id+0xa8/0x300 [rdma_cm]
ucma_destroy_id+0x13e/0x190 [rdma_ucm]
ucma_write+0xe0/0x160 [rdma_ucm]
__vfs_write+0x3a/0x16d
vfs_write+0xb2/0x1a1
? syscall_trace_enter+0x1ce/0x2b8
SyS_write+0x5c/0xd3
do_syscall_64+0x79/0x1b9
entry_SYSCALL_64_after_hwframe+0x16d/0x0
Orabug: 36280065
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
---
v3:
- added #define
v2:
- removed sysctl related code
drivers/infiniband/core/cm.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index ff58058aeadc..793103cf8152 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -34,6 +34,7 @@ MODULE_AUTHOR("Sean Hefty");
MODULE_DESCRIPTION("InfiniBand CM");
MODULE_LICENSE("Dual BSD/GPL");
+#define CM_DESTORY_ID_WAIT_TIMEOUT 10000 /* msecs */
static const char * const ibcm_rej_reason_strs[] = {
[IB_CM_REJ_NO_QP] = "no QP",
[IB_CM_REJ_NO_EEC] = "no EEC",
@@ -1025,10 +1026,20 @@ static void cm_reset_to_idle(struct cm_id_private *cm_id_priv)
}
}
+static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id)
+{
+ struct cm_id_private *cm_id_priv;
+
+ cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+ pr_err("%s: cm_id=%p timed out. state=%d refcnt=%d\n", __func__,
+ cm_id, cm_id->state, refcount_read(&cm_id_priv->refcount));
+}
+
static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
{
struct cm_id_private *cm_id_priv;
struct cm_work *work;
+ int ret;
cm_id_priv = container_of(cm_id, struct cm_id_private, id);
spin_lock_irq(&cm_id_priv->lock);
@@ -1135,7 +1146,14 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
xa_erase(&cm.local_id_table, cm_local_id(cm_id->local_id));
cm_deref_id(cm_id_priv);
- wait_for_completion(&cm_id_priv->comp);
+ do {
+ ret = wait_for_completion_timeout(&cm_id_priv->comp,
+ msecs_to_jiffies(
+ CM_DESTORY_ID_WAIT_TIMEOUT));
+ if (!ret) /* timeout happened */
+ cm_destroy_id_wait_timeout(cm_id);
+ } while (!ret);
+
while ((work = cm_dequeue_work(cm_id_priv)) != NULL)
cm_free_work(work);
--
2.31.1
next reply other threads:[~2024-03-09 6:33 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-09 6:33 Manjunath Patil [this message]
2024-03-10 11:17 ` [PATCH v3] RDMA/cm: add timeout to cm_destroy_id wait Leon Romanovsky
2024-03-11 4:13 ` Manjunath Patil
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240309063323.458102-1-manjunath.b.patil@oracle.com \
--to=manjunath.b.patil@oracle.com \
--cc=dledford@redhat.com \
--cc=jgg@ziepe.ca \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=rama.nichanamatlu@oracle.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.