From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: "Håkon Bugge" <haakon.bugge@oracle.com>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	"Sasha Levin" <sashal@kernel.org>,
	linux-rdma@vger.kernel.org
Subject: [PATCH AUTOSEL 4.19 18/25] RDMA/core/sa_query: Retry SA queries
Date: Thu,  9 Sep 2021 20:22:26 -0400
Message-ID: <20210910002234.176125-18-sashal@kernel.org>
In-Reply-To: <20210910002234.176125-1-sashal@kernel.org>

From: Håkon Bugge <haakon.bugge@oracle.com>

[ Upstream commit 5f5a650999d5718af766fc70a120230b04235a6f ]

A MAD packet is sent as an unreliable datagram (UD). SA requests are sent
as MAD packets. As such, SA requests or responses may be silently dropped.

IB Core's MAD layer has a timeout and retry mechanism which is used by,
among others, RDMA CM. But it is not used by SA queries. The lack of
retries for SA queries leads to long specified timeouts and to errors
being returned in case of packet loss. The ULP or user-land process has
to perform the retry.

Fix this by taking advantage of the MAD layer's retry mechanism.

First, a check against a zero timeout is added in rdma_resolve_route().
Then, in send_mad(), the MAD layer timeout is set to one tenth of the
specified timeout and the number of retries to 10. In the special case
where the specified timeout is less than 10 ms, the MAD layer timeout is
set to 1 ms and the number of retries to the specified timeout.
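
The arithmetic described above can be sketched in user space as follows.
This is only an illustration, not the kernel code: struct sa_retry_params
and split_sa_timeout() are hypothetical names introduced here, and the
assert() stands in for the assumption that a zero timeout has already
been rejected by the caller.

```c
#include <assert.h>

/* Hypothetical container; the kernel stores these in the MAD send buffer. */
struct sa_retry_params {
	int timeout_ms;	/* per-attempt timeout handed to the MAD layer */
	int retries;	/* retry count handed to the MAD layer */
};

/* Hypothetical helper mirroring the arithmetic added to send_mad(). */
static struct sa_retry_params split_sa_timeout(int timeout_ms)
{
	const int nmbr_sa_query_retries = 10;
	struct sa_retry_params p;

	/* Assumption of this sketch: zero (or negative) was rejected earlier. */
	assert(timeout_ms > 0);

	p.timeout_ms = timeout_ms / nmbr_sa_query_retries;
	p.retries = nmbr_sa_query_retries;
	if (!p.timeout_ms) {
		/* Very small timeout_ms (1..9): 1 ms per attempt. */
		p.timeout_ms = 1;
		p.retries = timeout_ms;
	}
	return p;
}
```

For example, split_sa_timeout(2000) gives a 200 ms MAD timeout with 10
retries, while split_sa_timeout(7) gives a 1 ms MAD timeout with 7
retries.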

With this fix:

 # ucmatose -c 1000 -S 1024 -C 1

runs stably on an InfiniBand fabric. Without this fix, we see
intermittent failures, and it errors out with:

cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110

(110 is ETIMEDOUT)

Link: https://lore.kernel.org/r/1628784755-28316-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/infiniband/core/cma.c      | 3 +++
 drivers/infiniband/core/sa_query.c | 9 ++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 842a30947bdc..f3a0745c1b06 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2776,6 +2776,9 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
 	struct rdma_id_private *id_priv;
 	int ret;
 
+	if (!timeout_ms)
+		return -EINVAL;
+
 	id_priv = container_of(id, struct rdma_id_private, id);
 	if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_RESOLVED, RDMA_CM_ROUTE_QUERY))
 		return -EINVAL;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 9881e6fa9fe4..251772737764 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -1413,6 +1413,7 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 	bool preload = gfpflags_allow_blocking(gfp_mask);
 	unsigned long flags;
 	int ret, id;
+	const int nmbr_sa_query_retries = 10;
 
 	if (preload)
 		idr_preload(gfp_mask);
@@ -1426,7 +1427,13 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 	if (id < 0)
 		return id;
 
-	query->mad_buf->timeout_ms  = timeout_ms;
+	query->mad_buf->timeout_ms  = timeout_ms / nmbr_sa_query_retries;
+	query->mad_buf->retries = nmbr_sa_query_retries;
+	if (!query->mad_buf->timeout_ms) {
+		/* Special case, very small timeout_ms */
+		query->mad_buf->timeout_ms = 1;
+		query->mad_buf->retries = timeout_ms;
+	}
 	query->mad_buf->context[0] = query;
 	query->id = id;
 
-- 
2.30.2

