From: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Mike Marciniszyn
	<mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
	Jianxin Xiong
	<jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Subject: [PATCH 19/28] IB/rdmavt, hfi1: Fix NFSoRDMA failure with FRMR enabled
Date: Mon, 25 Jul 2016 13:39:45 -0700
Message-ID: <20160725203944.4800.44029.stgit@scvm10.sc.intel.com>
In-Reply-To: <20160725203554.4800.37248.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

From: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

A hang has been observed while writing a file over NFSoRDMA. dmesg on
the server shows messages like these:

[  931.992501] svcrdma: Error -22 posting RDMA_READ
[  952.076879] svcrdma: Error -22 posting RDMA_READ
[  982.154127] svcrdma: Error -22 posting RDMA_READ
[ 1012.235884] svcrdma: Error -22 posting RDMA_READ
[ 1042.319194] svcrdma: Error -22 posting RDMA_READ

Here is why:

With the base memory management extension enabled, FRMR is used instead
of FMR. The xprtrdma server issues each RDMA read request as the following
bundle:

(1) IB_WR_REG_MR, signaled;
(2) IB_WR_RDMA_READ, signaled;
(3) IB_WR_LOCAL_INV, signaled and fenced.

All three requests are signaled. To generate a completion, the fast
register work request is posted to the work queue and only processed
later by the hfi1 send engine, so the corresponding lkey is not valid
until then. However, the rdmavt driver validates the lkey when the RDMA
read request is posted, so that post fails immediately with -EINVAL
(-22).
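
The ordering problem described above can be illustrated with a small
standalone model. This is only a sketch under simplifying assumptions:
the model_* names, the single lkey_valid flag, and the queue counter are
hypothetical and are not the rdmavt data structures.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified opcodes; not the ib_wr_opcode values. */
enum model_opcode { MODEL_REG_MR, MODEL_RDMA_READ, MODEL_LOCAL_INV };

/* Hypothetical QP state: one key and a count of queued requests. */
struct model_qp {
	bool lkey_valid;	/* true once the fast register has run */
	int queued;		/* requests waiting for the send engine */
};

/*
 * Model of posting a signaled request before this patch: a signaled
 * fast register is only queued, so the key is still invalid when the
 * following RDMA read is validated at post time.
 */
static int model_post_signaled(struct model_qp *qp, enum model_opcode op)
{
	if (op == MODEL_RDMA_READ && !qp->lkey_valid)
		return -EINVAL;	/* lkey check at post time fails */
	qp->queued++;		/* processed later by the send engine */
	return 0;
}
```

In this model, posting MODEL_REG_MR succeeds but leaves lkey_valid
false, so the MODEL_RDMA_READ that follows is rejected with -EINVAL,
which corresponds to the "Error -22" lines in the log.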

This patch changes the handling of local operations (fast register and
local invalidate). Fast register work requests are now always processed
immediately at post time, so that the corresponding lkey is valid when
subsequent work requests are posted. Local invalidate requests are
processed immediately only if fencing is not required and no previous
local invalidate request is pending.
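
The post-time decision can be summarized by the following sketch. It
assumes the immediate operation succeeds (on error the post simply
returns the error), and the model_* names are hypothetical stand-ins
rather than rdmavt symbols.

```c
#include <stdbool.h>

/* Hypothetical names for the two local operations. */
enum model_local_op { MODEL_REG_MR, MODEL_LOCAL_INV };

enum model_action {
	MODEL_DONE,		/* processed at post time, nothing queued */
	MODEL_COMPLETION_ONLY,	/* processed at post time, queued only so
				 * that a completion is generated */
	MODEL_DELAYED,		/* queued; the send engine does the work */
};

/*
 * Classify a local operation at post time, assuming the immediate
 * operation itself does not fail.
 */
static enum model_action model_classify(enum model_local_op op,
					bool signaled, bool fenced,
					int local_ops_pending)
{
	/* Only a local invalidate can be deferred to the send engine. */
	if (op == MODEL_LOCAL_INV && (fenced || local_ops_pending > 0))
		return MODEL_DELAYED;

	/* Everything else has already been processed at this point. */
	return signaled ? MODEL_COMPLETION_ONLY : MODEL_DONE;
}
```

For the bundle above, the signaled fast register is classified as
MODEL_COMPLETION_ONLY and the signaled, fenced local invalidate as
MODEL_DELAYED.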

To allow completions to be generated for signaled local operations that
have already been processed at post time, an internal send flag,
RVT_SEND_COMPLETION_ONLY, is added. The request is still posted to the
work queue with this flag set; the hfi1 send engine checks the flag and
only generates the completion for such requests.
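
A rough model of the send-engine side is sketched below. The flag
value, the structures, and the invalidate_fails field are assumptions
made for illustration; only the control flow mirrors what the patch
does for local operations.

```c
#include <stdbool.h>

/* Hypothetical bit; the real flag is derived from IB_SEND_RESERVED_START. */
#define MODEL_SEND_COMPLETION_ONLY (1u << 0)

enum model_status { MODEL_WC_SUCCESS, MODEL_WC_LOC_PROT_ERR };

struct model_wqe {
	unsigned int send_flags;
	bool invalidate_fails;	/* pretend result of a delayed invalidate */
};

struct model_qp {
	int local_ops_pending;	/* delayed local operations still queued */
	int invalidates_run;	/* invalidates done by the send engine */
};

/*
 * Handle one local-operation WQE in the send engine and return the
 * completion status. A "completion only" entry was already processed
 * at post time, so no work is done and the pending count is untouched.
 */
static enum model_status model_send_engine(struct model_qp *qp,
					   const struct model_wqe *wqe)
{
	enum model_status status = MODEL_WC_SUCCESS;

	if (!(wqe->send_flags & MODEL_SEND_COMPLETION_ONLY)) {
		/* Delayed local invalidate: do the work now. */
		qp->invalidates_run++;
		if (wqe->invalidate_fails)
			status = MODEL_WC_LOC_PROT_ERR;
		qp->local_ops_pending--;
	}
	return status;
}
```

Keeping the pending counter tied to delayed operations only is what
lets later unfenced local invalidates be processed at post time again
once the delayed ones have drained.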

Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/rc.c   |   17 +++++++------
 drivers/infiniband/hw/hfi1/ruc.c  |   13 +++++-----
 drivers/infiniband/hw/hfi1/uc.c   |   15 ++++++------
 drivers/infiniband/sw/rdmavt/qp.c |   48 +++++++++++++++++++++++++------------
 include/rdma/rdmavt_qp.h          |    1 +
 5 files changed, 56 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/rc.c b/drivers/infiniband/hw/hfi1/rc.c
index 0bc43b6..5da190e 100644
--- a/drivers/infiniband/hw/hfi1/rc.c
+++ b/drivers/infiniband/hw/hfi1/rc.c
@@ -402,7 +402,6 @@ int hfi1_make_rc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 	char newreq;
 	int middle = 0;
 	int delta;
-	int err;
 
 	ps->s_txreq = get_txreq(ps->dev, qp);
 	if (IS_ERR(ps->s_txreq))
@@ -484,25 +483,27 @@ int hfi1_make_rc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 			 */
 			if (wqe->wr.opcode == IB_WR_REG_MR ||
 			    wqe->wr.opcode == IB_WR_LOCAL_INV) {
+				int local_ops = 0;
+				int err = 0;
+
 				if (qp->s_last != qp->s_cur)
 					goto bail;
 				if (++qp->s_cur == qp->s_size)
 					qp->s_cur = 0;
 				if (++qp->s_tail == qp->s_size)
 					qp->s_tail = 0;
-				if (wqe->wr.opcode == IB_WR_REG_MR)
-					err = rvt_fast_reg_mr(
-						qp, wqe->reg_wr.mr,
-						wqe->reg_wr.key,
-						wqe->reg_wr.access);
-				else
+				if (!(wqe->wr.send_flags &
+				      RVT_SEND_COMPLETION_ONLY)) {
 					err = rvt_invalidate_rkey(
 						qp,
 						wqe->wr.ex.invalidate_rkey);
+					local_ops = 1;
+				}
 				hfi1_send_complete(qp, wqe,
 						   err ? IB_WC_LOC_PROT_ERR
 						       : IB_WC_SUCCESS);
-				atomic_dec(&qp->local_ops_pending);
+				if (local_ops)
+					atomic_dec(&qp->local_ops_pending);
 				qp->s_hdrwords = 0;
 				goto done_free_tx;
 			}
diff --git a/drivers/infiniband/hw/hfi1/ruc.c b/drivers/infiniband/hw/hfi1/ruc.c
index 76b9c9e..7e76d33 100644
--- a/drivers/infiniband/hw/hfi1/ruc.c
+++ b/drivers/infiniband/hw/hfi1/ruc.c
@@ -442,16 +442,15 @@ again:
 	sqp->s_len = wqe->length;
 	switch (wqe->wr.opcode) {
 	case IB_WR_REG_MR:
-		if (rvt_fast_reg_mr(sqp, wqe->reg_wr.mr, wqe->reg_wr.key,
-				    wqe->reg_wr.access))
-			send_status = IB_WC_LOC_PROT_ERR;
-		local_ops = 1;
 		goto send_comp;
 
 	case IB_WR_LOCAL_INV:
-		if (rvt_invalidate_rkey(sqp, wqe->wr.ex.invalidate_rkey))
-			send_status = IB_WC_LOC_PROT_ERR;
-		local_ops = 1;
+		if (!(wqe->wr.send_flags & RVT_SEND_COMPLETION_ONLY)) {
+			if (rvt_invalidate_rkey(sqp,
+						wqe->wr.ex.invalidate_rkey))
+				send_status = IB_WC_LOC_PROT_ERR;
+			local_ops = 1;
+		}
 		goto send_comp;
 
 	case IB_WR_SEND_WITH_INV:
diff --git a/drivers/infiniband/hw/hfi1/uc.c b/drivers/infiniband/hw/hfi1/uc.c
index ef6c96c..a726d96 100644
--- a/drivers/infiniband/hw/hfi1/uc.c
+++ b/drivers/infiniband/hw/hfi1/uc.c
@@ -77,7 +77,6 @@ int hfi1_make_uc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 	u32 len;
 	u32 pmtu = qp->pmtu;
 	int middle = 0;
-	int err;
 
 	ps->s_txreq = get_txreq(ps->dev, qp);
 	if (IS_ERR(ps->s_txreq))
@@ -125,20 +124,22 @@ int hfi1_make_uc_req(struct rvt_qp *qp, struct hfi1_pkt_state *ps)
 		 */
 		if (wqe->wr.opcode == IB_WR_REG_MR ||
 		    wqe->wr.opcode == IB_WR_LOCAL_INV) {
+			int local_ops = 0;
+			int err = 0;
+
 			if (qp->s_last != qp->s_cur)
 				goto bail;
 			if (++qp->s_cur == qp->s_size)
 				qp->s_cur = 0;
-			if (wqe->wr.opcode == IB_WR_REG_MR)
-				err = rvt_fast_reg_mr(qp, wqe->reg_wr.mr,
-						      wqe->reg_wr.key,
-						      wqe->reg_wr.access);
-			else
+			if (!(wqe->wr.send_flags & RVT_SEND_COMPLETION_ONLY)) {
 				err = rvt_invalidate_rkey(
 					qp, wqe->wr.ex.invalidate_rkey);
+				local_ops = 1;
+			}
 			hfi1_send_complete(qp, wqe, err ? IB_WC_LOC_PROT_ERR
 							: IB_WC_SUCCESS);
-			atomic_dec(&qp->local_ops_pending);
+			if (local_ops)
+				atomic_dec(&qp->local_ops_pending);
 			qp->s_hdrwords = 0;
 			goto done_free_tx;
 		}
diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c
index 218494c..8ccf1b9 100644
--- a/drivers/infiniband/sw/rdmavt/qp.c
+++ b/drivers/infiniband/sw/rdmavt/qp.c
@@ -1579,6 +1579,7 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	int ret;
 	size_t cplen;
 	bool reserved_op;
+	int local_ops_delayed = 0;
 
 	BUILD_BUG_ON(IB_QPT_MAX >= (sizeof(u32) * BITS_PER_BYTE));
 
@@ -1592,25 +1593,37 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	cplen = ret;
 
 	/*
-	 * Local operations including fast register and local invalidate
-	 * can be processed immediately w/o being posted to the send queue
-	 * if neither fencing nor completion generation is needed. However,
-	 * once fencing or completion is requested, direct processing of
-	 * following local operations must be disabled until all the local
-	 * operations posted to the send queue have completed. This is
-	 * necessary to ensure the correct ordering.
+	 * Local operations include fast register and local invalidate.
+	 * Fast register needs to be processed immediately because the
+	 * registered lkey may be used by following work requests and the
+	 * lkey needs to be valid at the time those requests are posted.
+	 * Local invalidate can be processed immediately if fencing is
+	 * not required and no previous local invalidate ops are pending.
+	 * Signaled local operations that have been processed immediately
+	 * need to have requests with "completion only" flags set posted
+	 * to the send queue in order to generate completions.
 	 */
-	if ((rdi->post_parms[wr->opcode].flags & RVT_OPERATION_LOCAL) &&
-	    !(wr->send_flags & (IB_SEND_FENCE | IB_SEND_SIGNALED)) &&
-	    !atomic_read(&qp->local_ops_pending)) {
-		struct ib_reg_wr *reg = reg_wr(wr);
-
+	if ((rdi->post_parms[wr->opcode].flags & RVT_OPERATION_LOCAL)) {
 		switch (wr->opcode) {
 		case IB_WR_REG_MR:
-			return rvt_fast_reg_mr(qp, reg->mr, reg->key,
-					       reg->access);
+			ret = rvt_fast_reg_mr(qp,
+					      reg_wr(wr)->mr,
+					      reg_wr(wr)->key,
+					      reg_wr(wr)->access);
+			if (ret || !(wr->send_flags & IB_SEND_SIGNALED))
+				return ret;
+			break;
 		case IB_WR_LOCAL_INV:
-			return rvt_invalidate_rkey(qp, wr->ex.invalidate_rkey);
+			if ((wr->send_flags & IB_SEND_FENCE) ||
+			    atomic_read(&qp->local_ops_pending)) {
+				local_ops_delayed = 1;
+			} else {
+				ret = rvt_invalidate_rkey(
+					qp, wr->ex.invalidate_rkey);
+				if (ret || !(wr->send_flags & IB_SEND_SIGNALED))
+					return ret;
+			}
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -1675,7 +1688,10 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	}
 
 	if (rdi->post_parms[wr->opcode].flags & RVT_OPERATION_LOCAL) {
-		atomic_inc(&qp->local_ops_pending);
+		if (local_ops_delayed)
+			atomic_inc(&qp->local_ops_pending);
+		else
+			wqe->wr.send_flags |= RVT_SEND_COMPLETION_ONLY;
 		wqe->ssn = 0;
 		wqe->psn = 0;
 		wqe->lpsn = 0;
diff --git a/include/rdma/rdmavt_qp.h b/include/rdma/rdmavt_qp.h
index 56adcfc..13902dd 100644
--- a/include/rdma/rdmavt_qp.h
+++ b/include/rdma/rdmavt_qp.h
@@ -148,6 +148,7 @@
  * Internal send flags
  */
 #define RVT_SEND_RESERVE_USED           IB_SEND_RESERVED_START
+#define RVT_SEND_COMPLETION_ONLY	(IB_SEND_RESERVED_START << 1)
 
 /*
  * Send work request queue entry.
