Re: bug report for rdma_rxe

From: Jason Gunthorpe <jgg@nvidia.com>
To: Bob Pearson <rpearsonhpe@gmail.com>
Cc: Yanjun Zhu <yanjun.zhu@linux.dev>,
	Zhu Yanjun <zyjzyj2000@gmail.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>
Subject: Re: bug report for rdma_rxe
Date: Tue, 26 Apr 2022 08:42:31 -0300
Message-ID: <20220426114231.GI2125828@nvidia.com>
In-Reply-To: <2f84097e-b31c-52b4-80b3-9e275a3b83bc@gmail.com>

On Mon, Apr 25, 2022 at 08:40:30PM -0500, Bob Pearson wrote:
> On 4/25/22 17:58, Jason Gunthorpe wrote:

> Imagine a very long RDMA read operation that times out several times before finally
> getting all the data returned to the requester. Now imagine it is followed by some
> small RDMA ops to a different node that use fast reg MRs and are executed by the
> other node after receiving a small control message. E.g.
> 
> 	node1					node2					node3
> 
> 1:	Send: RDMA_READ(mr1 to node2)
> 						RDMA_READ_REPLY(mr1@node1, 1of2)
> 	ib_map_mr_sg(mr2a local)
> 	Send: IB_WR_REG_MR(mr2a local)
> 	Send: Control msg (mr2a to node3)
> 											Send: RDMA_WRITE(mr2a@node1)
> 	Send: IB_WR_LOCAL_INV(mr2a local)
> 	ib_update_fast_reg_key(mr2a->mr2b)
> 	ib_map_mr_sg(mr2b local)
> 	Send: Control msg (mr2b to node3)
> 											Send: RDMA_WRITE(mr2b@node1)
> 	Timeout: replay from 1 (w/o local ops)
> 	Send: RDMA_READ(mr1 to node2)
> 						RDMA_READ_REPLY(mr1@node1, 2of2)
> 	Send: Control msg (mr2a to node3)
> 											Send: RDMA_WRITE(mr2a@node1)
> 											FAILS because mr2a has been
> 											replaced by mr2b.
> On the other hand if we replay the REG_MR local command that won't work either
> because we didn't know to rerun the ib_map_mr_sg() call.

How did you get two destination nodes into an RC send queue? We have
SRQ not SSQ.

In any event, the above is a buggy ULP. The IB_WR_LOCAL_INV cannot be
posted until the CQE for the Send with mr2a is received (or possibly a
strong fence is used).

It follows the general rule that the ULP cannot alter the data memory
under a WQE until it sees the CQE for that WQE, and so knows the NIC
has finished with the memory.
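
As a rough illustration of that rule, here is a toy user-space model.
It is not rxe or verbs code: every toy_* name, the fixed queue depth
and the integer keys are made up for this sketch. It only shows the
ordering a ULP would have to enforce, refusing a local invalidate while
any earlier WQE that references the same MR has no CQE yet.

```c
#include <stdbool.h>

/* Toy model only: none of these names are kernel or verbs APIs. */
enum toy_op { TOY_REG_MR, TOY_SEND, TOY_LOCAL_INV };

struct toy_wqe {
	enum toy_op op;
	int mr_key;      /* MR this WQE references (hypothetical key) */
	bool completed;  /* true once the ULP has seen the CQE */
};

#define TOY_SQ_DEPTH 16

struct toy_sq {
	struct toy_wqe wqe[TOY_SQ_DEPTH];
	int count;
};

/* True if any posted WQE still references mr_key without a CQE. */
static bool toy_mr_in_flight(const struct toy_sq *sq, int mr_key)
{
	for (int i = 0; i < sq->count; i++)
		if (sq->wqe[i].mr_key == mr_key && !sq->wqe[i].completed)
			return true;
	return false;
}

/*
 * Post a WQE and return its index, or -1 if the queue is full or the
 * request is a LOCAL_INV on an MR that is still in flight.
 */
static int toy_post(struct toy_sq *sq, enum toy_op op, int mr_key)
{
	if (sq->count == TOY_SQ_DEPTH)
		return -1;
	if (op == TOY_LOCAL_INV && toy_mr_in_flight(sq, mr_key))
		return -1;
	sq->wqe[sq->count] = (struct toy_wqe){
		.op = op, .mr_key = mr_key, .completed = false,
	};
	return sq->count++;
}

/* Record that the ULP has polled the CQE for WQE idx. */
static void toy_complete(struct toy_sq *sq, int idx)
{
	if (idx >= 0 && idx < sq->count)
		sq->wqe[idx].completed = true;
}
```

In this model, posting TOY_REG_MR and TOY_SEND for one key and then
TOY_LOCAL_INV for the same key fails until toy_complete() has been
called for both earlier WQEs. In a real ULP the equivalent of
toy_complete() would be driven from the send completion handler.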

Jason