bug report for rdma_rxe

* bug report for rdma_rxe
@ 2022-04-22 21:04 Bob Pearson
  2022-04-23  1:54 ` Bob Pearson
  2022-04-25  0:04 ` Yanjun Zhu
  0 siblings, 2 replies; 10+ messages in thread
From: Bob Pearson @ 2022-04-22 21:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun, linux-rdma

Local operations in the rdma_rxe driver are not obviously idempotent. But, the
RC retry mechanism backs up the send queue to the point of the wqe that is
currently being acknowledged and re-walks the sq. Each send or write operation is
retried with the exception that the first one is truncated by the packets already
having been acknowledged. Each read and atomic operation is resent except that
read data already received in the first wqe is not requested. But all the
local operations are replayed. The problem is local invalidate which is destructive.
For example

sq:	some operation that times out
	bind mw to mr
	some other operation
	invalidate mw
	invalidate mr

can't be replayed because invalidating the mr makes the second bind fail.
There are lots of other examples where things go wrong.

To make things worse the send queue timer is never cleared and for typical
timeout values goes off every few msec whether anything actually failed.

Bob

^ permalink raw reply	[flat|nested] 10+ messages in thread