* bug report for rdma_rxe
@ 2022-04-22 21:04 Bob Pearson
  2022-04-23  1:54 ` Bob Pearson
  2022-04-25  0:04 ` Yanjun Zhu
  0 siblings, 2 replies; 10+ messages in thread
From: Bob Pearson @ 2022-04-22 21:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun, linux-rdma

Local operations in the rdma_rxe driver are not obviously idempotent. But the
RC retry mechanism backs up the send queue to the wqe that is currently being
acknowledged and re-walks the sq. Each send or write operation is retried,
except that the first one is truncated by the packets that have already been
acknowledged. Each read and atomic operation is resent, except that read data
already received for the first wqe is not requested again. But all of the
local operations are replayed. The problem is local invalidate, which is destructive.
For example

sq:	some operation that times out
	bind mw to mr
	some other operation
	invalidate mw
	invalidate mr

can't be replayed, because invalidating the mr makes the replayed bind fail.
There are lots of other examples where things go wrong.

To make things worse, the send queue timer is never cleared, and for typical
timeout values it goes off every few msec whether or not anything actually failed.
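
The arming logic in the rxe requester looks roughly like this (a paraphrased
sketch from memory, not the exact source; field names approximate):

	/* arm the retry timer when a request packet is sent, but only
	 * if one is not already pending
	 */
	if (qp->timeout_jiffies && !timer_pending(&qp->retrans_timer))
		mod_timer(&qp->retrans_timer,
			  jiffies + qp->timeout_jiffies);

Nothing ever deletes the timer when the work completes, so it always fires
and kicks off a retry flow.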

Bob


* Re: bug report for rdma_rxe
  2022-04-22 21:04 bug report for rdma_rxe Bob Pearson
@ 2022-04-23  1:54 ` Bob Pearson
  2022-04-25  0:04 ` Yanjun Zhu
  1 sibling, 0 replies; 10+ messages in thread
From: Bob Pearson @ 2022-04-23  1:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun, linux-rdma

On 4/22/22 16:04, Bob Pearson wrote:
> Local operations in the rdma_rxe driver are not obviously idempotent. But the
> RC retry mechanism backs up the send queue to the wqe that is currently being
> acknowledged and re-walks the sq. Each send or write operation is retried,
> except that the first one is truncated by the packets that have already been
> acknowledged. Each read and atomic operation is resent, except that read data
> already received for the first wqe is not requested again. But all of the
> local operations are replayed. The problem is local invalidate, which is destructive.
> For example
> 
> sq:	some operation that times out
> 	bind mw to mr
> 	some other operation
> 	invalidate mw
> 	invalidate mr
> 
> can't be replayed, because invalidating the mr makes the replayed bind fail.
> There are lots of other examples where things go wrong.
> 
> To make things worse, the send queue timer is never cleared, and for typical
> timeout values it goes off every few msec whether or not anything actually failed.
> 
> Bob

This looks like an unholy mess. The reason I was looking at it is that Lustre
on rxe doesn't work at the moment, and the problems were traced to retry flows (on a very
reliable network) caused by stray timeouts. We see local_invalidate_mr operations
getting retried multiple times, and not all of them succeed, because the caller
is remapping the fast MR in the meantime and changing the rkey.

Bob


* Re: bug report for rdma_rxe
  2022-04-22 21:04 bug report for rdma_rxe Bob Pearson
  2022-04-23  1:54 ` Bob Pearson
@ 2022-04-25  0:04 ` Yanjun Zhu
  2022-04-25 16:58   ` Bob Pearson
  1 sibling, 1 reply; 10+ messages in thread
From: Yanjun Zhu @ 2022-04-25  0:04 UTC (permalink / raw)
  To: Bob Pearson, Jason Gunthorpe, Zhu Yanjun, linux-rdma

On 2022/4/23 5:04, Bob Pearson wrote:
> Local operations in the rdma_rxe driver are not obviously idempotent. But the
> RC retry mechanism backs up the send queue to the wqe that is currently being
> acknowledged and re-walks the sq. Each send or write operation is retried,
> except that the first one is truncated by the packets that have already been
> acknowledged. Each read and atomic operation is resent, except that read data
> already received for the first wqe is not requested again. But all of the
> local operations are replayed. The problem is local invalidate, which is destructive.
> For example

Is there any example or just your analysis?
You know, your analysis is not always correct.
To support your analysis, please show us a solid example.

Zhu Yanjun

> 
> sq:	some operation that times out
> 	bind mw to mr
> 	some other operation
> 	invalidate mw
> 	invalidate mr
> 
> can't be replayed, because invalidating the mr makes the replayed bind fail.
> There are lots of other examples where things go wrong.
> 
> To make things worse, the send queue timer is never cleared, and for typical
> timeout values it goes off every few msec whether or not anything actually failed.
> 
> Bob



* Re: bug report for rdma_rxe
  2022-04-25  0:04 ` Yanjun Zhu
@ 2022-04-25 16:58   ` Bob Pearson
  2022-04-25 22:58     ` Jason Gunthorpe
  2022-04-26  3:10     ` Zhu Yanjun
  0 siblings, 2 replies; 10+ messages in thread
From: Bob Pearson @ 2022-04-25 16:58 UTC (permalink / raw)
  To: Yanjun Zhu, Jason Gunthorpe, Zhu Yanjun, linux-rdma

On 4/24/22 19:04, Yanjun Zhu wrote:
> On 2022/4/23 5:04, Bob Pearson wrote:
>> Local operations in the rdma_rxe driver are not obviously idempotent. But the
>> RC retry mechanism backs up the send queue to the wqe that is currently being
>> acknowledged and re-walks the sq. Each send or write operation is retried,
>> except that the first one is truncated by the packets that have already been
>> acknowledged. Each read and atomic operation is resent, except that read data
>> already received for the first wqe is not requested again. But all of the
>> local operations are replayed. The problem is local invalidate, which is destructive.
>> For example
> 
> Is there any example or just your analysis?

I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
highly reliable network, so they do not expect to see dropped or out-of-order packets. But they
see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
and verified that the timeout code in rxe behaves as follows: when a new RC operation is
sent, the retry timer is modified to go off at jiffies + qp->timeout_jiffies, but only if
there is not already a pending timer. Once set, it is never cleared, so it will fire,
typically a few msec later, initiating a retry flow. If IO operations are frequent,
there will be a timeout every few msec (about 20 times a second for typical timeout values).
o2iblnd uses fast reg MRs to write data to the target system, then local invalidate
operations to invalidate the MR; it then increments the key portion of the rkey, resets
the map, and does a reg mr operation. Retry flows cause the local invalidate and reg MR
operations to be re-executed over and over again. A single retry can cause half a dozen
invalidate operations to be run with various rkeys, and they mostly fail because they don't
match the current MR. This results in Lustre crashing.
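
The o2iblnd cycle is roughly the following (a condensed sketch using the
kernel verbs calls, not o2iblnd's actual code; assumes qp, mr and the sg
list already exist, and omits error handling and wr chaining):

	struct ib_reg_wr reg_wr = {};
	struct ib_send_wr inv_wr = {};

	/* register the fast reg MR over the current page map */
	ib_map_mr_sg(mr, sg, sg_nents, NULL, PAGE_SIZE);
	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
	ib_post_send(qp, &reg_wr.wr, NULL);

	/* ... control message hands mr->rkey to the peer, which RDMA_WRITEs ... */

	/* invalidate, then bump the key portion of the rkey for the next cycle */
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.ex.invalidate_rkey = mr->rkey;
	ib_post_send(qp, &inv_wr, NULL);
	ib_update_fast_reg_key(mr, mr->rkey + 1);

A retry that replays the IB_WR_LOCAL_INV after ib_update_fast_reg_key() has
run carries a stale rkey, which is exactly the failure we see.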

Currently I am actually happy that the unneeded retries are happening, because they make
testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
after an operation is completed, which would greatly reduce the number of retries. Also,
it will be important to figure out how the IBA intended local invalidates and reg MRs to
work. As they are now, they cannot be successfully retried. Marking them as done
and skipping them in the retry sequence does not work either. (It breaks some of the
blktests test cases.)
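
Something like the following in the completer, run when an ack retires work,
would avoid most of the stray timeouts (an untested sketch of the idea; the
helper name is hypothetical):

	/* untested sketch: stop the retry timer when nothing is
	 * outstanding; the requester re-arms it on the next send
	 */
	if (send_queue_drained(qp))	/* hypothetical helper */
		del_timer(&qp->retrans_timer);
	else
		mod_timer(&qp->retrans_timer,
			  jiffies + qp->timeout_jiffies);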

> You know, your analysis is not always correct.
> To support your analysis, please show us a solid example.
> 
> Zhu Yanjun
> 
>>
>> sq:    some operation that times out
>>     bind mw to mr
>>     some other operation
>>     invalidate mw
>>     invalidate mr
>>
>> can't be replayed, because invalidating the mr makes the replayed bind fail.
>> There are lots of other examples where things go wrong.
>>
>> To make things worse, the send queue timer is never cleared, and for typical
>> timeout values it goes off every few msec whether or not anything actually failed.
>>
>> Bob
> 



* Re: bug report for rdma_rxe
  2022-04-25 16:58   ` Bob Pearson
@ 2022-04-25 22:58     ` Jason Gunthorpe
  2022-04-26  1:40       ` Bob Pearson
  2022-04-28 13:31       ` Bob Pearson
  2022-04-26  3:10     ` Zhu Yanjun
  1 sibling, 2 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2022-04-25 22:58 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Yanjun Zhu, Zhu Yanjun, linux-rdma

On Mon, Apr 25, 2022 at 11:58:55AM -0500, Bob Pearson wrote:
> On 4/24/22 19:04, Yanjun Zhu wrote:
> > On 2022/4/23 5:04, Bob Pearson wrote:
> >> Local operations in the rdma_rxe driver are not obviously idempotent. But the
> >> RC retry mechanism backs up the send queue to the wqe that is currently being
> >> acknowledged and re-walks the sq. Each send or write operation is retried,
> >> except that the first one is truncated by the packets that have already been
> >> acknowledged. Each read and atomic operation is resent, except that read data
> >> already received for the first wqe is not requested again. But all of the
> >> local operations are replayed. The problem is local invalidate, which is destructive.
> >> For example
> > 
> > Is there any example or just your analysis?
> 
> I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
> highly reliable network, so they do not expect to see dropped or out-of-order packets. But they
> see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
> and verified that the timeout code in rxe behaves as follows: when a new RC operation is
> sent, the retry timer is modified to go off at jiffies + qp->timeout_jiffies, but only if
> there is not already a pending timer. Once set, it is never cleared, so it will fire,
> typically a few msec later, initiating a retry flow. If IO operations are frequent,
> there will be a timeout every few msec (about 20 times a second for typical timeout values).
> o2iblnd uses fast reg MRs to write data to the target system, then local invalidate
> operations to invalidate the MR; it then increments the key portion of the rkey, resets
> the map, and does a reg mr operation. Retry flows cause the local invalidate and reg MR
> operations to be re-executed over and over again. A single retry can cause half a dozen
> invalidate operations to be run with various rkeys, and they mostly fail because they don't
> match the current MR. This results in Lustre crashing.
> 
> Currently I am actually happy that the unneeded retries are happening, because they make
> testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
> after an operation is completed, which would greatly reduce the number of retries. Also,
> it will be important to figure out how the IBA intended local invalidates and reg MRs to
> work. As they are now, they cannot be successfully retried. Marking them as done
> and skipping them in the retry sequence does not work either. (It breaks some of the
> blktests test cases.)

local operations on a QP are not supposed to need retry because they
are not supposed to go on the network, so backing up the sq past its
current position should not re-execute any local operations until the
sq passes its actual head.

Or, stated differently, you have a head/tail pointer for local work
and a head/tail pointer for network work and the two track
independently within the defined ordering constraints.
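
Illustratively (not rxe's current layout, just the shape of the idea):

	/* separate cursors so a retry rewinds only network work */
	struct sq_cursors {
		u32 net_next;	/* next wqe to go on the wire; rewound on retry */
		u32 net_acked;	/* oldest wqe still waiting for an ack */
		u32 local_next;	/* next local op to execute; never rewound */
	};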

Jason


* Re: bug report for rdma_rxe
  2022-04-25 22:58     ` Jason Gunthorpe
@ 2022-04-26  1:40       ` Bob Pearson
  2022-04-26 11:42         ` Jason Gunthorpe
  2022-04-28 13:31       ` Bob Pearson
  1 sibling, 1 reply; 10+ messages in thread
From: Bob Pearson @ 2022-04-26  1:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Yanjun Zhu, Zhu Yanjun, linux-rdma

On 4/25/22 17:58, Jason Gunthorpe wrote:
> On Mon, Apr 25, 2022 at 11:58:55AM -0500, Bob Pearson wrote:
>> On 4/24/22 19:04, Yanjun Zhu wrote:
>>> On 2022/4/23 5:04, Bob Pearson wrote:
>>>> Local operations in the rdma_rxe driver are not obviously idempotent. But the
>>>> RC retry mechanism backs up the send queue to the wqe that is currently being
>>>> acknowledged and re-walks the sq. Each send or write operation is retried,
>>>> except that the first one is truncated by the packets that have already been
>>>> acknowledged. Each read and atomic operation is resent, except that read data
>>>> already received for the first wqe is not requested again. But all of the
>>>> local operations are replayed. The problem is local invalidate, which is destructive.
>>>> For example
>>>
>>> Is there any example or just your analysis?
>>
>> I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
>> highly reliable network, so they do not expect to see dropped or out-of-order packets. But they
>> see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
>> and verified that the timeout code in rxe behaves as follows: when a new RC operation is
>> sent, the retry timer is modified to go off at jiffies + qp->timeout_jiffies, but only if
>> there is not already a pending timer. Once set, it is never cleared, so it will fire,
>> typically a few msec later, initiating a retry flow. If IO operations are frequent,
>> there will be a timeout every few msec (about 20 times a second for typical timeout values).
>> o2iblnd uses fast reg MRs to write data to the target system, then local invalidate
>> operations to invalidate the MR; it then increments the key portion of the rkey, resets
>> the map, and does a reg mr operation. Retry flows cause the local invalidate and reg MR
>> operations to be re-executed over and over again. A single retry can cause half a dozen
>> invalidate operations to be run with various rkeys, and they mostly fail because they don't
>> match the current MR. This results in Lustre crashing.
>>
>> Currently I am actually happy that the unneeded retries are happening, because they make
>> testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
>> after an operation is completed, which would greatly reduce the number of retries. Also,
>> it will be important to figure out how the IBA intended local invalidates and reg MRs to
>> work. As they are now, they cannot be successfully retried. Marking them as done
>> and skipping them in the retry sequence does not work either. (It breaks some of the
>> blktests test cases.)
> 
> local operations on a QP are not supposed to need retry because they
> are not supposed to go on the network, so backing up the sq past its
> current position should not re-execute any local operations until the
> sq passes its actual head.
> 
> Or, stated differently, you have a head/tail pointer for local work
> and a head/tail pointer for network work and the two track
> independently within the defined ordering constraints.
> 
> Jason

Imagine a very long RDMA read operation that times out several times before finally
getting all the data returned to the requester. Now imagine it is followed by some
small RDMA ops to a different node that use fast reg MRs and are executed by the
other node after receiving a small control message. E.g.

	node1					node2					node3

1:	Send: RDMA_READ(mr1 to node2)
						RDMA_READ_REPLY(mr1@node1, 1of2)
	ib_map_mr_sg(mr2a local)
	Send: IB_WR_REG_MR(mr2a local)
	Send: Control msg (mr2a to node3)
											Send: RDMA_WRITE(mr2a@node1)
	Send: IB_WR_LOCAL_INV(mr2a local)
	ib_update_fast_reg_key(mr2a->mr2b)
	ib_map_mr_sg(mr2b local)
	Send: Control msg (mr2b to node3)
											Send: RDMA_WRITE(mr2b@node1)
	Timeout: replay from 1 (w/o local ops)
	Send: RDMA_READ(mr1 to node2)
						RDMA_READ_REPLY(mr1@node1, 2of2)
	Send: Control msg (mr2a to node3)
											Send: RDMA_WRITE(mr2a@node1)
											FAILS because mr2a has been
											replaced by mr2b.
On the other hand, if we replay the REG_MR local command, that won't work either,
because we would not know to rerun the ib_map_mr_sg() call.

I don't see an easy way to fix this.

What does help is to always set the fence bit on all local operations.
They will then hold off until the prior op completes, but this is a lot
slower than just pipelining. It is also cheating a little: node 1 has to
know that the previous RDMA_WRITE to mr2a is done before reusing the mr.
That implies that either it looks at the data (not really OK), or it has
waited for the completion (which implies the send queue is done up to
that point), or the fence bit is set in the local invalidate op.
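
In wr terms that is just (a sketch, reusing the inv_wr/mr names from the
flow above):

	/* fence the local invalidate so it waits for everything
	 * ahead of it on this send queue
	 */
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.ex.invalidate_rkey = mr->rkey;
	inv_wr.send_flags = IB_SEND_FENCE | IB_SEND_SIGNALED;
	ib_post_send(qp, &inv_wr, NULL);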

Bob


* Re: bug report for rdma_rxe
  2022-04-25 16:58   ` Bob Pearson
  2022-04-25 22:58     ` Jason Gunthorpe
@ 2022-04-26  3:10     ` Zhu Yanjun
  1 sibling, 0 replies; 10+ messages in thread
From: Zhu Yanjun @ 2022-04-26  3:10 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Yanjun Zhu, Jason Gunthorpe, linux-rdma

On Tue, Apr 26, 2022 at 12:58 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 4/24/22 19:04, Yanjun Zhu wrote:
> > On 2022/4/23 5:04, Bob Pearson wrote:
> >> Local operations in the rdma_rxe driver are not obviously idempotent. But the
> >> RC retry mechanism backs up the send queue to the wqe that is currently being
> >> acknowledged and re-walks the sq. Each send or write operation is retried,
> >> except that the first one is truncated by the packets that have already been
> >> acknowledged. Each read and atomic operation is resent, except that read data
> >> already received for the first wqe is not requested again. But all of the
> >> local operations are replayed. The problem is local invalidate, which is destructive.
> >> For example
> >
> > Is there any example or just your analysis?
>
> I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
> highly reliable network, so they do not expect to see dropped or out-of-order packets. But they
> see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
> and verified that the timeout code in rxe behaves as follows: when a new RC operation is
> sent, the retry timer is modified to go off at jiffies + qp->timeout_jiffies, but only if
> there is not already a pending timer. Once set, it is never cleared, so it will fire,
> typically a few msec later, initiating a retry flow. If IO operations are frequent,
> there will be a timeout every few msec (about 20 times a second for typical timeout values).
> o2iblnd uses fast reg MRs to write data to the target system, then local invalidate
> operations to invalidate the MR; it then increments the key portion of the rkey, resets
> the map, and does a reg mr operation. Retry flows cause the local invalidate and reg MR
> operations to be re-executed over and over again. A single retry can cause half a dozen
> invalidate operations to be run with various rkeys, and they mostly fail because they don't
> match the current MR. This results in Lustre crashing.
>
> Currently I am actually happy that the unneeded retries are happening, because they make
> testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
> after an operation is completed, which would greatly reduce the number of retries. Also,

Is this retry triggered by RDMA/RXE or by some application? And is the
mr that is used the original one, or is a new one allocated?

If this retry is triggered by RDMA/RXE, and the original mr is freed
and a new one is allocated, then this is very similar to ODP.
ODP is currently not supported in RXE, but the behavior is very similar.

Zhu Yanjun

> it will be important to figure out how the IBA intended local invalidates and reg MRs to
> work. As they are now, they cannot be successfully retried. Marking them as done
> and skipping them in the retry sequence does not work either. (It breaks some of the
> blktests test cases.)
>
> > You know, your analysis is not always correct.
> > To support your analysis, please show us a solid example.
> >
> > Zhu Yanjun
> >
> >>
> >> sq:    some operation that times out
> >>     bind mw to mr
> >>     some other operation
> >>     invalidate mw
> >>     invalidate mr
> >>
> >> can't be replayed, because invalidating the mr makes the replayed bind fail.
> >> There are lots of other examples where things go wrong.
> >>
> >> To make things worse, the send queue timer is never cleared, and for typical
> >> timeout values it goes off every few msec whether or not anything actually failed.
> >>
> >> Bob
> >
>


* Re: bug report for rdma_rxe
  2022-04-26  1:40       ` Bob Pearson
@ 2022-04-26 11:42         ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2022-04-26 11:42 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Yanjun Zhu, Zhu Yanjun, linux-rdma

On Mon, Apr 25, 2022 at 08:40:30PM -0500, Bob Pearson wrote:
> On 4/25/22 17:58, Jason Gunthorpe wrote:

> Imagine a very long RDMA read operation that times out several times before finally
> getting all the data returned to the requester. Now imagine it is followed by some
> small RDMA ops to a different node that use fast reg MRs and are executed by the
> other node after receiving a small control message. E.g.
> 
> 	node1					node2					node3
> 
> 1:	Send: RDMA_READ(mr1 to node2)
> 						RDMA_READ_REPLY(mr1@node1, 1of2)
> 	ib_map_mr_sg(mr2a local)
> 	Send: IB_WR_REG_MR(mr2a local)
> 	Send: Control msg (mr2a to node3)
> 											Send: RDMA_WRITE(mr2a@node1)
> 	Send: IB_WR_LOCAL_INV(mr2a local)
> 	ib_update_fast_reg_key(mr2a->mr2b)
> 	ib_map_mr_sg(mr2b local)
> 	Send: Control msg (mr2b to node3)
> 											Send: RDMA_WRITE(mr2b@node1)
> 	Timeout: replay from 1 (w/o local ops)
> 	Send: RDMA_READ(mr1 to node2)
> 						RDMA_READ_REPLY(mr1@node1, 2of2)
> 	Send: Control msg (mr2a to node3)
> 											Send: RDMA_WRITE(mr2a@node1)
> 											FAILS because mr2a has been
> 											replaced by mr2b.
> On the other hand, if we replay the REG_MR local command, that won't work either,
> because we would not know to rerun the ib_map_mr_sg() call.

How did you get two destination nodes into an RC send queue? We have
SRQ not SSQ.

In any event, the above is a buggy ULP. The IB_WR_LOCAL_INV cannot be
posted until the CQE for the Send with mr2a is received (or possibly a
strong fence is used).

It follows the general rule that the ULP cannot alter the data memory
under a WQE until it sees the CQE for that WQE, to know the NIC has
finished with the memory.
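
As a sketch, the safe pattern is (illustrative only; the wr_id and rkey
names are made up):

	struct ib_wc wc;

	/* only invalidate once the send CQE that consumed the rkey
	 * has actually been observed
	 */
	while (ib_poll_cq(cq, 1, &wc) > 0) {
		if (wc.wr_id == SEND_MR2A_WRID && wc.status == IB_WC_SUCCESS) {
			inv_wr.opcode = IB_WR_LOCAL_INV;
			inv_wr.ex.invalidate_rkey = mr2a_rkey;
			ib_post_send(qp, &inv_wr, NULL);
		}
	}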

Jason


* Re: bug report for rdma_rxe
  2022-04-25 22:58     ` Jason Gunthorpe
  2022-04-26  1:40       ` Bob Pearson
@ 2022-04-28 13:31       ` Bob Pearson
  2022-04-28 14:29         ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Bob Pearson @ 2022-04-28 13:31 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Yanjun Zhu, Zhu Yanjun, linux-rdma, jhack

On 4/25/22 17:58, Jason Gunthorpe wrote:
> On Mon, Apr 25, 2022 at 11:58:55AM -0500, Bob Pearson wrote:
>> On 4/24/22 19:04, Yanjun Zhu wrote:
>>> On 2022/4/23 5:04, Bob Pearson wrote:
>>>> Local operations in the rdma_rxe driver are not obviously idempotent. But the
>>>> RC retry mechanism backs up the send queue to the wqe that is currently being
>>>> acknowledged and re-walks the sq. Each send or write operation is retried,
>>>> except that the first one is truncated by the packets that have already been
>>>> acknowledged. Each read and atomic operation is resent, except that read data
>>>> already received for the first wqe is not requested again. But all of the
>>>> local operations are replayed. The problem is local invalidate, which is destructive.
>>>> For example
>>>
>>> Is there any example or just your analysis?
>>
>> I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
>> highly reliable network, so they do not expect to see dropped or out-of-order packets. But they
>> see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
>> and verified that the timeout code in rxe behaves as follows: when a new RC operation is
>> sent, the retry timer is modified to go off at jiffies + qp->timeout_jiffies, but only if
>> there is not already a pending timer. Once set, it is never cleared, so it will fire,
>> typically a few msec later, initiating a retry flow. If IO operations are frequent,
>> there will be a timeout every few msec (about 20 times a second for typical timeout values).
>> o2iblnd uses fast reg MRs to write data to the target system, then local invalidate
>> operations to invalidate the MR; it then increments the key portion of the rkey, resets
>> the map, and does a reg mr operation. Retry flows cause the local invalidate and reg MR
>> operations to be re-executed over and over again. A single retry can cause half a dozen
>> invalidate operations to be run with various rkeys, and they mostly fail because they don't
>> match the current MR. This results in Lustre crashing.
>>
>> Currently I am actually happy that the unneeded retries are happening, because they make
>> testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
>> after an operation is completed, which would greatly reduce the number of retries. Also,
>> it will be important to figure out how the IBA intended local invalidates and reg MRs to
>> work. As they are now, they cannot be successfully retried. Marking them as done
>> and skipping them in the retry sequence does not work either. (It breaks some of the
>> blktests test cases.)
> 
> local operations on a QP are not supposed to need retry because they
> are not supposed to go on the network, so backing up the sq past its
> current position should not re-execute any local operations until the
> sq passes its actual head.
> 
> Or, stated differently, you have a head/tail pointer for local work
> and a head/tail pointer for network work and the two track
> independently within the defined ordering constraints.
> 
> Jason

This is a strong constraint on the send queue, but I suspect it is the only sane solution.
It implies that if local operations are not re-executed, the verbs consumer
must guarantee that it can safely change the MR/MW state as soon as the operation has been
executed for the first time. This means that either there is a fence or the consumer has seen
the completion of all IO operations that depend on the memory. It is not clear whether all
test cases obey these rules. We should WARN in those situations where
we can see a violation.

There is another source of errors in the driver that we now suspect. The send queue is
variously owned by three or more threads: the verbs API calls which post operations,
the requester tasklet which turns wqes into RoCE request packets, and the completer tasklet
which responds to RoCE ack/read-reply packets (for RC) or marks the wqe as done (for UD/UC).
Both tasklets read and write wqe fields but do not use any locking to enforce consistency.
For normal flows this is mostly OK, because each wqe is accessed first only by the requester
tasklet and later only by the completer tasklet. But in retry flows the two can overlap.
There needs to be clear ownership of the wqes by one tasklet or the other, with memory
barriers at the hand-offs.

The wqe->state variable should indicate which tasklet owns the wqe, and a lock should be
held whenever the state is loaded or changed. The retry prep routine req_retry() should hold
the lock for the duration of re-marking the wqes.
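
Roughly (a proposal sketch, not current rxe code; the index and helper
names are placeholders):

	unsigned long flags;
	unsigned int i;
	struct rxe_send_wqe *wqe;

	spin_lock_irqsave(&qp->sq.sq_lock, flags);
	for (i = first_unacked; i != sq_head; i = next_index(i)) {
		wqe = wqe_at(qp, i);	/* placeholder helper */
		/* hand the wqe back to the requester tasklet */
		wqe->state = wqe_state_posted;
	}
	spin_unlock_irqrestore(&qp->sq.sq_lock, flags);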

Bob


* Re: bug report for rdma_rxe
  2022-04-28 13:31       ` Bob Pearson
@ 2022-04-28 14:29         ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2022-04-28 14:29 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Yanjun Zhu, Zhu Yanjun, linux-rdma, jhack

On Thu, Apr 28, 2022 at 08:31:24AM -0500, Bob Pearson wrote:

> This is a strong constraint on the send queue, but I suspect it is the only sane solution.
> It implies that if local operations are not re-executed, the verbs consumer
> must guarantee that it can safely change the MR/MW state as soon as the operation has been
> executed for the first time. This means that either there is a fence or the consumer has seen
> the completion of all IO operations that depend on the memory. It is not clear whether all
> test cases obey these rules. We should WARN in those situations where
> we can see a violation.

The spec defines the fencing requirements for this already, see for
instance "9.4.1.1.1 Invalidate Operation Ordering":

 3) a SEND with Invalidate operation may impact a previous RDMA
 READ operation. Thus, a requester should not perform a SEND with
 Invalidate while previous RDMA READ operations are still
 outstanding. The requester can set the Fence attribute on a given
 work request such as a SEND with Invalidate in order to ensure
 that previous outstanding RDMA READ operations have completed
 before initiating a subsequent SEND with Invalidate operation.
 
I have no doubt we have subtle ULP bugs here; we've historically had
bugs with ULPs doing invalidation wrong. The usual mistake is
assuming that the recv completion is sufficient to trigger
invalidation. It is not: the ULP must also see the send
completion consuming the rkey before it triggers invalidation.

It is not guaranteed that the SQ completion will arrive before the RQ
completion, even though it seems like that causally has to be true;
lost packets and other abnormal cases cause problems.

Jason

