All of lore.kernel.org
 help / color / mirror / Atom feed
* ibv_req_notify_cq clarification
@ 2021-02-18  9:13 Gal Pressman
  2021-02-18 12:38 ` Bernard Metzler
  2021-02-18 12:53 ` Jason Gunthorpe
  0 siblings, 2 replies; 27+ messages in thread
From: Gal Pressman @ 2021-02-18  9:13 UTC (permalink / raw)
  To: RDMA mailing list

I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
"Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
added to the completion channel associated with the CQ."

What is considered a new CQE in this case?
The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
by the user's poll cq?
Or any new CQE from the device's perspective?

For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
completions, but the user hasn't polled his CQ yet, when should he be notified?
On the 101 completion or immediately (since there are completions waiting on the
CQ)?

Thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18  9:13 ibv_req_notify_cq clarification Gal Pressman
@ 2021-02-18 12:38 ` Bernard Metzler
  2021-02-18 12:47   ` Gal Pressman
  2021-02-18 13:59   ` Bernard Metzler
  2021-02-18 12:53 ` Jason Gunthorpe
  1 sibling, 2 replies; 27+ messages in thread
From: Bernard Metzler @ 2021-02-18 12:38 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

-----"Gal Pressman" <galpress@amazon.com> wrote: -----

>To: "RDMA mailing list" <linux-rdma@vger.kernel.org>
>From: "Gal Pressman" <galpress@amazon.com>
>Date: 02/18/2021 11:26AM
>Subject: [EXTERNAL] ibv_req_notify_cq clarification
>
>I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>"Upon the addition of a new CQ entry (CQE) to cq, a completion event
>will be
>added to the completion channel associated with the CQ."
>
>What is considered a new CQE in this case?
>The next CQE from the user's perspective, i.e. any new CQE that
>wasn't consumed
>by the user's poll cq?
>Or any new CQE from the device's perspective?
>
>For example, if at the time of ibv_req_notify_cq() call the CQ has
>received 100
>completions, but the user hasn't polled his CQ yet, when should he be
>notified?
>On the 101 completion or immediately (since there are completions
>waiting on the
>CQ)?

It is from drivers perspective. So notification when 101
became available and CQ is marked for notification.
Applications tend to poll the CQ after requesting
notification, since there might be a race
from appl perspective...




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 12:38 ` Bernard Metzler
@ 2021-02-18 12:47   ` Gal Pressman
  2021-02-18 13:59   ` Bernard Metzler
  1 sibling, 0 replies; 27+ messages in thread
From: Gal Pressman @ 2021-02-18 12:47 UTC (permalink / raw)
  To: Bernard Metzler; +Cc: RDMA mailing list

On 18/02/2021 14:38, Bernard Metzler wrote:
> -----"Gal Pressman" <galpress@amazon.com> wrote: -----
> 
>> To: "RDMA mailing list" <linux-rdma@vger.kernel.org>
>> From: "Gal Pressman" <galpress@amazon.com>
>> Date: 02/18/2021 11:26AM
>> Subject: [EXTERNAL] ibv_req_notify_cq clarification
>>
>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event
>> will be
>> added to the completion channel associated with the CQ."
>>
>> What is considered a new CQE in this case?
>> The next CQE from the user's perspective, i.e. any new CQE that
>> wasn't consumed
>> by the user's poll cq?
>> Or any new CQE from the device's perspective?
>>
>> For example, if at the time of ibv_req_notify_cq() call the CQ has
>> received 100
>> completions, but the user hasn't polled his CQ yet, when should he be
>> notified?
>> On the 101 completion or immediately (since there are completions
>> waiting on the
>> CQ)?
> 
> It is from drivers perspective. So notification when 101
> became available and CQ is marked for notification.
> Applications tend to poll the CQ after requesting
> notification, since there might be a race
> from appl perspective...

Thanks Bernard,

Looking at the implementation of rdma-core providers, most of them pass the CQ
consumer index as part of the arm doorbell which made me think it arms the CQ to
notify on a completion newer than what the app polled, not on a completion newer
than the producer index.
Any idea why the consumer index is needed if that's not the case?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18  9:13 ibv_req_notify_cq clarification Gal Pressman
  2021-02-18 12:38 ` Bernard Metzler
@ 2021-02-18 12:53 ` Jason Gunthorpe
  2021-02-18 15:52   ` Gal Pressman
  1 sibling, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-18 12:53 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
> added to the completion channel associated with the CQ."
> 
> What is considered a new CQE in this case?
> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
> by the user's poll cq?
> Or any new CQE from the device's perspective?

new CQE from the device perspective.

> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
> completions, but the user hasn't polled his CQ yet, when should he be notified?
> On the 101 completion or immediately (since there are completions waiting on the
> CQ)?

101 completion

It is only meaningful to call it when the CQ is empty.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: ibv_req_notify_cq clarification
  2021-02-18 12:38 ` Bernard Metzler
  2021-02-18 12:47   ` Gal Pressman
@ 2021-02-18 13:59   ` Bernard Metzler
  1 sibling, 0 replies; 27+ messages in thread
From: Bernard Metzler @ 2021-02-18 13:59 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

-----"Gal Pressman" <galpress@amazon.com> wrote: -----

>To: "Bernard Metzler" <BMT@zurich.ibm.com>
>From: "Gal Pressman" <galpress@amazon.com>
>Date: 02/18/2021 01:47PM
>Cc: "RDMA mailing list" <linux-rdma@vger.kernel.org>
>Subject: [EXTERNAL] Re: ibv_req_notify_cq clarification
>
>On 18/02/2021 14:38, Bernard Metzler wrote:
>> -----"Gal Pressman" <galpress@amazon.com> wrote: -----
>> 
>>> To: "RDMA mailing list" <linux-rdma@vger.kernel.org>
>>> From: "Gal Pressman" <galpress@amazon.com>
>>> Date: 02/18/2021 11:26AM
>>> Subject: [EXTERNAL] ibv_req_notify_cq clarification
>>>
>>> I'm a bit confused about the meaning of the ibv_req_notify_cq()
>verb:
>>> "Upon the addition of a new CQ entry (CQE) to cq, a completion
>event
>>> will be
>>> added to the completion channel associated with the CQ."
>>>
>>> What is considered a new CQE in this case?
>>> The next CQE from the user's perspective, i.e. any new CQE that
>>> wasn't consumed
>>> by the user's poll cq?
>>> Or any new CQE from the device's perspective?
>>>
>>> For example, if at the time of ibv_req_notify_cq() call the CQ has
>>> received 100
>>> completions, but the user hasn't polled his CQ yet, when should he
>be
>>> notified?
>>> On the 101 completion or immediately (since there are completions
>>> waiting on the
>>> CQ)?
>> 
>> It is from drivers perspective. So notification when 101
>> became available and CQ is marked for notification.
>> Applications tend to poll the CQ after requesting
>> notification, since there might be a race
>> from appl perspective...
>
>Thanks Bernard,
>
>Looking at the implementation of rdma-core providers, most of them
>pass the CQ
>consumer index as part of the arm doorbell which made me think it
>arms the CQ to
>notify on a completion newer than what the app polled, not on a
>completion newer
>than the producer index.
>Any idea why the consumer index is needed if that's not the case?
>
...maybe to do what you suggest - just kicking the completion
handler if it appears the application has missed CQE's when the
CQ was not armed? That would be in the gray zone of having
a provider which wants to be smarter than the application,
since this feature is not part of the verbs specification.
I'd suggest applications better to rely on the
IB_CQ_REPORT_MISSED_EVENTS flag if it doesn't want to
pull unconditionally.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 12:53 ` Jason Gunthorpe
@ 2021-02-18 15:52   ` Gal Pressman
  2021-02-18 16:23     ` Jason Gunthorpe
  2021-02-18 18:47     ` Bernard Metzler
  0 siblings, 2 replies; 27+ messages in thread
From: Gal Pressman @ 2021-02-18 15:52 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: RDMA mailing list

On 18/02/2021 14:53, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
>> added to the completion channel associated with the CQ."
>>
>> What is considered a new CQE in this case?
>> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
>> by the user's poll cq?
>> Or any new CQE from the device's perspective?
> 
> new CQE from the device perspective.
> 
>> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
>> completions, but the user hasn't polled his CQ yet, when should he be notified?
>> On the 101 completion or immediately (since there are completions waiting on the
>> CQ)?
> 
> 101 completion
> 
> It is only meaningful to call it when the CQ is empty.

Thanks, so there's an inherent race between the user's CQ poll and the next arm?

Do you know what's the purpose of the consumer index in the arm doorbell that's
implemented by many providers?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 15:52   ` Gal Pressman
@ 2021-02-18 16:23     ` Jason Gunthorpe
  2021-02-18 22:22       ` Tom Talpey
  2021-02-21  9:25       ` Gal Pressman
  2021-02-18 18:47     ` Bernard Metzler
  1 sibling, 2 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-18 16:23 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
> On 18/02/2021 14:53, Jason Gunthorpe wrote:
> > On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
> >> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
> >> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
> >> added to the completion channel associated with the CQ."
> >>
> >> What is considered a new CQE in this case?
> >> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
> >> by the user's poll cq?
> >> Or any new CQE from the device's perspective?
> > 
> > new CQE from the device perspective.
> > 
> >> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
> >> completions, but the user hasn't polled his CQ yet, when should he be notified?
> >> On the 101 completion or immediately (since there are completions waiting on the
> >> CQ)?
> > 
> > 101 completion
> > 
> > It is only meaningful to call it when the CQ is empty.
> 
> Thanks, so there's an inherent race between the user's CQ poll and the next arm?

I think the specs or man pages talk about this, the application has to
observe empty, do arm, then poll again then sleep on the cq if empty.

> Do you know what's the purpose of the consumer index in the arm doorbell that's
> implemented by many providers?

The consumer index is needed by HW to prevent CQ overflow, presumably
the drivers push to reduce the cases where the HW has to read it from
PCI

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: ibv_req_notify_cq clarification
  2021-02-18 15:52   ` Gal Pressman
  2021-02-18 16:23     ` Jason Gunthorpe
@ 2021-02-18 18:47     ` Bernard Metzler
  1 sibling, 0 replies; 27+ messages in thread
From: Bernard Metzler @ 2021-02-18 18:47 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Gal Pressman, RDMA mailing list

-----"Jason Gunthorpe" <jgg@ziepe.ca> wrote: -----

>To: "Gal Pressman" <galpress@amazon.com>
>From: "Jason Gunthorpe" <jgg@ziepe.ca>
>Date: 02/18/2021 07:35PM
>Cc: "RDMA mailing list" <linux-rdma@vger.kernel.org>
>Subject: [EXTERNAL] Re: ibv_req_notify_cq clarification
>
>On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
>> On 18/02/2021 14:53, Jason Gunthorpe wrote:
>> > On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
>> >> I'm a bit confused about the meaning of the ibv_req_notify_cq()
>verb:
>> >> "Upon the addition of a new CQ entry (CQE) to cq, a completion
>event will be
>> >> added to the completion channel associated with the CQ."
>> >>
>> >> What is considered a new CQE in this case?
>> >> The next CQE from the user's perspective, i.e. any new CQE that
>wasn't consumed
>> >> by the user's poll cq?
>> >> Or any new CQE from the device's perspective?
>> > 
>> > new CQE from the device perspective.
>> > 
>> >> For example, if at the time of ibv_req_notify_cq() call the CQ
>has received 100
>> >> completions, but the user hasn't polled his CQ yet, when should
>he be notified?
>> >> On the 101 completion or immediately (since there are
>completions waiting on the
>> >> CQ)?
>> > 
>> > 101 completion
>> > 
>> > It is only meaningful to call it when the CQ is empty.
>> 
>> Thanks, so there's an inherent race between the user's CQ poll and
>the next arm?
>
>I think the specs or man pages talk about this, the application has
>to
>observe empty, do arm, then poll again then sleep on the cq if empty.
>

I'd love to see the IB_CQ_REPORT_MISSED_EVENTS flag mechanics
available for user land verbs. I learned about this potential
race the hard way when a well known distributed FS sometimes
hung ;) 



>> Do you know what's the purpose of the consumer index in the arm
>doorbell that's
>> implemented by many providers?
>
>The consumer index is needed by HW to prevent CQ overflow, presumably
>the drivers push to reduce the cases where the HW has to read it from
>PCI
>
>Jason
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 16:23     ` Jason Gunthorpe
@ 2021-02-18 22:22       ` Tom Talpey
  2021-02-18 22:51         ` Jason Gunthorpe
  2021-02-21  9:25       ` Gal Pressman
  1 sibling, 1 reply; 27+ messages in thread
From: Tom Talpey @ 2021-02-18 22:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Gal Pressman; +Cc: RDMA mailing list

On 2/18/2021 11:23 AM, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
>> On 18/02/2021 14:53, Jason Gunthorpe wrote:
>>> On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
>>>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>>>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
>>>> added to the completion channel associated with the CQ."
>>>>
>>>> What is considered a new CQE in this case?
>>>> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
>>>> by the user's poll cq?
>>>> Or any new CQE from the device's perspective?
>>>
>>> new CQE from the device perspective.
>>>
>>>> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
>>>> completions, but the user hasn't polled his CQ yet, when should he be notified?
>>>> On the 101 completion or immediately (since there are completions waiting on the
>>>> CQ)?
>>>
>>> 101 completion
>>>
>>> It is only meaningful to call it when the CQ is empty.
>>
>> Thanks, so there's an inherent race between the user's CQ poll and the next arm?
> 
> I think the specs or man pages talk about this, the application has to
> observe empty, do arm, then poll again then sleep on the cq if empty.
> 
>> Do you know what's the purpose of the consumer index in the arm doorbell that's
>> implemented by many providers?
> 
> The consumer index is needed by HW to prevent CQ overflow, presumably
> the drivers push to reduce the cases where the HW has to read it from
> PCI

Prevent CQ overflow? There's no such requirement that I'm aware of.
If the consumer doesn't provide a large-enough CQ, then it reaps the
consequences. Same thing for WQ depth, although I am aware that some
verbs implementations attempt to return a kind of EAGAIN when posting
to a send WQ.

What can the provider do if the CQ is "full" anyway? Buffer the CQE
and go into some type of polling loop attempting to redeliver? Ouch!

Tom.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 22:22       ` Tom Talpey
@ 2021-02-18 22:51         ` Jason Gunthorpe
  2021-02-18 23:07           ` Tom Talpey
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-18 22:51 UTC (permalink / raw)
  To: Tom Talpey; +Cc: Gal Pressman, RDMA mailing list

On Thu, Feb 18, 2021 at 05:22:31PM -0500, Tom Talpey wrote:
> On 2/18/2021 11:23 AM, Jason Gunthorpe wrote:
> > On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
> > > On 18/02/2021 14:53, Jason Gunthorpe wrote:
> > > > On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
> > > > > I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
> > > > > "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
> > > > > added to the completion channel associated with the CQ."
> > > > > 
> > > > > What is considered a new CQE in this case?
> > > > > The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
> > > > > by the user's poll cq?
> > > > > Or any new CQE from the device's perspective?
> > > > 
> > > > new CQE from the device perspective.
> > > > 
> > > > > For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
> > > > > completions, but the user hasn't polled his CQ yet, when should he be notified?
> > > > > On the 101 completion or immediately (since there are completions waiting on the
> > > > > CQ)?
> > > > 
> > > > 101 completion
> > > > 
> > > > It is only meaningful to call it when the CQ is empty.
> > > 
> > > Thanks, so there's an inherent race between the user's CQ poll and the next arm?
> > 
> > I think the specs or man pages talk about this, the application has to
> > observe empty, do arm, then poll again then sleep on the cq if empty.
> > 
> > > Do you know what's the purpose of the consumer index in the arm doorbell that's
> > > implemented by many providers?
> > 
> > The consumer index is needed by HW to prevent CQ overflow, presumably
> > the drivers push to reduce the cases where the HW has to read it from
> > PCI
> 
> Prevent CQ overflow? There's no such requirement that I'm aware of.
> If the consumer doesn't provide a large-enough CQ, then it reaps the
> consequences. Same thing for WQ depth, although I am aware that some
> verbs implementations attempt to return a kind of EAGAIN when posting
> to a send WQ.
> 
> What can the provider do if the CQ is "full" anyway? Buffer the CQE
> and go into some type of polling loop attempting to redeliver? Ouch!

QP goes to error, CQE is discarded, IIRC.

Wrapping and overflowing the CQ is not acceptable, it would mean
reading CQEs could never be done reliably.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 22:51         ` Jason Gunthorpe
@ 2021-02-18 23:07           ` Tom Talpey
  2021-02-19  0:45             ` Jason Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Tom Talpey @ 2021-02-18 23:07 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Gal Pressman, RDMA mailing list

On 2/18/2021 5:51 PM, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 05:22:31PM -0500, Tom Talpey wrote:
>> On 2/18/2021 11:23 AM, Jason Gunthorpe wrote:
>>> On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
>>>> On 18/02/2021 14:53, Jason Gunthorpe wrote:
>>>>> On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
>>>>>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>>>>>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
>>>>>> added to the completion channel associated with the CQ."
>>>>>>
>>>>>> What is considered a new CQE in this case?
>>>>>> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
>>>>>> by the user's poll cq?
>>>>>> Or any new CQE from the device's perspective?
>>>>>
>>>>> new CQE from the device perspective.
>>>>>
>>>>>> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
>>>>>> completions, but the user hasn't polled his CQ yet, when should he be notified?
>>>>>> On the 101 completion or immediately (since there are completions waiting on the
>>>>>> CQ)?
>>>>>
>>>>> 101 completion
>>>>>
>>>>> It is only meaningful to call it when the CQ is empty.
>>>>
>>>> Thanks, so there's an inherent race between the user's CQ poll and the next arm?
>>>
>>> I think the specs or man pages talk about this, the application has to
>>> observe empty, do arm, then poll again then sleep on the cq if empty.
>>>
>>>> Do you know what's the purpose of the consumer index in the arm doorbell that's
>>>> implemented by many providers?
>>>
>>> The consumer index is needed by HW to prevent CQ overflow, presumably
>>> the drivers push to reduce the cases where the HW has to read it from
>>> PCI
>>
>> Prevent CQ overflow? There's no such requirement that I'm aware of.
>> If the consumer doesn't provide a large-enough CQ, then it reaps the
>> consequences. Same thing for WQ depth, although I am aware that some
>> verbs implementations attempt to return a kind of EAGAIN when posting
>> to a send WQ.
>>
>> What can the provider do if the CQ is "full" anyway? Buffer the CQE
>> and go into some type of polling loop attempting to redeliver? Ouch!
> 
> QP goes to error, CQE is discarded, IIRC.

What!? There might be many QP's all sharing the same CQ. Put them
*all* into error? And for what, because the CQ is trash anyway. This
sounds like optimizing the error case. Uselessly.

I don't think any of this is in the verbs API. The architecture
was designed such that posting by the consumer (WQ) and provider (CQ)
do not need to perform any reads at all. It's all about lock-free
writing.

> Wrapping and overflowing the CQ is not acceptable, it would mean
> reading CQEs could never be done reliably.

But the provider never reads the CQ, only the consumer can read.
The provider writes to head, ignoring tail. Consumer reads from
tail, and it goes empty when tail == head. And if head overruns
tail, that was the consumer's fault for posting too many WQEs.

Gal, what providers do you see that have this check?

Tom.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 23:07           ` Tom Talpey
@ 2021-02-19  0:45             ` Jason Gunthorpe
  2021-02-19 14:31               ` Tom Talpey
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-19  0:45 UTC (permalink / raw)
  To: Tom Talpey; +Cc: Gal Pressman, RDMA mailing list

On Thu, Feb 18, 2021 at 06:07:13PM -0500, Tom Talpey wrote:
> > > If the consumer doesn't provide a large-enough CQ, then it reaps the
> > > consequences. Same thing for WQ depth, although I am aware that some
> > > verbs implementations attempt to return a kind of EAGAIN when posting
> > > to a send WQ.
> > > 
> > > What can the provider do if the CQ is "full" anyway? Buffer the CQE
> > > and go into some type of polling loop attempting to redeliver? Ouch!
> > 
> > QP goes to error, CQE is discarded, IIRC.
> 
> What!? There might be many QP's all sharing the same CQ. Put them
> *all* into error? And for what, because the CQ is trash anyway. This
> sounds like optimizing the error case. Uselessly.

No, only the QPs that need to push a CQE and can't.

> > Wrapping and overflowing the CQ is not acceptable, it would mean
> > reading CQEs could never be done reliably.
> 
> But the provider never reads the CQ, only the consumer can read.
> The provider writes to head, ignoring tail. Consumer reads from
> tail, and it goes empty when tail == head. And if head overruns
> tail, that was the consumer's fault for posting too many WQEs.

Yes, but if the app makes a mistake you don't want to trash the whole
system. Resiliency says you contain the failure as much as possible
and the app at least has some chance to pick up the pieces.

If the HW corrupts the CQEs while the CPU is reading them then the
whole machine is toast, high chance the kernel will corrupt memory.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-19  0:45             ` Jason Gunthorpe
@ 2021-02-19 14:31               ` Tom Talpey
  2021-02-19 14:42                 ` Jason Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Tom Talpey @ 2021-02-19 14:31 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Gal Pressman, RDMA mailing list

On 2/18/2021 7:45 PM, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 06:07:13PM -0500, Tom Talpey wrote:
>>>> If the consumer doesn't provide a large-enough CQ, then it reaps the
>>>> consequences. Same thing for WQ depth, although I am aware that some
>>>> verbs implementations attempt to return a kind of EAGAIN when posting
>>>> to a send WQ.
>>>>
>>>> What can the provider do if the CQ is "full" anyway? Buffer the CQE
>>>> and go into some type of polling loop attempting to redeliver? Ouch!
>>>
>>> QP goes to error, CQE is discarded, IIRC.
>>
>> What!? There might be many QP's all sharing the same CQ. Put them
>> *all* into error? And for what, because the CQ is trash anyway. This
>> sounds like optimizing the error case. Uselessly.
> 
> No, only the QPs that need to push a CQE and can't.

Hm. Ok, so QP's will drop unpredictably, and their outstanding WQEs
will probably be lost as well, but I can see cases where a CQ slot
might open up while the failed QP is flushing, and CQE's get delivered
out of order. That might be even worse. It would seem safer to stop
writing to the CQ altogether - all QPs.

>>> Wrapping and overflowing the CQ is not acceptable, it would mean
>>> reading CQEs could never be done reliably.
>>
>> But the provider never reads the CQ, only the consumer can read.
>> The provider writes to head, ignoring tail. Consumer reads from
>> tail, and it goes empty when tail == head. And if head overruns
>> tail, that was the consumer's fault for posting too many WQEs.
> 
> Yes, but if the app makes a mistake you don't want to trash the whole
> system. Resiliency says you contain the failure as much as possible
> and the app at least has some chance to pick up the pieces.
> 
> If the HW corrupts the CQEs while the CPU is reading them then the
> whole machine is toast, high chance the kernel will corrupt memory.

That would be a problem, but it's only true if the provider implements
the CQ as a circular buffer. That isn't imposed by the Verbs. The CQ
itself is opaque to the consumer, it's merely a queue with arm and
dequeue operations - no enqueue, no head/tail or other pointers, etc.

So yeah, a provider that made such a choice will need to be careful.
But there are other, possibly better, ways.

Tom.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-19 14:31               ` Tom Talpey
@ 2021-02-19 14:42                 ` Jason Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-19 14:42 UTC (permalink / raw)
  To: Tom Talpey; +Cc: Gal Pressman, RDMA mailing list

On Fri, Feb 19, 2021 at 09:31:22AM -0500, Tom Talpey wrote:
> On 2/18/2021 7:45 PM, Jason Gunthorpe wrote:
> > On Thu, Feb 18, 2021 at 06:07:13PM -0500, Tom Talpey wrote:
> > > > > If the consumer doesn't provide a large-enough CQ, then it reaps the
> > > > > consequences. Same thing for WQ depth, although I am aware that some
> > > > > verbs implementations attempt to return a kind of EAGAIN when posting
> > > > > to a send WQ.
> > > > > 
> > > > > What can the provider do if the CQ is "full" anyway? Buffer the CQE
> > > > > and go into some type of polling loop attempting to redeliver? Ouch!
> > > > 
> > > > QP goes to error, CQE is discarded, IIRC.
> > > 
> > > What!? There might be many QP's all sharing the same CQ. Put them
> > > *all* into error? And for what, because the CQ is trash anyway. This
> > > sounds like optimizing the error case. Uselessly.
> > 
> > No, only the QPs that need to push a CQE and can't.
> 
> Hm. Ok, so QP's will drop unpredictably, and their outstanding WQEs
> will probably be lost as well, but I can see cases where a CQ slot
> might open up while the failed QP is flushing, and CQE's get delivered
> out of order. That might be even worse. It would seem safer to stop
> writing to the CQ altogether - all QPs.

I think the app gets an IBV_EVENT_CQ_ERR and IBV_EVENT_QP_FATAL and
has to clean it up.

> That would be a problem, but it's only true if the provider implements
> the CQ as a circular buffer. 

AFAIK there is no datastructure that allows unbounded writing from the
producer side, the only choices are to halt on overflow or corrupt on
overflow - corrupt breaks the machine, so good HW does halt.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-18 16:23     ` Jason Gunthorpe
  2021-02-18 22:22       ` Tom Talpey
@ 2021-02-21  9:25       ` Gal Pressman
  2021-02-22 13:46         ` Jason Gunthorpe
  1 sibling, 1 reply; 27+ messages in thread
From: Gal Pressman @ 2021-02-21  9:25 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: RDMA mailing list

On 18/02/2021 18:23, Jason Gunthorpe wrote:
> On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
>> On 18/02/2021 14:53, Jason Gunthorpe wrote:
>>> On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
>>>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>>>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
>>>> added to the completion channel associated with the CQ."
>>>>
>>>> What is considered a new CQE in this case?
>>>> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
>>>> by the user's poll cq?
>>>> Or any new CQE from the device's perspective?
>>>
>>> new CQE from the device perspective.
>>>
>>>> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
>>>> completions, but the user hasn't polled his CQ yet, when should he be notified?
>>>> On the 101 completion or immediately (since there are completions waiting on the
>>>> CQ)?
>>>
>>> 101 completion
>>>
>>> It is only meaningful to call it when the CQ is empty.
>>
>> Thanks, so there's an inherent race between the user's CQ poll and the next arm?
> 
> I think the specs or man pages talk about this, the application has to
> observe empty, do arm, then poll again then sleep on the cq if empty.
> 
>> Do you know what's the purpose of the consumer index in the arm doorbell that's
>> implemented by many providers?
> 
> The consumer index is needed by HW to prevent CQ overflow, presumably
> the drivers push to reduce the cases where the HW has to read it from
> PCI

Thanks, that makes sense.

I found the following sentence in CX PRM:
"If new CQEs are posted to the CQ after the reporting of a completion event and
these CQEs are not yet consumed, then an event will be generated immediately
after the request for notification is executed."

Doesn't that contradict the expected behavior?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-21  9:25       ` Gal Pressman
@ 2021-02-22 13:46         ` Jason Gunthorpe
  2021-02-22 15:36           ` Gal Pressman
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 13:46 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

On Sun, Feb 21, 2021 at 11:25:02AM +0200, Gal Pressman wrote:
> On 18/02/2021 18:23, Jason Gunthorpe wrote:
> > On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
> >> On 18/02/2021 14:53, Jason Gunthorpe wrote:
> >>> On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
> >>>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
> >>>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
> >>>> added to the completion channel associated with the CQ."
> >>>>
> >>>> What is considered a new CQE in this case?
> >>>> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
> >>>> by the user's poll cq?
> >>>> Or any new CQE from the device's perspective?
> >>>
> >>> new CQE from the device perspective.
> >>>
> >>>> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
> >>>> completions, but the user hasn't polled his CQ yet, when should he be notified?
> >>>> On the 101 completion or immediately (since there are completions waiting on the
> >>>> CQ)?
> >>>
> >>> 101 completion
> >>>
> >>> It is only meaningful to call it when the CQ is empty.
> >>
> >> Thanks, so there's an inherent race between the user's CQ poll and the next arm?
> > 
> > I think the specs or man pages talk about this, the application has to
> > observe empty, do arm, then poll again then sleep on the cq if empty.
> > 
> >> Do you know what's the purpose of the consumer index in the arm doorbell that's
> >> implemented by many providers?
> > 
> > The consumer index is needed by HW to prevent CQ overflow, presumably
> > the drivers push to reduce the cases where the HW has to read it from
> > PCI
> 
> Thanks, that makes sense.
> 
> I found the following sentence in CX PRM:
> "If new CQEs are posted to the CQ after the reporting of a completion event and
> these CQEs are not yet consumed, then an event will be generated immediately
> after the request for notification is executed."
> 
> Doesn't that contradict the expected behavior?

I read it as confirming it?

Only *new* CQEs trigger an event, and new CQE's always trigger an
event regardless of the full/empty state of the queue.

This paragraph is an obtuse way of warning of the race I described.

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 13:46         ` Jason Gunthorpe
@ 2021-02-22 15:36           ` Gal Pressman
  2021-02-22 15:55             ` Jason Gunthorpe
  2021-02-22 18:38             ` Hefty, Sean
  0 siblings, 2 replies; 27+ messages in thread
From: Gal Pressman @ 2021-02-22 15:36 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: RDMA mailing list

On 22/02/2021 15:46, Jason Gunthorpe wrote:
> On Sun, Feb 21, 2021 at 11:25:02AM +0200, Gal Pressman wrote:
>> On 18/02/2021 18:23, Jason Gunthorpe wrote:
>>> On Thu, Feb 18, 2021 at 05:52:16PM +0200, Gal Pressman wrote:
>>>> On 18/02/2021 14:53, Jason Gunthorpe wrote:
>>>>> On Thu, Feb 18, 2021 at 11:13:43AM +0200, Gal Pressman wrote:
>>>>>> I'm a bit confused about the meaning of the ibv_req_notify_cq() verb:
>>>>>> "Upon the addition of a new CQ entry (CQE) to cq, a completion event will be
>>>>>> added to the completion channel associated with the CQ."
>>>>>>
>>>>>> What is considered a new CQE in this case?
>>>>>> The next CQE from the user's perspective, i.e. any new CQE that wasn't consumed
>>>>>> by the user's poll cq?
>>>>>> Or any new CQE from the device's perspective?
>>>>>
>>>>> new CQE from the device perspective.
>>>>>
>>>>>> For example, if at the time of ibv_req_notify_cq() call the CQ has received 100
>>>>>> completions, but the user hasn't polled his CQ yet, when should he be notified?
>>>>>> On the 101 completion or immediately (since there are completions waiting on the
>>>>>> CQ)?
>>>>>
>>>>> 101 completion
>>>>>
>>>>> It is only meaningful to call it when the CQ is empty.
>>>>
>>>> Thanks, so there's an inherent race between the user's CQ poll and the next arm?
>>>
>>> I think the specs or man pages talk about this, the application has to
>>> observe empty, do arm, then poll again then sleep on the cq if empty.
>>>
>>>> Do you know what's the purpose of the consumer index in the arm doorbell that's
>>>> implemented by many providers?
>>>
>>> The consumer index is needed by HW to prevent CQ overflow, presumably
>>> the drivers push to reduce the cases where the HW has to read it from
>>> PCI
>>
>> Thanks, that makes sense.
>>
>> I found the following sentence in CX PRM:
>> "If new CQEs are posted to the CQ after the reporting of a completion event and
>> these CQEs are not yet consumed, then an event will be generated immediately
>> after the request for notification is executed."
>>
>> Doesn't that contradict the expected behavior?
> 
> I read it as confirming it?
> 
> Only *new* CQEs trigger an event, and new CQE's always trigger an
> event regardless of the full/empty state of the queue.
> 
> This paragraph is an obtuse way of warning of the race I described.

Hmm, yea this sentence is a bit confusing :)..

"Mellanox HCAs keep track of the last index for which the user received an
event. Using this index, it is guaranteed that an event is generated immediately
when a request completion notification is performed and a CQE has already been
reported."

This also sounds weird, why is an event generated for a completion that has
already been reported?

So from my understanding of how this should work, the following code in perftest
(ib_send_bw test) is buggy?:
https://github.com/linux-rdma/perftest/blob/master/src/perftest_resources.c#L2955

Running this with 32 iterations, the client does something like:
- arm cq
- post send x 32
- wait for cq event
- arm cq
- poll cq (once, with batch size of 16)
- no more post send (reached tot_iters)
- wait for cq event (but an event has already been generated?)

And gets stuck?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 15:36           ` Gal Pressman
@ 2021-02-22 15:55             ` Jason Gunthorpe
  2021-02-22 19:24               ` Gal Pressman
  2021-02-22 18:38             ` Hefty, Sean
  1 sibling, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 15:55 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

On Mon, Feb 22, 2021 at 05:36:17PM +0200, Gal Pressman wrote:

> "Mellanox HCAs keep track of the last index for which the user received an
> event. Using this index, it is guaranteed that an event is generated immediately
> when a request completion notification is performed and a CQE has already been
> reported."

I don't think verbs exposes this behavior.
 
> This also sounds weird, why is an event generated for a completion that has
> already been reported?

It eleminates races, if the consumer says 'I read up to X send me an
interrupt if X+1 exists' when X+1 already exists if there is a race
producer has already written it. So send an interrupt.

> So from my understanding of how this should work, the following code in perftest
> (ib_send_bw test) is buggy?:
> https://github.com/linux-rdma/perftest/blob/master/src/perftest_resources.c#L2955
> 
> Running this with 32 iterations, the client does something like:
> - arm cq
> - post send x 32
> - wait for cq event
> - arm cq
> - poll cq (once, with batch size of 16)
> - no more post send (reached tot_iters)
> - wait for cq event (but an event has already been generated?)

I don't know much about perf-test, but in verbs arming a non-empty CQ
is asking for trouble

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: ibv_req_notify_cq clarification
  2021-02-22 15:36           ` Gal Pressman
  2021-02-22 15:55             ` Jason Gunthorpe
@ 2021-02-22 18:38             ` Hefty, Sean
  2021-02-22 19:26               ` Gal Pressman
  1 sibling, 1 reply; 27+ messages in thread
From: Hefty, Sean @ 2021-02-22 18:38 UTC (permalink / raw)
  To: Gal Pressman, Jason Gunthorpe; +Cc: RDMA mailing list

> Running this with 32 iterations, the client does something like:
> - arm cq
> - post send x 32
> - wait for cq event
> - arm cq
> - poll cq (once, with batch size of 16)

This is a race.  The code should continue to read completions until the CQ is drained.

At this point, all completions may have been written to the CQ, and no new events will be generated.

> - no more post send (reached tot_iters)
> - wait for cq event (but an event has already been generated?)
> 
> And gets stuck?

- Sean

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 15:55             ` Jason Gunthorpe
@ 2021-02-22 19:24               ` Gal Pressman
  2021-02-22 19:37                 ` Jason Gunthorpe
  2021-02-22 22:41                 ` Hefty, Sean
  0 siblings, 2 replies; 27+ messages in thread
From: Gal Pressman @ 2021-02-22 19:24 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: RDMA mailing list

On 22/02/2021 17:55, Jason Gunthorpe wrote:
> On Mon, Feb 22, 2021 at 05:36:17PM +0200, Gal Pressman wrote:
> 
>> "Mellanox HCAs keep track of the last index for which the user received an
>> event. Using this index, it is guaranteed that an event is generated immediately
>> when a request completion notification is performed and a CQE has already been
>> reported."
> 
> I don't think verbs exposes this behavior.

So in theory this hardware could generate events that the user app doesn't expect?

>> This also sounds weird, why is an event generated for a completion that has
>> already been reported?
> 
> It eleminates races, if the consumer says 'I read up to X send me an
> interrupt if X+1 exists' when X+1 already exists if there is a race
> producer has already written it. So send an interrupt.

Right, that's what I was getting at in my first question, this isn't the next
completion from the device's perspective.
So in such case the consumer index in the arm doorbell is used to indicate what
should be considered a "new" completion? This is new from the app perspective.

But looking at ibv_ud_pingpong for example, I don't understand how that could
even work.
The test arms the CQ on creation (consumder index 0), calls ibv_get_cq_event(),
wakes up and immediately arms the CQ again (before polling, consumer index is
still 0).
This means that the next ibv_get_cq_event() will wake up immediately, as the CQ
was armed twice with the same consumer index and the first completion already
exists. Surely that's not what's meant to happen?

>> So from my understanding of how this should work, the following code in perftest
>> (ib_send_bw test) is buggy?:
>> https://github.com/linux-rdma/perftest/blob/master/src/perftest_resources.c#L2955
>>
>> Running this with 32 iterations, the client does something like:
>> - arm cq
>> - post send x 32
>> - wait for cq event
>> - arm cq
>> - poll cq (once, with batch size of 16)
>> - no more post send (reached tot_iters)
>> - wait for cq event (but an event has already been generated?)
> 
> I don't know much about perf-test, but in verbs arming a non-empty CQ
> is asking for trouble

Do you have a way to verify whether this test gets stuck? Maybe I am missing
something?

What do you mean by arming a non-empty CQ?
The man pages suggest a scheme where the app should call ibv_get_cq_event()
followed by an ibv_req_notify_cq(), the CQ polling/emptying comes after these,
so at the time of arm the CQ isn't empty.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 18:38             ` Hefty, Sean
@ 2021-02-22 19:26               ` Gal Pressman
  0 siblings, 0 replies; 27+ messages in thread
From: Gal Pressman @ 2021-02-22 19:26 UTC (permalink / raw)
  To: Hefty, Sean, Jason Gunthorpe; +Cc: RDMA mailing list

On 22/02/2021 20:38, Hefty, Sean wrote:
>> Running this with 32 iterations, the client does something like:
>> - arm cq
>> - post send x 32
>> - wait for cq event
>> - arm cq
>> - poll cq (once, with batch size of 16)
> 
> This is a race.  The code should continue to read completions until the CQ is drained.
> 
> At this point, all completions may have been written to the CQ, and no new events will be generated.
> 
>> - no more post send (reached tot_iters)
>> - wait for cq event (but an event has already been generated?)
>>
>> And gets stuck?
> 
> - Sean
> 

That's what I suspected, but I assume people run this and it works fine on
different hardware?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 19:24               ` Gal Pressman
@ 2021-02-22 19:37                 ` Jason Gunthorpe
  2021-02-23 12:18                   ` Gal Pressman
  2021-02-22 22:41                 ` Hefty, Sean
  1 sibling, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-22 19:37 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

On Mon, Feb 22, 2021 at 09:24:18PM +0200, Gal Pressman wrote:
> On 22/02/2021 17:55, Jason Gunthorpe wrote:
> > On Mon, Feb 22, 2021 at 05:36:17PM +0200, Gal Pressman wrote:
> > 
> >> "Mellanox HCAs keep track of the last index for which the user received an
> >> event. Using this index, it is guaranteed that an event is generated immediately
> >> when a request completion notification is performed and a CQE has already been
> >> reported."
> > 
> > I don't think verbs exposes this behavior.
> 
> So in theory this hardware could generate events that the user app
> doesn't expect?

Not really, it shuffles around how race conditions are resolved.

> But looking at ibv_ud_pingpong for example, I don't understand how
> that could even work.  The test arms the CQ on creation (consumder
> index 0), calls ibv_get_cq_event(), wakes up and immediately arms
> the CQ again (before polling, consumer index is still 0).

By IBTA spec this is wrong and racey.

> This means that the next ibv_get_cq_event() will wake up immediately, as the CQ
> was armed twice with the same consumer index and the first completion already
> exists. Surely that's not what's meant to happen?

If the driver for this HW stuffs the consumer index then yes, you'd
get a wakeup as though the new entries were delivered instantly after
the arm. If it stuffs the current producer index then you get behavior
like IBTA describes.

I'd say they are both compatible ways to approach this, the app can't
tell the state of the CQ, so it can't know if the new CQEs were
delievered before or after it ARM'd it.

> Do you have a way to verify whether this test gets stuck? Maybe I am
> missing something?

If the mlx5 implementation is doing like you say then it will not get
stuck on that HW, but it is not to spec

> What do you mean by arming a non-empty CQ?

The arm only has meaning if you know the CQ is empty because then you
can reliably catch the empty->!empty transition, which is the whole
point.

If the CQ is non-empty then, by spec, if no new events arrive it will
never be signaled - the app must re-poll it on its own so why arm it?

> The man pages suggest a scheme where the app should call ibv_get_cq_event()
> followed by an ibv_req_notify_cq(), the CQ polling/emptying comes after these,
> so at the time of arm the CQ isn't empty.

It doesn't show how to re-arm it seems

I'm repeating from memory here, I haven't checked the specs, but that
Sean agrees seems like I'm remembering it right :)

The question you seem to be asking is what happens if you re-arm a
non-empty CQ, do you immediately get an event or not? It should be
easy enough to test on siw, rxe and mlx5 and see

I expect by spec arming a non-empty CQ will not generate an event. The
spec was written to support very simple hardware that would simply
generate an event the next time it writes a CQE then atomically clear
the arm.

Does look like a documentation update is in-order though!

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: ibv_req_notify_cq clarification
  2021-02-22 19:24               ` Gal Pressman
  2021-02-22 19:37                 ` Jason Gunthorpe
@ 2021-02-22 22:41                 ` Hefty, Sean
  2021-02-23 12:23                   ` Gal Pressman
  1 sibling, 1 reply; 27+ messages in thread
From: Hefty, Sean @ 2021-02-22 22:41 UTC (permalink / raw)
  To: Gal Pressman, Jason Gunthorpe; +Cc: RDMA mailing list

> > It eliminates races, if the consumer says 'I read up to X send me an
> > interrupt if X+1 exists' when X+1 already exists if there is a race
> > producer has already written it. So send an interrupt.
> 
> Right, that's what I was getting at in my first question, this isn't the next
> completion from the device's perspective.

From the spec:

Requests the CQ event handler be called when the next completion
entry of the specified type is added to the specified CQ. The handler
is called at most once per Request Completion Notification call for a
particular CQ. Any CQ entries that existed before the notify is enabled
will not result in a call to the handler.

It may help to think of it in terms of edge versus level triggered.  The IB spec defines the CQ as behaving similar to an edge triggered object.  I would guess that 99% of developers would successfully write an edge-triggered based application that hangs.  A level triggered wait is much easier to get correct.

> So in such case the consumer index in the arm doorbell is used to indicate what
> should be considered a "new" completion? This is new from the app perspective.

The verbs API only guarantees that the CQ will be signaled when a new completion relative to the HW has been written.  If there's a driver that converts the behavior into a level triggered operation, props to them!  That implementation is far more likely to make incorrectly written applications work than ever break a correctly written one.

> But looking at ibv_ud_pingpong for example, I don't understand how that could
> even work.
> The test arms the CQ on creation (consumder index 0), calls ibv_get_cq_event(),
> wakes up and immediately arms the CQ again (before polling, consumer index is
> still 0).

In the worst case, arming the CQ just results in select/poll/epoll returning immediately the next time it is called.  The thread finds the CQ empty, rearms the CQ again, and select/poll/epoll finally block.

> >> Running this with 32 iterations, the client does something like:
> >> - arm cq
> >> - post send x 32
> >> - wait for cq event
> >> - arm cq
> >> - poll cq (once, with batch size of 16)
> >> - no more post send (reached tot_iters)
> >> - wait for cq event (but an event has already been generated?)
> >
> > I don't know much about perf-test, but in verbs arming a non-empty CQ
> > is asking for trouble
> 
> Do you have a way to verify whether this test gets stuck? Maybe I am missing
> something?

Looking at the code, I agree that it looks racy in a way that could cause it to hang.  My guess is most people running perftests are looking for the best performance numbers, so blocking completions are rarely enabled.  And even if they were, it's still a race condition as to whether the test would hang.

> What do you mean by arming a non-empty CQ?
> The man pages suggest a scheme where the app should call ibv_get_cq_event()
> followed by an ibv_req_notify_cq(), the CQ polling/emptying comes after these,
> so at the time of arm the CQ isn't empty.

The app can do:

Wakeup
While read cqe succeeds
	Process cqe
Read cq event
Arm cq
/* cqe's may have been written between the last read and arming */
While read cqe succeeds
	Process cqe
Sleep

Or shorten this to:

Wakeup
Read cq event
Arm cq
While read cqe succeeds
	Process cqe
Sleep

In both cases, all cqe's must be processed after calling arm, and it's possible to read a cq event only to find the cq empty.  One could argue which is more efficient, but we're talking about a sleeping thread in either case.

- Sean

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 19:37                 ` Jason Gunthorpe
@ 2021-02-23 12:18                   ` Gal Pressman
  2021-02-23 12:38                     ` Jason Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Gal Pressman @ 2021-02-23 12:18 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: RDMA mailing list

On 22/02/2021 21:37, Jason Gunthorpe wrote:
> On Mon, Feb 22, 2021 at 09:24:18PM +0200, Gal Pressman wrote:
>> On 22/02/2021 17:55, Jason Gunthorpe wrote:
>>> On Mon, Feb 22, 2021 at 05:36:17PM +0200, Gal Pressman wrote:
>>>
>>>> "Mellanox HCAs keep track of the last index for which the user received an
>>>> event. Using this index, it is guaranteed that an event is generated immediately
>>>> when a request completion notification is performed and a CQE has already been
>>>> reported."
>>>
>>> I don't think verbs exposes this behavior.
>>
>> So in theory this hardware could generate events that the user app
>> doesn't expect?
> 
> Not really, it shuffles around how race conditions are resolved.
> 
>> But looking at ibv_ud_pingpong for example, I don't understand how
>> that could even work.  The test arms the CQ on creation (consumder
>> index 0), calls ibv_get_cq_event(), wakes up and immediately arms
>> the CQ again (before polling, consumer index is still 0).
> 
> By IBTA spec this is wrong and racey.

Where does it say that?

>> This means that the next ibv_get_cq_event() will wake up immediately, as the CQ
>> was armed twice with the same consumer index and the first completion already
>> exists. Surely that's not what's meant to happen?
> 
> If the driver for this HW stuffs the consumer index then yes, you'd
> get a wakeup as though the new entries were delivered instantly after
> the arm. If it stuffs the current producer index then you get behavior
> like IBTA describes.
> 
> I'd say they are both compatible ways to approach this, the app can't
> tell the state of the CQ, so it can't know if the new CQEs were
> delievered before or after it ARM'd it.

That's not accurate though.
A simple ping app that sends one packet at a time knows the CQ state when it
arms it. The CQ event is for that single sent packet, and the arm should notify
about the next packet it's about to send.

In this case, notifying twice for the same completion is not compatible with the
spec - there are no races in this case.

>> Do you have a way to verify whether this test gets stuck? Maybe I am
>> missing something?
> 
> If the mlx5 implementation is doing like you say then it will not get
> stuck on that HW, but it is not to spec
> 
>> What do you mean by arming a non-empty CQ?
> 
> The arm only has meaning if you know the CQ is empty because then you
> can reliably catch the empty->!empty transition, which is the whole
> point.

Well, you never really know the CQ is empty though. A completion can always
arrive just after you finish polling.

> If the CQ is non-empty then, by spec, if no new events arrive it will
> never be signaled - the app must re-poll it on its own so why arm it?
> 
>> The man pages suggest a scheme where the app should call ibv_get_cq_event()
>> followed by an ibv_req_notify_cq(), the CQ polling/emptying comes after these,
>> so at the time of arm the CQ isn't empty.
> 
> It doesn't show how to re-arm it seems
> 
> I'm repeating from memory here, I haven't checked the specs, but that
> Sean agrees seems like I'm remembering it right :)
> 
> The question you seem to be asking is what happens if you re-arm a
> non-empty CQ, do you immediately get an event or not? It should be
> easy enough to test on siw, rxe and mlx5 and see

rxe seems to get stuck, had some issues running siw and don't have an mlx5 nic
at hand.

> I expect by spec arming a non-empty CQ will not generate an event. The
> spec was written to support very simple hardware that would simply
> generate an event the next time it writes a CQE then atomically clear
> the arm.
> 
> Does look like a documentation update is in-order though!

Looking at libibverbs examples, pyverbs tests and perftest, they all act roughly
the same:

1. arm CQ
2. post send/recv
3. get event
4. arm CQ
5. poll
6. goto step 2

All of these apps always use the same construct of get event followed by an
immediate arm. Did all of these apps get it wrong?

Do you have an example of where it's done right :)?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-22 22:41                 ` Hefty, Sean
@ 2021-02-23 12:23                   ` Gal Pressman
  2021-02-23 20:48                     ` Hefty, Sean
  0 siblings, 1 reply; 27+ messages in thread
From: Gal Pressman @ 2021-02-23 12:23 UTC (permalink / raw)
  To: Hefty, Sean, Jason Gunthorpe; +Cc: RDMA mailing list

On 23/02/2021 0:41, Hefty, Sean wrote:
>>> It eliminates races, if the consumer says 'I read up to X send me an
>>> interrupt if X+1 exists' when X+1 already exists if there is a race
>>> producer has already written it. So send an interrupt.
>>
>> Right, that's what I was getting at in my first question, this isn't the next
>> completion from the device's perspective.
> 
> From the spec:
> 
> Requests the CQ event handler be called when the next completion
> entry of the specified type is added to the specified CQ. The handler
> is called at most once per Request Completion Notification call for a
> particular CQ. Any CQ entries that existed before the notify is enabled
> will not result in a call to the handler.
> 
> It may help to think of it in terms of edge versus level triggered.  The IB spec defines the CQ as behaving similar to an edge triggered object.  I would guess that 99% of developers would successfully write an edge-triggered based application that hangs.  A level triggered wait is much easier to get correct.

Agree, but a level triggered wait contradicts the
"Any CQ entries that existed before the notify is enabled will not result in a
call to the handler."

>> So in such case the consumer index in the arm doorbell is used to indicate what
>> should be considered a "new" completion? This is new from the app perspective.
> 
> The verbs API only guarantees that the CQ will be signaled when a new completion relative to the HW has been written.  If there's a driver that converts the behavior into a level triggered operation, props to them!  That implementation is far more likely to make incorrectly written applications work than ever break a correctly written one.

I agree that this behavior is most likely better, but it contradicts the api
agreement.
It seems like you should write your app according to the provider you're using
in order to get it right.

>> But looking at ibv_ud_pingpong for example, I don't understand how that could
>> even work.
>> The test arms the CQ on creation (consumder index 0), calls ibv_get_cq_event(),
>> wakes up and immediately arms the CQ again (before polling, consumer index is
>> still 0).
> 
> In the worst case, arming the CQ just results in select/poll/epoll returning immediately the next time it is called.  The thread finds the CQ empty, rearms the CQ again, and select/poll/epoll finally block.
> 
>>>> Running this with 32 iterations, the client does something like:
>>>> - arm cq
>>>> - post send x 32
>>>> - wait for cq event
>>>> - arm cq
>>>> - poll cq (once, with batch size of 16)
>>>> - no more post send (reached tot_iters)
>>>> - wait for cq event (but an event has already been generated?)
>>>
>>> I don't know much about perf-test, but in verbs arming a non-empty CQ
>>> is asking for trouble
>>
>> Do you have a way to verify whether this test gets stuck? Maybe I am missing
>> something?
> 
> Looking at the code, I agree that it looks racy in a way that could cause it to hang.  My guess is most people running perftests are looking for the best performance numbers, so blocking completions are rarely enabled.  And even if they were, it's still a race condition as to whether the test would hang.
> 
>> What do you mean by arming a non-empty CQ?
>> The man pages suggest a scheme where the app should call ibv_get_cq_event()
>> followed by an ibv_req_notify_cq(), the CQ polling/emptying comes after these,
>> so at the time of arm the CQ isn't empty.
> 
> The app can do:
> 
> Wakeup
> While read cqe succeeds
> 	Process cqe
> Read cq event
> Arm cq
> /* cqe's may have been written between the last read and arming */
> While read cqe succeeds
> 	Process cqe
> Sleep
> 
> Or shorten this to:
> 
> Wakeup
> Read cq event
> Arm cq
> While read cqe succeeds
> 	Process cqe
> Sleep
> 
> In both cases, all cqe's must be processed after calling arm, and it's possible to read a cq event only to find the cq empty.  One could argue which is more efficient, but we're talking about a sleeping thread in either case.

Yea, this seems correct.
But as I said in my reply to Jason, libibverbs examples, pyverbs tests and
perftest all go from "Read cq event" to "Arm cq" immediately.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: ibv_req_notify_cq clarification
  2021-02-23 12:18                   ` Gal Pressman
@ 2021-02-23 12:38                     ` Jason Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2021-02-23 12:38 UTC (permalink / raw)
  To: Gal Pressman; +Cc: RDMA mailing list

On Tue, Feb 23, 2021 at 02:18:31PM +0200, Gal Pressman wrote:
> > Does look like a documentation update is in-order though!
> 
> Looking at libibverbs examples, pyverbs tests and perftest, they all act roughly
> the same:
> 
> 1. arm CQ
> 2. post send/recv
> 3. get event
> 4. arm CQ
> 5. poll
> 6. goto step 2
> 
> All of these apps always use the same construct of get event followed by an
> immediate arm. Did all of these apps get it wrong?

Sean is right, this is OK, so long as step 5 polls to empty, it is
just unwound differently and will take suprious wakeups more often

Jason

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: ibv_req_notify_cq clarification
  2021-02-23 12:23                   ` Gal Pressman
@ 2021-02-23 20:48                     ` Hefty, Sean
  0 siblings, 0 replies; 27+ messages in thread
From: Hefty, Sean @ 2021-02-23 20:48 UTC (permalink / raw)
  To: Gal Pressman, Jason Gunthorpe; +Cc: RDMA mailing list

> I agree that this behavior is most likely better, but it contradicts the api
> agreement.
> It seems like you should write your app according to the provider you're using
> in order to get it right.

The app should code for edge triggered behavior for correctness.  I'm saying if the NIC actually uses level triggering (I don't know if they do or not), the app can't tell and will just work.

> > The app can do:
> >
> > Wakeup
> > While read cqe succeeds
> > 	Process cqe
> > Read cq event
> > Arm cq
> > /* cqe's may have been written between the last read and arming */
> > While read cqe succeeds
> > 	Process cqe
> > Sleep
> >
> > Or shorten this to:
> >
> > Wakeup
> > Read cq event
> > Arm cq
> > While read cqe succeeds
> > 	Process cqe
> > Sleep
> >
> > In both cases, all cqe's must be processed after calling arm, and it's possible to
> read a cq event only to find the cq empty.  One could argue which is more efficient,
> but we're talking about a sleeping thread in either case.
> 
> Yea, this seems correct.
> But as I said in my reply to Jason, libibverbs examples, pyverbs tests and
> perftest all go from "Read cq event" to "Arm cq" immediately.

Both options provide correct behavior, regardless of how signaling works.  The issue is that perftest does:

For I = 1 to 16
	If read cqe succeeds
		Process cqe

Instead of

While read cqe succeeds
	Process cqe

Perftest will only work if it's lucky or with level signaling.  I would not assume perftest is correct just because no one has reported the problem before.  :)

- Sean

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-02-23 20:49 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-18  9:13 ibv_req_notify_cq clarification Gal Pressman
2021-02-18 12:38 ` Bernard Metzler
2021-02-18 12:47   ` Gal Pressman
2021-02-18 13:59   ` Bernard Metzler
2021-02-18 12:53 ` Jason Gunthorpe
2021-02-18 15:52   ` Gal Pressman
2021-02-18 16:23     ` Jason Gunthorpe
2021-02-18 22:22       ` Tom Talpey
2021-02-18 22:51         ` Jason Gunthorpe
2021-02-18 23:07           ` Tom Talpey
2021-02-19  0:45             ` Jason Gunthorpe
2021-02-19 14:31               ` Tom Talpey
2021-02-19 14:42                 ` Jason Gunthorpe
2021-02-21  9:25       ` Gal Pressman
2021-02-22 13:46         ` Jason Gunthorpe
2021-02-22 15:36           ` Gal Pressman
2021-02-22 15:55             ` Jason Gunthorpe
2021-02-22 19:24               ` Gal Pressman
2021-02-22 19:37                 ` Jason Gunthorpe
2021-02-23 12:18                   ` Gal Pressman
2021-02-23 12:38                     ` Jason Gunthorpe
2021-02-22 22:41                 ` Hefty, Sean
2021-02-23 12:23                   ` Gal Pressman
2021-02-23 20:48                     ` Hefty, Sean
2021-02-22 18:38             ` Hefty, Sean
2021-02-22 19:26               ` Gal Pressman
2021-02-18 18:47     ` Bernard Metzler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.