* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-24 17:52     ` Jason Gunthorpe
  0 siblings, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-24 17:52 UTC (permalink / raw)
  To: cait
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel

On Tue, Nov 24, 2015 at 10:08:39AM -0700, cait@asomi.com wrote:
>    >> My recollection of the IB verbs is that they were unlikely to have
>    >> overlooked something like that. If it did slip through then there
>    >> should be an errata.
>    >verbs reflects the wire protocol and the wire protocol has no way to
>    >create a linkage between the send and recv sides of a RC QP. It is not
>    >a spec bug, there is no errata.
> 
>    No, verbs do not reflect the link layer. Verbs fulfill the contract
>    with the end user - especially when the hardware cannot.

Are you serious? You think verbs should specify things that cannot
actually be implemented? IB verbs don't do that.

>    Reporting a completion of a receive without having guaranteed
>    that the acknowledgement will be sent before anything
>    subsequently submitted is failing to implement a reliable
>    connection.

As I've said repeatedly, the side receiving the request has no impact
on this issue. Requiring it to generate an ack before delivering the
recv completion *does nothing* to guarantee ordering on the other
side in IB. The network can always drop the ack, and the receiver
cannot tell.

I disagree that RC has anything to do with this causal ordering issue,
this is not a property described in the IB spec for RC.

>    One of the hardware designs I worked with over a decade ago
>    actually had *two* completion queues, one from the left brain (receive)
>    and the other from the right brain (transmit). Nevertheless we
>    presented

All verbs hardware has to have these two flows.

>    a unified completion queue that followed ordering rules. We were not
>    "reflecting" the hardware, we were implementing an interface in a
>    combination of hardware and software.

And how did you detect lost acks at the receiver to make this an
actual guarantee? Maybe TCP can do this, but IB cannot.

I honestly don't know why you think verbs has this requirement. It is
not in the IB spec, and can't be implemented by IB hardware. If
one of the other flavours of verbs has this, then fine, but it is not
part of the *Linux* verbs flavour.

> I cannot think of a reason why that would not work here. Simply do
> not report a completion to the user before guaranteeing that the
> acks will be transmitted.

I've already explained three times why this is not enough for IB.

>    What would you do if the next user action had been to close the
>    connection?  Would the acks have been lost?

No acks will be lost. The IB termination process requires the closing
side to drain its sendq and stop injecting new traffic before sending
the CM close message. The other side must also drain its sendq before
replying. This handshake ensures the QP is quiescent and all in-flight
traffic has completed before the QP is destroyed.
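The close handshake described here can be sketched as a toy model (hypothetical and heavily simplified, using DREQ/DREP as shorthand for the CM disconnect request/reply messages):

```python
# Toy model of the CM close handshake (hypothetical, simplified): the
# closing side drains its sendq before sending the disconnect request
# (DREQ); the peer drains its own sendq before replying (DREP); only
# then is the QP destroyed, so no in-flight acks can be lost.

def cm_close(closer_sendq, peer_sendq):
    events = []
    while closer_sendq:                 # closing side drains first
        events.append("closer completes " + closer_sendq.pop(0))
    events.append("DREQ")               # CM close message
    while peer_sendq:                   # peer drains before replying
        events.append("peer completes " + peer_sendq.pop(0))
    events.append("DREP")               # CM close reply
    events.append("destroy QP")         # QP is quiescent by now
    return events

seq = cm_close(["send1"], ["reply1"])
# destroy comes strictly after every queued send on both sides completed
assert seq.index("destroy QP") == len(seq) - 1
```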

IB basically shifts the lingering close into the CM arena, where it
costs less, and allows the QP and associated HW resources to be
destroyed quickly.

Jason


* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-25  6:21           ` Jason Gunthorpe
  0 siblings, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-25  6:21 UTC (permalink / raw)
  To: Caitlin Bestler
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel

On Tue, Nov 24, 2015 at 06:47:14PM -0800, Caitlin Bestler wrote:
>         Acknowledge packet for the last packet of the request has been
>         committed to the wire (including the appropriate fields for RDMA
>         READ response).

> The target side cannot generate a response before it receives the request.
> The target side verbs cannot generate the completion before the
> acknowledge packet has been committed to the wire.

Sure, but, as I keep saying, the responder behavior is largely
irrelevant to what the requestor side observes.

> Therefore the initiating side cannot receive a response before the write
> operation has completed.

Wrong. The ladder diagram would be:

Requestor            Responder            Responder Verbs

SEND1 ------------>  Process
          X<-------  ACK (lost)
                               -------->  recv1 CQE
                               <--------  send2 WQE
recv2 CQE <--------  SEND2 packet
[..]
send1 CQE <--------  ACK (resent)

The Ack may be lost, causing the send CQE to arrive after the recv
CQE, even though the responder did everything in a specific order.

The fundamental issue is that the responder cannot detect the lost
ACK. The PSN of the ACK packet is part of the Requestor's PSN space,
not part of the Responder's:

 9.7.5.1.1 GENERATING PSNs FOR ACKNOWLEDGE MESSAGES

 C9-95: For responses to SEND requests or RDMA WRITE requests the
  responder shall insert in the PSN field of the response the PSN of the
  most recent request packet being acknowledged.

Or stated another way, the value of the AckReq bit in SEND1 has no
impact on the contents of the SEND2 packet - thus there is no way for
the requestor to detect the loss of the ACK and hold off delivering
recv2 CQE.
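The sequencing argument above is easy to reproduce with a toy event model (hypothetical; this is not verbs code, only the ordering logic): the responder emits the ACK before SEND2, but once the network drops that ACK, the requestor's CQ unavoidably delivers recv2 before send1.

```python
# Toy model of the lost-ACK ladder diagram (hypothetical, not real IB
# verbs code). The responder transmits ACK(SEND1) and then SEND2; if
# the network drops the ACK, the requestor delivers the recv2 CQE
# before the send1 CQE, and the responder cannot detect or prevent it.

def run(drop_first_ack):
    cq = []                                # requestor CQ, delivery order
    wire = [("ACK", "send1"), ("DATA", "recv2")]

    for kind, name in wire:
        if kind == "ACK" and drop_first_ack:
            drop_first_ack = False         # network drops this one copy
            continue
        cq.append(name + " CQE")

    if "send1 CQE" not in cq:              # requestor times out; the
        cq.append("send1 CQE")             # responder resends the ACK
    return cq

print(run(drop_first_ack=False))           # ['send1 CQE', 'recv2 CQE']
print(run(drop_first_ack=True))            # ['recv2 CQE', 'send1 CQE']
```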

Jason


* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-24 12:52                                                 ` Tom Talpey
  0 siblings, 0 replies; 61+ messages in thread
From: Tom Talpey @ 2015-11-24 12:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Caitlin Bestler
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel

On 11/24/2015 2:03 AM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 06:35:28PM -0800, Caitlin Bestler wrote:
>> Are there actual HCAs that make this mistake?
>
> All IB HCAs have this behavior and require apps to see a send CQ
> completion before making any statements about the state of the send Q
> or buffers handed over to the HCA. Tom and I have seen this in real
> systems under proper stress conditions. [Which is why I am so certain
> about this, because when I first hit it years ago I dug into the spec
> and figured out it was not a HW bug I was looking at]

To be clear, I saw the reply-completion-before-request-completion on
Windows, not Linux, but the principle is identical. It's simply a
fact of life on a multiprocessor, unless you want to throw in locks
and synchronization rules that drivers have to follow to enforce
ordered completions across queues. Which trust me, you don't.

In Windows SMB Direct, we added reference counts around pretty much
every verb interaction associated with each upper layer operation,
and did not retire them until all refcounts went to zero. It is
excruciatingly correct yet performs incredibly well.
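That scheme can be sketched roughly as follows (a hypothetical minimal model, not the actual SMB Direct code): each verb interaction takes a reference on the upper-layer operation, and the operation retires only once every reference has been dropped, in whatever order the completions arrive.

```python
# Hypothetical sketch of per-operation refcounting: the operation is
# retired only when every verb interaction (posted send, send CQE,
# reply CQE, ...) has dropped its reference, regardless of ordering.

class Operation:
    def __init__(self):
        self.refs = 0
        self.retired = False

    def get(self):                 # a verb interaction begins
        self.refs += 1

    def put(self):                 # its completion arrives
        assert self.refs > 0
        self.refs -= 1
        if self.refs == 0:
            self.retired = True    # safe: nothing is outstanding

op = Operation()
op.get()          # reference held for the posted send
op.get()          # reference held for the expected reply

op.put()          # recv CQE (reply) arrives first ...
assert not op.retired             # ... send CQE still outstanding
op.put()          # send CQE arrives
assert op.retired                 # only now may buffers be reused
```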

Tom.


* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-24  7:03                                             ` Jason Gunthorpe
  0 siblings, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-24  7:03 UTC (permalink / raw)
  To: Caitlin Bestler
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 06:35:28PM -0800, Caitlin Bestler wrote:

> Is it possible for an IB HCA to transmit a response on a QP and not
> in that packet or a previous packet acknowledge something that it
> has delivered to the user?

AFAIK, the rules of ack coalescing do not interact with the send
side. Even if they did, that is the wrong place to look.

We must look at the receiver. Ordered ack,data on the wire may suffer
a packet loss and the ack may not reach the receiver. In this case,
can the receiver detect the lost ack and not progress the data? For
IB, it cannot. The ack sequencing is part of the transmitter's recv
FSM, and does not interact with the send FSM.

I feel this is a deliberate IB design choice to be simple and efficient
in hardware.

> My recollection of the IB verbs is that they were unlikely to have
> overlooked something like that. If it did slip through then there
> should be an errata.

verbs reflects the wire protocol and the wire protocol has no way to
create a linkage between the send and recv sides of a RC QP. It is not
a spec bug, there is no errata.

> But regardless of specification lawyering, is this an implementation
> issue?

All IB implementations have no choice but to act this way - the wire
protocol provides no way to guarantee ack vs data sequencing at the
receiver, so there is simply no way to guarantee the sendq advances
strictly causally with the recvq.

> Are there actual HCAs that make this mistake?

All IB HCAs have this behavior and require apps to see a send CQ
completion before making any statements about the state of the send Q
or buffers handed over to the HCA. Tom and I have seen this in real
systems under proper stress conditions. [Which is why I am so certain
about this, because when I first hit it years ago I dug into the spec
and figured out it was not a HW bug I was looking at]

This is a direct consequence of how IB runs the ACK protocol.

Jason


* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-24  2:35                                         ` Caitlin Bestler
  0 siblings, 0 replies; 61+ messages in thread
From: Caitlin Bestler @ 2015-11-24  2:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel



On 11/23/2015 4:00 PM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 03:30:42PM -0800, Caitlin Bestler wrote:
>>     The receive completion can be safely assumed to indicate transmit
>>     completion over a reliable connection unless your peer has gone
>>     completely bonkers and is replying to a command that it did not
>>     receive.
> Perhaps iWarp is different and does specify this ordering but IB does
> not.
>
> The issue with IB is how the ACK protocol is designed. There is not
> strong ordering between ACKs and data transfers. A HCA can send
> ACK,DATA and the network could drop the ACK. The receiver side does
> not know the ACK was lost and goes ahead to process DATA.
>
> Since only ACK advances the sendq and DATA advances the recvq it is
> trivial to get a case where the recvq is advanced with a reply while
> the sendq continues to wait for the ACK to be resent.
>
> Further IB allows ACK coalescing and has no rules for how an ACK is
> placed. It is entirely valid for a HCA to RECV,REPLY,ACK - for
> instance.
>
>
Is it possible for an IB HCA to transmit a response on a QP and not, in
that packet or a previous packet, acknowledge something that it has
delivered to the user?

My recollection of the IB verbs is that they were unlikely to have
overlooked something like that. If it did slip through then there
should be an errata.

But regardless of specification lawyering, is this an implementation
issue? Are there actual HCAs that make this mistake?



* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-24  0:40                                             ` Jason Gunthorpe
  0 siblings, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-24  0:40 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Caitlin Bestler, Bart Van Assche, Christoph Hellwig, linux-rdma,
	sagig, axboe, linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 07:34:53PM -0500, Tom Talpey wrote:

> Been there, seen that. Bluescreened on it, mysteriously.

Yes, me too :(

Jason


* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-24  0:34                                         ` Tom Talpey
  0 siblings, 0 replies; 61+ messages in thread
From: Tom Talpey @ 2015-11-24  0:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Caitlin Bestler
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel

On 11/23/2015 7:00 PM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 03:30:42PM -0800, Caitlin Bestler wrote:
>>     The receive completion can be safely assumed to indicate transmit
>>     completion over a reliable connection unless your peer has gone
>>     completely bonkers and is replying to a command that it did not
>>     receive.
>
> Perhaps iWarp is different and does specify this ordering but IB does
> not.

iWARP is not different. The situation you (Jason) describe has
nothing to do with the transport. It has everything to do, as you
point out, with the lack of causality between send and receive
completions.

It is entirely possible for the reply to be received before the
send is fully processed. For example, the send might be issued
on one core, and that core scheduled away before the completion
for the send is ready. In the meantime, the request goes on
the wire, the target processes it and replies, and the reply
is processed. Boom, the send queue completion is still pending.
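That race can be modeled deterministically (hypothetical sketch): both completions may already be sitting in their CQs, but if the core that posted the send is scheduled away, software processes the reply's receive completion first.

```python
# Toy model of the scheduling race (hypothetical): the hardware has
# produced both completions, but core0 (which posted the send) is
# scheduled away, so core1 processes the reply completion first.

processed = []

def reap(cq, who):
    while cq:
        processed.append(who + " reaps " + cq.pop(0))

send_cq = ["send CQE"]            # already produced by the hardware
recv_cq = ["recv CQE (reply)"]

reap(recv_cq, "core1")            # core0 is scheduled away, so the
reap(send_cq, "core0")            # reply completion is handled first

assert processed == ["core1 reaps recv CQE (reply)", "core0 reaps send CQE"]
```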

Been there, seen that. Bluescreened on it, mysteriously.

A really good way to see this is with software providers, btw.
Try it with soft{roce,iwarp}, under heavy load.

Tom.

>
> The issue with IB is how the ACK protocol is designed. There is not
> strong ordering between ACKs and data transfers. A HCA can send
> ACK,DATA and the network could drop the ACK. The receiver side does
> not know the ACK was lost and goes ahead to process DATA.
>
> Since only ACK advances the sendq and DATA advances the recvq it is
> trivial to get a case where the recvq is advanced with a reply while
> the sendq continues to wait for the ACK to be resent.
>
> Further IB allows ACK coalescing and has no rules for how an ACK is
> placed. It is entirely valid for a HCA to RECV,REPLY,ACK - for
> instance.
>
>>     I actually had a bug in an early iWARP emulation where the simulated
>>     peer, because it was simulated, responded
>>     instantly. The result was a TCP segment that both acked the
>>     transmission *and* contained the reply. The bug was
>>     that the code processed the reception before the transmission ack,
>>     causing the receive completion to be placed
>>     on the completion queue before transmit completion.
>
> I don't know if iWARP has the same lax ordering as IB, but certainly,
> what you describe is legal for IB verbs to do, and our kernel ULPs
> have to cope with it.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
       [not found]                                 ` <B24F4DDE-709A-4D2D-8B26-4E83325DBB1A@asomi.com>
@ 2015-11-24  0:00                                   ` Jason Gunthorpe
       [not found]                                     ` <20151124000011.GA9301-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-24  0:00 UTC (permalink / raw)
  To: Caitlin Bestler
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, sagig, axboe,
	linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 03:30:42PM -0800, Caitlin Bestler wrote:
>    The receive completion can be safely assumed to indicate transmit
>    completion over a reliable connection unless your peer has gone
>    completely bonkers and is replying to a command that it did not
>    receive.

Perhaps iWarp is different and does specify this ordering but IB does
not.

The issue with IB is how the ACK protocol is designed. There is no
strong ordering between ACKs and data transfers. A HCA can send
ACK,DATA and the network could drop the ACK. The receiver side does
not know the ACK was lost and goes ahead to process DATA.

Since only ACK advances the sendq and DATA advances the recvq it is
trivial to get a case where the recvq is advanced with a reply while
the sendq continues to wait for the ACK to be resent.

Further IB allows ACK coalescing and has no rules for how an ACK is
placed. It is entirely valid for a HCA to RECV,REPLY,ACK - for
instance.

>    I actually had a bug in an early iWARP emulation where the simulated
>    peer, because it was simulated, responded
>    instantly. The result was a TCP segment that both acked the
>    transmission *and* contained the reply. The bug was
>    that the code processed the reception before the transmission ack,
>    causing the receive completion to be placed
>    on the completion queue before transmit completion.

I don't know if iWARP has the same lax ordering as IB, but certainly,
what you describe is legal for IB verbs to do, and our kernel ULPs
have to cope with it.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 22:33                             ` Bart Van Assche
@ 2015-11-23 23:06                               ` Jason Gunthorpe
       [not found]                                 ` <B24F4DDE-709A-4D2D-8B26-4E83325DBB1A@asomi.com>
  0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-23 23:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 02:33:05PM -0800, Bart Van Assche wrote:
> On 11/23/2015 02:18 PM, Jason Gunthorpe wrote:
> >On Mon, Nov 23, 2015 at 01:54:05PM -0800, Bart Van Assche wrote:
> >What I don't see is how SRP handles things when the
> >sendq fills up, ie the case where __srp_get_tx_iu() == NULL. It looks
> >like the driver starts to panic and generates printks. I can't tell if
> >it can survive that, but it doesn't look very good..
> 
> Hello Jason,
> 
> From srp_cm_rep_handler():
> 
> 		target->scsi_host->can_queue
> 			= min(ch->req_lim - SRP_TSK_MGMT_SQ_SIZE,
> 			      target->scsi_host->can_queue);
> 
> In other words, the SCSI core is told to ensure that the number of
> outstanding SCSI commands is one less than the number of elements in the
> ch->free_tx list. And since the SRP initiator serializes task management
> requests it is guaranteed that __srp_get_tx_iu() won't fail due to
> ch->free_tx being empty.

I realize that, and as I already explained, SRP cannot drive the sendq
flow control from the recv side.

The SCSI core considers the command complete and will issue a new
command as soon as the recv completion associated with the command is
returned. (ie when the remote responds)

This *DOES NOT* say anything about the state of the sendq: it does not
guarantee there is a send CQ entry available for the associated send,
and it does not guarantee there is space available in the sendq. Verbs
DOES NOT make ordering guarantees between queues, even if the queues
are causally related.

This is an important point in verbs and it is commonly done wrong.

So, yes, __srp_get_tx_iu absolutely can fail due to ch->free_tx being
empty, even though by observing the recv side SRP has inferred that
the sendq should have space.

Every ULP has to cope with this, and a direct poll API that doesn't
account for the need to block on a predicate is broken by design.
This is why I object.

'ib_poll_cq_until' would correct all of this.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 22:18                           ` Jason Gunthorpe
@ 2015-11-23 22:33                             ` Bart Van Assche
  2015-11-23 23:06                               ` Jason Gunthorpe
  0 siblings, 1 reply; 61+ messages in thread
From: Bart Van Assche @ 2015-11-23 22:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On 11/23/2015 02:18 PM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 01:54:05PM -0800, Bart Van Assche wrote:
> What I don't see is how SRP handles things when the
> sendq fills up, ie the case where __srp_get_tx_iu() == NULL. It looks
> like the driver starts to panic and generates printks. I can't tell if
> it can survive that, but it doesn't look very good..

Hello Jason,

 From srp_cm_rep_handler():

		target->scsi_host->can_queue
			= min(ch->req_lim - SRP_TSK_MGMT_SQ_SIZE,
			      target->scsi_host->can_queue);

In other words, the SCSI core is told to ensure that the number of 
outstanding SCSI commands is one less than the number of elements in the 
ch->free_tx list. And since the SRP initiator serializes task management 
requests it is guaranteed that __srp_get_tx_iu() won't fail due to 
ch->free_tx being empty.

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 21:54                         ` Bart Van Assche
@ 2015-11-23 22:18                           ` Jason Gunthorpe
  2015-11-23 22:33                             ` Bart Van Assche
  0 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-23 22:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 01:54:05PM -0800, Bart Van Assche wrote:

> Not really ... Please have a look at the SRP initiator source code. What the
> SRP initiator does is to poll the send queue before sending a new
> SCSI

I see that. What I don't see is how SRP handles things when the
sendq fills up, ie the case where __srp_get_tx_iu() == NULL. It looks
like the driver starts to panic and generates printks. I can't tell if
it can survive that, but it doesn't look very good..

It would be a lot better if this wasn't allowed to happen: the polling
loop could run until a sendq slot becomes available, and never return
NULL from __srp_get_tx_iu().

Ie, __srp_get_tx_iu should look more like

   ib_poll_cq_until(..., !list_empty(&ch->free_tx));

Which would be a fairly sane core API for this direct usage.. Ideally
the core code would sleep if possible and not just spin. Maybe also
protect it with a timer.

> command to the target system. I think this approach could also be
> used in other ULP drivers if the send queue poll frequency is such that no
> send queue overflow occurs.

Yes, I agree, but it has to be done properly :)

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 21:28                       ` Jason Gunthorpe
@ 2015-11-23 21:54                         ` Bart Van Assche
  2015-11-23 22:18                           ` Jason Gunthorpe
  0 siblings, 1 reply; 61+ messages in thread
From: Bart Van Assche @ 2015-11-23 21:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On 11/23/2015 01:28 PM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 01:04:25PM -0800, Bart Van Assche wrote:
>
>> Considerable time ago the send queue in the SRP initiator driver was
>> modified from signaled to non-signaled to reduce the number of interrupts
>> triggered by the SRP initiator driver. The SRP initiator driver polls the
>> send queue every time before a SCSI command is sent to the target. I think
>> this is a pattern that is also useful for other ULP's so I'm not convinced
>> that ib_process_cq_direct() should be deprecated :-)
>
> As I explained, that is a fine idea, but I can't see how SRP is able
> to correctly do sendq flow control without spinning on the poll, which
> it does not do.
>
> I'm guessing SRP is trying to drive sendq flow control from the recv
> side, like NFS was. This is wrong and should not be part of the common
> API.
>
> Does that make sense?

Not really ... Please have a look at the SRP initiator source code. What
the SRP initiator does is to poll the send queue before sending a new
SCSI command to the target system. I think this approach could also be
used in other ULP drivers if the send queue poll frequency is such that
no send queue overflow occurs.

Bart.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 21:04                       ` Bart Van Assche
  (?)
@ 2015-11-23 21:28                       ` Jason Gunthorpe
  2015-11-23 21:54                         ` Bart Van Assche
  -1 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-23 21:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 01:04:25PM -0800, Bart Van Assche wrote:

> Considerable time ago the send queue in the SRP initiator driver was
> modified from signaled to non-signaled to reduce the number of interrupts
> triggered by the SRP initiator driver. The SRP initiator driver polls the
> send queue every time before a SCSI command is sent to the target. I think
> this is a pattern that is also useful for other ULP's so I'm not convinced
> that ib_process_cq_direct() should be deprecated :-)

As I explained, that is a fine idea, but I can't see how SRP is able
to correctly do sendq flow control without spinning on the poll, which
it does not do.

I'm guessing SRP is trying to drive sendq flow control from the recv
side, like NFS was. This is wrong and should not be part of the common
API.

Does that make sense?

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 20:37                     ` Jason Gunthorpe
@ 2015-11-23 21:04                       ` Bart Van Assche
  -1 siblings, 0 replies; 61+ messages in thread
From: Bart Van Assche @ 2015-11-23 21:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: Bart Van Assche, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On 11/23/2015 12:37 PM, Jason Gunthorpe wrote:
> On Sat, Nov 14, 2015 at 08:13:44AM +0100, Christoph Hellwig wrote:
>> On Fri, Nov 13, 2015 at 03:06:36PM -0700, Jason Gunthorpe wrote:
>>> Looking at that thread and then at the patch a bit more..
>>>
>>> +void ib_process_cq_direct(struct ib_cq *cq)
>>> [..]
>>> +	__ib_process_cq(cq, INT_MAX);
>>>
>>> INT_MAX is not enough, it needs to loop.
>>> This is missing a ib_req_notify also.
>>
>> No.  Direct cases _never_ call ib_req_notify.  It's for the case where
>> SRP polls the send CQ only from the same context it sends from,
>> without any interrupt notification at all.
>
> Hurm, okay, that is not at all what I was thinking this was for..
>
> So the only use of this function is to drain a send cq, in a state
> where it is guaranteed no new entries can be added, and only if the cq
> is not already event driven. I'd stick those notes in the comment..
>
> Hum. I wonder if this is even a reasonable way to run a ULP. It is
> important that rx completions are not used to drive reaping of
> resources that are still committed to the send queue. ie do not
> trigger send buffer reuse based on a rx completion.
>
> So, if a ULP uses this API, how does it handle the sendq becoming
> full? As above, a ULP cannot use recvs to infer available sendq
> space. It must directly reap the sendq. So a correct ULP would have to
> spin calling ib_process_cq_direct until it makes enough progress to
> add more things to the sendq. I don't obviously see that in SRP - so
> I'm guessing it has buggered up sendq flow control?
>
> NFS had similar problems lately too, I wrote a long explanation to
> Chuck on this subject.
>
> That said, the demand poll almost seems like a reasonable way for a
> ULP to run the sendq, do the polls on send occasionally or when more
> space is needed to better amortize the reaping overhead at the cost of
> send latency. But API wise it needs to be able to switch over to a
> sleep if enough progress hasn't been made.
>
> So.. maybe also add to the comment that ib_process_cq_direct is
> deprecated and should not be used in new code until SRP gets sorted?

Hello Jason,

Considerable time ago the send queue in the SRP initiator driver was 
modified from signaled to non-signaled to reduce the number of 
interrupts triggered by the SRP initiator driver. The SRP initiator 
driver polls the send queue every time before a SCSI command is sent to 
the target. I think this is a pattern that is also useful for other 
ULP's so I'm not convinced that ib_process_cq_direct() should be 
deprecated :-)

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-23 20:01           ` Jason Gunthorpe
@ 2015-11-23 20:57                 ` Christoph Hellwig
  0 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-23 20:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma, sagig, bart.vanassche, axboe,
	linux-scsi, linux-kernel

On Mon, Nov 23, 2015 at 01:01:36PM -0700, Jason Gunthorpe wrote:
> Okay, having now read the whole thing, I think I see the flow now. I don't
> see any holes in the above, other than it is doing a bit more work
> than it needs in some edges cases because it doesn't know if the CQ is
> actually empty or not.
> 
> > > > +	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> > > > +	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> > > > +	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> > > > +		queue_work(ib_comp_wq, &cq->work);
> > > 
> > > Same comment here..
> > 
> > Same here - we only requeue the work item if either we processed all of
> > our budget, or ib_req_notify_cq with IB_CQ_REPORT_MISSED_EVENTS told
> > us that we need to poll again.
> 
> I find the if construction hard to read, but yes, it looks OK.

If you've got suggestions to improve the flow please send them my way.
I'm not entirely happy with it, but couldn't come up with a better idea.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-14  7:13                 ` Christoph Hellwig
@ 2015-11-23 20:37                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-23 20:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On Sat, Nov 14, 2015 at 08:13:44AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 13, 2015 at 03:06:36PM -0700, Jason Gunthorpe wrote:
> > Looking at that thread and then at the patch a bit more..
> > 
> > +void ib_process_cq_direct(struct ib_cq *cq)
> > [..]
> > +	__ib_process_cq(cq, INT_MAX);
> > 
> > INT_MAX is not enough, it needs to loop.
> > This is missing a ib_req_notify also.
> 
> No.  Direct cases _never_ call ib_req_notify.  It's for the case where
> SRP polls the send CQ only from the same context it sends from,
> without any interrupt notification at all.

Hurm, okay, that is not at all what I was thinking this was for..

So the only use of this function is to drain a send cq, in a state
where it is guaranteed no new entries can be added, and only if the cq
is not already event driven. I'd stick those notes in the comment..

Hum. I wonder if this is even a reasonable way to run a ULP. It is
important that rx completions are not used to drive reaping of
resources that are still committed to the send queue. ie do not
trigger send buffer reuse based on a rx completion.

So, if a ULP uses this API, how does it handle the sendq becoming
full? As above, a ULP cannot use recvs to infer available sendq
space. It must directly reap the sendq. So a correct ULP would have to
spin calling ib_process_cq_direct until it makes enough progress to
add more things to the sendq. I don't obviously see that in SRP - so
I'm guessing it has buggered up sendq flow control?

NFS had similar problems lately too, I wrote a long explanation to
Chuck on this subject.

That said, the demand poll almost seems like a reasonable way for a
ULP to run the sendq, do the polls on send occasionally or when more
space is needed to better amortize the reaping overhead at the cost of
send latency. But API wise it needs to be able to switch over to a
sleep if enough progress hasn't been made.

So.. maybe also add to the comment that ib_process_cq_direct is
deprecated and should not be used in new code until SRP gets sorted?

> > Perhaps ib_req_notify_cq should be folded into __ib_process_cq, then
> > it can trivially honour the budget on additional loops from
> > IB_CQ_REPORT_MISSED_EVENTS.
> 
> Which also defeats this proposal.

Just ignore my remarks here; someone should do a benchmark to see if
we are hitting the edge cases of extra spins around the various loops
before reworking this. We can't trivially hoist ib_req_notify_cq into
__ib_process_cq because of how it needs to be ordered with
irq_poll_complete.

Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-14  7:08           ` Christoph Hellwig
  (?)
@ 2015-11-23 20:01           ` Jason Gunthorpe
       [not found]             ` <20151123200136.GA5640-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  -1 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-23 20:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma, sagig, bart.vanassche, axboe, linux-scsi, linux-kernel

On Sat, Nov 14, 2015 at 08:08:49AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 13, 2015 at 11:25:13AM -0700, Jason Gunthorpe wrote:
> > For instance, like this, not fully draining the cq and then doing:
> > 
> > > +	completed = __ib_process_cq(cq, budget);
> > > +	if (completed < budget) {
> > > +		irq_poll_complete(&cq->iop);
> > > +		if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
> > 
> > Doesn't seem entirely right? There is no point in calling
> > ib_req_notify_cq if the code knows there is still stuff in the CQ and
> > has already, independently, arranged for ib_poll_handler to be
> > called.
> 
> The code only calls ib_req_notify_cq if it knows we finished earlier than
> our budget.

Okay, having now read the whole thing, I think I see the flow now. I don't
see any holes in the above, other than it is doing a bit more work
than it needs in some edge cases because it doesn't know if the CQ is
actually empty or not.

> > > +	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> > > +	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> > > +	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> > > +		queue_work(ib_comp_wq, &cq->work);
> > 
> > Same comment here..
> 
> Same here - we only requeue the work item if either we processed all of
> our budget, or ib_req_notify_cq with IB_CQ_REPORT_MISSED_EVENTS told
> us that we need to poll again.

I find the if construction hard to read, but yes, it looks OK.

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-22 14:57                                   ` Sagi Grimberg
  (?)
@ 2015-11-22 16:55                                   ` Bart Van Assche
  -1 siblings, 0 replies; 61+ messages in thread
From: Bart Van Assche @ 2015-11-22 16:55 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig
  Cc: linux-rdma, axboe, linux-scsi, linux-kernel

On 11/22/15 06:57, Sagi Grimberg wrote:
>>> I think that Bart wants to allow the caller to select CPU affinity
>>> per CQ. In this case ib_alloc_cq in workqueue mode would need to
>>> accept an affinity_hint from the caller (default to the wild-card
>>> WORK_CPU_UNBOUND).
>>
>> Hmm, true.  How would we set that hint from userspace?  I'd really prefer
>> to see a practical justification for it first.
>
> In order to assign CPUs from user-space we'd need an ethtool-like
> interface for isert/srpt/<xxxt>. Given that this is something we don't
> want to get into right now, I assumed that Bart meant that srpt
> would take a "least used" approach in the srpt driver (which isn't any
> better than taking the wild-card option, I'd say). So I'll let Bart answer...

Hello Christoph and Sagi,

My intention is indeed to allow control of CPU affinity per CQ. One use 
case is to implement a least-used policy in RDMA drivers that use 
multiple completion queues. Another use case is to make CPU affinity 
configurable from user space through something similar to ethtool or via 
sysfs.

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-22 13:23                               ` Christoph Hellwig
@ 2015-11-22 14:57                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 61+ messages in thread
From: Sagi Grimberg @ 2015-11-22 14:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, linux-rdma, axboe, linux-scsi, linux-kernel


>>
>> I think that Bart wants to allow the caller to select CPU affinity
>> per CQ. In this case ib_alloc_cq in workqueue mode would need to
>> accept an affinity_hint from the caller (default to the wild-card
>> WORK_CPU_UNBOUND).
>
> Hmm, true.  How would we set that hint from userspace?  I'd really prefer
> to see a practical justification for it first.

In order to assign CPUs from user-space we'd need an ethtool-like
interface for isert/srpt/<xxxt>. Given that this is something we don't
want to get into right now, I assumed that Bart meant that srpt
would take a "least used" approach in the srpt driver (which isn't any
better than taking the wild-card option, I'd say). So I'll let Bart answer...
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-22 10:36                           ` Sagi Grimberg
@ 2015-11-22 13:23                               ` Christoph Hellwig
  -1 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-22 13:23 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Bart Van Assche,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Sun, Nov 22, 2015 at 12:36:00PM +0200, Sagi Grimberg wrote:
>
>> Wouldn't it be a better idea to set the WQ_SYSFS interface and use
>> the standard sysfs interface for specifying cpumasks or node affinity?
>
> I think that Bart wants to allow the caller to select cpu affinity
> per CQ. In this case ib_alloc_cq in workqueue mode would need to
> accept an affinity_hint from the caller (default to wild-card 
> WORK_CPU_UNBOUND).

Hmm, true.  How would we set that hint from userspace?  I'd really prefer
to see a practical justification for it first.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-22 10:13                     ` Christoph Hellwig
@ 2015-11-22 10:36                           ` Sagi Grimberg
  0 siblings, 0 replies; 61+ messages in thread
From: Sagi Grimberg @ 2015-11-22 10:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	axboe-b10kYP2dOMg, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA


> Wouldn't it be a better idea to set the WQ_SYSFS interface and use
> the standard sysfs interface for specifying cpumasks or node affinity?

I think that Bart wants to allow the caller to select cpu affinity
per CQ. In this case ib_alloc_cq in workqueue mode would need to
accept an affinity_hint from the caller (default to wild-card 
WORK_CPU_UNBOUND).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-22  9:51                   ` Sagi Grimberg
@ 2015-11-22 10:13                     ` Christoph Hellwig
       [not found]                       ` <20151122101308.GA12189-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-22 10:13 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, Christoph Hellwig, linux-rdma, axboe,
	linux-scsi, linux-kernel

On Sun, Nov 22, 2015 at 11:51:13AM +0200, Sagi Grimberg wrote:
>
>> Hello Christoph,
>>
>> The comment about locality in the above quote is interesting. How about
>> modifying patch 2/9 as indicated below? The modification below does not
>> change the behavior of this patch if ib_cq.w.cpu is not modified. And it
>> allows users who care about locality, and who want to skip the scheduler
>> overhead, to set ib_cq.w.cpu to the index of the CPU on which they want
>> the work to be processed.
>
> That sounds acceptable...

Wouldn't it be a better idea to set the WQ_SYSFS interface and use
the standard sysfs interface for specifying cpumasks or node affinity?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-20 16:50                 ` Bart Van Assche
@ 2015-11-22  9:51                   ` Sagi Grimberg
  2015-11-22 10:13                     ` Christoph Hellwig
  0 siblings, 1 reply; 61+ messages in thread
From: Sagi Grimberg @ 2015-11-22  9:51 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: linux-rdma, axboe, linux-scsi, linux-kernel


> Hello Christoph,
>
> The comment about locality in the above quote is interesting. How about
> modifying patch 2/9 as indicated below? The modification below does not
> change the behavior of this patch if ib_cq.w.cpu is not modified. And it
> allows users who care about locality, and who want to skip the scheduler
> overhead, to set ib_cq.w.cpu to the index of the CPU on which they want
> the work to be processed.

That sounds acceptable...

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-20 10:16               ` Christoph Hellwig
@ 2015-11-20 16:50                 ` Bart Van Assche
  2015-11-22  9:51                   ` Sagi Grimberg
  0 siblings, 1 reply; 61+ messages in thread
From: Bart Van Assche @ 2015-11-20 16:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, linux-rdma, axboe, linux-scsi, linux-kernel

On 11/20/2015 02:16 AM, Christoph Hellwig wrote:
> On Wed, Nov 18, 2015 at 10:20:14AM -0800, Bart Van Assche wrote:
>> Are you perhaps referring to the sysfs CPU mask that allows controlling
>> workqueue affinity?
> 
> I think he is referring to the definition of WQ_UNBOUND:
> 
>    WQ_UNBOUND
> 
> 	Work items queued to an unbound wq are served by the special
> 	worker-pools which host workers which are not bound to any
> 	specific CPU.  This makes the wq behave as a simple execution
> 	context provider without concurrency management.  The unbound
> 	worker-pools try to start execution of work items as soon as
> 	possible.  Unbound wq sacrifices locality but is useful for
> 	the following cases.
> 
> 	* Wide fluctuation in the concurrency level requirement is
> 	  expected and using bound wq may end up creating large number
> 	  of mostly unused workers across different CPUs as the issuer
> 	  hops through different CPUs.
> 
> 	* Long running CPU intensive workloads which can be better
> 	  managed by the system scheduler.
 
Hello Christoph,

The comment about locality in the above quote is interesting. How about
modifying patch 2/9 as indicated below? The modification below does not
change the behavior of this patch if ib_cq.w.cpu is not modified. And it
allows users who care about locality, and who want to skip the scheduler
overhead, to set ib_cq.w.cpu to the index of the CPU on which they want
the work to be processed.

Thanks,

Bart.

---
 drivers/infiniband/core/cq.c | 11 ++++++-----
 include/rdma/ib_verbs.h      |  5 ++++-
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index bf2a079..4d80d8c 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -94,18 +94,18 @@ static void ib_cq_completion_softirq(struct ib_cq *cq, void *private)
 
 static void ib_cq_poll_work(struct work_struct *work)
 {
-	struct ib_cq *cq = container_of(work, struct ib_cq, work);
+	struct ib_cq *cq = container_of(work, struct ib_cq, w.work);
 	int completed;
 
 	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
 	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
 	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
-		queue_work(ib_comp_wq, &cq->work);
+		queue_work_on(cq->w.cpu, ib_comp_wq, &cq->w.work);
 }
 
 static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
 {
-	queue_work(ib_comp_wq, &cq->work);
+	queue_work_on(cq->w.cpu, ib_comp_wq, &cq->w.work);
 }
 
 /**
@@ -159,7 +159,8 @@ struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
 		break;
 	case IB_POLL_WORKQUEUE:
 		cq->comp_handler = ib_cq_completion_workqueue;
-		INIT_WORK(&cq->work, ib_cq_poll_work);
+		INIT_WORK(&cq->w.work, ib_cq_poll_work);
+		cq->w.cpu = WORK_CPU_UNBOUND;
 		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 		break;
 	default:
@@ -195,7 +196,7 @@ void ib_free_cq(struct ib_cq *cq)
 		irq_poll_disable(&cq->iop);
 		break;
 	case IB_POLL_WORKQUEUE:
-		flush_work(&cq->work);
+		flush_work(&cq->w.work);
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f59a8d3..b1344f8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1291,7 +1291,10 @@ struct ib_cq {
 	struct ib_wc		*wc;
 	union {
 		struct irq_poll		iop;
-		struct work_struct	work;
+		struct {
+			struct work_struct	work;
+			int			cpu;
+		} w;
 	};
 };
 
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-18 18:20               ` Bart Van Assche
  (?)
@ 2015-11-20 10:16               ` Christoph Hellwig
  2015-11-20 16:50                 ` Bart Van Assche
  -1 siblings, 1 reply; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-20 10:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Sagi Grimberg, Christoph Hellwig, linux-rdma, axboe, linux-scsi,
	linux-kernel

On Wed, Nov 18, 2015 at 10:20:14AM -0800, Bart Van Assche wrote:
> Are you perhaps referring to the sysfs CPU mask that allows controlling
> workqueue affinity?

I think he is referring to the definition of WQ_UNBOUND:

  WQ_UNBOUND

	Work items queued to an unbound wq are served by the special
	worker-pools which host workers which are not bound to any
	specific CPU.  This makes the wq behave as a simple execution
	context provider without concurrency management.  The unbound
	worker-pools try to start execution of work items as soon as
	possible.  Unbound wq sacrifices locality but is useful for
	the following cases.

	* Wide fluctuation in the concurrency level requirement is
	  expected and using bound wq may end up creating large number
	  of mostly unused workers across different CPUs as the issuer
	  hops through different CPUs.

	* Long running CPU intensive workloads which can be better
	  managed by the system scheduler.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-18  7:55           ` Sagi Grimberg
@ 2015-11-18 18:20               ` Bart Van Assche
  -1 siblings, 0 replies; 61+ messages in thread
From: Bart Van Assche @ 2015-11-18 18:20 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: axboe-b10kYP2dOMg, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 11/17/2015 11:55 PM, Sagi Grimberg wrote:
>>> +static void ib_cq_poll_work(struct work_struct *work)
>>> +{
>>> +    struct ib_cq *cq = container_of(work, struct ib_cq, work);
>>> +    int completed;
>>> +
>>> +    completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
>>> +    if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
>>> +        ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
>>> +        queue_work(ib_comp_wq, &cq->work);
>>> +}
>>> +
>>> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
>>> +{
>>> +    queue_work(ib_comp_wq, &cq->work);
>>> +}
>>
>> The above code will cause all polling to occur on the context of the CPU
>> that received the completion interrupt. This approach is not powerful
>> enough. For certain workloads throughput is higher if work completions
>> are processed by another CPU core on the same CPU socket. Has it been
>> considered to make the CPU core on which work completions are processed
>> configurable ?
>
> The workqueue is unbound. This means that the functionality you are
> asking for exists.

Hello Sagi,

Are you perhaps referring to the sysfs CPU mask that allows controlling
workqueue affinity? I expect that setting the CPU mask for an entire
pool through sysfs will lead to suboptimal results. What I have learned
by tuning target systems is that there is a significant performance
difference (> 30% IOPS) between a configuration where each completion
thread is pinned to exactly one CPU and one where the scheduler is
allowed to choose a CPU.

Controlling the CPU affinity of worker threads with the taskset command 
is not possible since the function create_worker() in kernel/workqueue.c 
calls kthread_bind_mask(). That function sets PF_NO_SETAFFINITY. From 
sched.h:

#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to 
meddle with cpus_allowed */

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-17 17:52       ` Bart Van Assche
@ 2015-11-18 14:00           ` Christoph Hellwig
  -1 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-18 14:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, Nov 17, 2015 at 09:52:58AM -0800, Bart Van Assche wrote:
> On 11/13/2015 05:46 AM, Christoph Hellwig wrote:
>> + * context and does not ask from completion interrupts from the HCA.
>                                ^^^^
> Should this perhaps be changed into "for" ?

Yes.

>
>> + */
>> +void ib_process_cq_direct(struct ib_cq *cq)
>> +{
>> +	WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
>> +
>> +	__ib_process_cq(cq, INT_MAX);
>> +}
>> +EXPORT_SYMBOL(ib_process_cq_direct);
>
> My proposal is to drop this function and to export __ib_process_cq() 
> instead (with or without renaming). That will allow callers of this 
> function to compare the poll budget with the number of completions that 
> have been processed and use that information to decide whether or not to 
> call this function again.

I'd like to keep the WARN_ON, but we can export the same signature.

Then again my preference would be to remove the direct mode entirely.

>> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
>> +{
>> +	queue_work(ib_comp_wq, &cq->work);
>> +}
>
> The above code will cause all polling to occur on the context of the CPU 
> that received the completion interrupt. This approach is not powerful 
> enough. For certain workloads throughput is higher if work completions are 
> processed by another CPU core on the same CPU socket. Has it been 
> considered to make the CPU core on which work completions are processed 
> configurable ?

It's an unbound workqueue, so it's not tied to a specific CPU.  However,
we'll only run the work_struct once, so it's still tied to a single CPU
at a time; but that's no different from the previous kthread use.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-17 17:52       ` Bart Van Assche
@ 2015-11-18  7:55           ` Sagi Grimberg
  -1 siblings, 0 replies; 61+ messages in thread
From: Sagi Grimberg @ 2015-11-18  7:55 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: axboe-b10kYP2dOMg, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Hi Bart,

>> + */
>> +void ib_process_cq_direct(struct ib_cq *cq)
>> +{
>> +    WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
>> +
>> +    __ib_process_cq(cq, INT_MAX);
>> +}
>> +EXPORT_SYMBOL(ib_process_cq_direct);
>
> My proposal is to drop this function and to export __ib_process_cq()
> instead (with or without renaming). That will allow callers of this
> function to compare the poll budget with the number of completions that
> have been processed and use that information to decide whether or not to
> call this function again.

I agree with that.

>
>> +static void ib_cq_poll_work(struct work_struct *work)
>> +{
>> +    struct ib_cq *cq = container_of(work, struct ib_cq, work);
>> +    int completed;
>> +
>> +    completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
>> +    if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
>> +        ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
>> +        queue_work(ib_comp_wq, &cq->work);
>> +}
>> +
>> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
>> +{
>> +    queue_work(ib_comp_wq, &cq->work);
>> +}
>
> The above code will cause all polling to occur on the context of the CPU
> that received the completion interrupt. This approach is not powerful
> enough. For certain workloads throughput is higher if work completions
> are processed by another CPU core on the same CPU socket. Has it been
> considered to make the CPU core on which work completions are processed
> configurable ?

The workqueue is unbound. This means that the functionality you are
asking for exists.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 13:46 ` [PATCH 2/9] IB: " Christoph Hellwig
@ 2015-11-17 17:52       ` Bart Van Assche
       [not found]   ` <1447422410-20891-3-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  1 sibling, 0 replies; 61+ messages in thread
From: Bart Van Assche @ 2015-11-17 17:52 UTC (permalink / raw)
  To: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 11/13/2015 05:46 AM, Christoph Hellwig wrote:
> + * context and does not ask from completion interrupts from the HCA.
                                ^^^^
Should this perhaps be changed into "for" ?

> + */
> +void ib_process_cq_direct(struct ib_cq *cq)
> +{
> +	WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
> +
> +	__ib_process_cq(cq, INT_MAX);
> +}
> +EXPORT_SYMBOL(ib_process_cq_direct);

My proposal is to drop this function and to export __ib_process_cq() 
instead (with or without renaming). That will allow callers of this 
function to compare the poll budget with the number of completions that 
have been processed and use that information to decide whether or not to 
call this function again.

> +static void ib_cq_poll_work(struct work_struct *work)
> +{
> +	struct ib_cq *cq = container_of(work, struct ib_cq, work);
> +	int completed;
> +
> +	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> +	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> +	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> +		queue_work(ib_comp_wq, &cq->work);
> +}
> +
> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
> +{
> +	queue_work(ib_comp_wq, &cq->work);
> +}

The above code will cause all polling to occur on the context of the CPU 
that received the completion interrupt. This approach is not powerful 
enough. For certain workloads throughput is higher if work completions 
are processed by another CPU core on the same CPU socket. Has it been 
considered to make the CPU core on which work completions are processed 
configurable ?

> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 62b6cba..3027824 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -457,10 +457,11 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
>   static void srp_destroy_qp(struct srp_rdma_ch *ch)
>   {
>   	static struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
> -	static struct ib_recv_wr wr = { .wr_id = SRP_LAST_WR_ID };
> +	static struct ib_recv_wr wr = { 0 };
>   	struct ib_recv_wr *bad_wr;
>   	int ret;

Since the 'wr' structure is static, I don't think it needs to be 
zero-initialized explicitly.

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
@ 2015-11-17 17:52       ` Bart Van Assche
  0 siblings, 0 replies; 61+ messages in thread
From: Bart Van Assche @ 2015-11-17 17:52 UTC (permalink / raw)
  To: Christoph Hellwig, linux-rdma
  Cc: sagig, bart.vanassche, axboe, linux-scsi, linux-kernel

On 11/13/2015 05:46 AM, Christoph Hellwig wrote:
> + * context and does not ask from completion interrupts from the HCA.
                                ^^^^
Should this perhaps be changed into "for" ?

> + */
> +void ib_process_cq_direct(struct ib_cq *cq)
> +{
> +	WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
> +
> +	__ib_process_cq(cq, INT_MAX);
> +}
> +EXPORT_SYMBOL(ib_process_cq_direct);

My proposal is to drop this function and to export __ib_process_cq() 
instead (with or without renaming). That will allow callers of this 
function to compare the poll budget with the number of completions that 
have been processed and use that information to decide whether or not to 
call this function again.

> +static void ib_cq_poll_work(struct work_struct *work)
> +{
> +	struct ib_cq *cq = container_of(work, struct ib_cq, work);
> +	int completed;
> +
> +	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> +	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> +	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> +		queue_work(ib_comp_wq, &cq->work);
> +}
> +
> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
> +{
> +	queue_work(ib_comp_wq, &cq->work);
> +}

The above code will cause all polling to occur on the context of the CPU 
that received the completion interrupt. This approach is not powerful 
enough. For certain workloads throughput is higher if work completions 
are processed by another CPU core on the same CPU socket. Has it been 
considered to make the CPU core on which work completions are processed 
configurable ?

> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 62b6cba..3027824 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -457,10 +457,11 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
>   static void srp_destroy_qp(struct srp_rdma_ch *ch)
>   {
>   	static struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
> -	static struct ib_recv_wr wr = { .wr_id = SRP_LAST_WR_ID };
> +	static struct ib_recv_wr wr = { 0 };
>   	struct ib_recv_wr *bad_wr;
>   	int ret;

Since the 'wr' structure is static I don't think it needs to be 
zero-initialized explicitly.
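
The property Bart relies on can be checked in userspace: C guarantees that objects with static storage duration are zero-initialized even without an explicit initializer. The struct below is a stand-in for struct ib_recv_wr; only the storage duration matters:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct ib_recv_wr; the real layout is irrelevant. */
struct toy_recv_wr {
	unsigned long wr_id;
	void *next;
	int num_sge;
};

/* Mirrors srp_destroy_qp(): a function-local object declared
 * 'static' has static storage duration, so every member starts
 * out zero without an explicit "= { 0 }" (C11 6.7.9p10). */
const struct toy_recv_wr *get_static_wr(void)
{
	static struct toy_recv_wr wr;	/* no initializer needed */

	return &wr;
}
```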

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-15 12:55         ` Christoph Hellwig
@ 2015-11-15 13:21             ` Sagi Grimberg
  -1 siblings, 0 replies; 61+ messages in thread
From: Sagi Grimberg @ 2015-11-15 13:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA



On 15/11/2015 14:55, Christoph Hellwig wrote:
> On Sun, Nov 15, 2015 at 11:40:02AM +0200, Sagi Grimberg wrote:
>> I doubt INT_MAX is useful as a budget in any use-case. It can easily
>> hog the CPU. If the consumer is given access to poll a CQ, it must be
>> able to provide some way to budget it. Why not expose a budget argument
>> to the consumer?
>
> Because in theory we could have a lot of sends completing before
> we finally need to reap them.  I think that's more of a theoretical
> than real issue.

Still, processing a CQ possibly forever is not something we'd want to
enable in an API; if a caller wants to do that anyway, it should loop
this call...

>
> My preference would be to simply kill this mode though.  Allocate an IU
> to each block request in SRP and only use the free_tx list for task
> management and AEN/req_limit calls.  Then we can use a single CQ
> and mark the regular I/O requests as unsignalled.

It might be better. I'd say that we keep this API and let Bart decide
if he wants to do that in srp. If he wants to convert srp, we can
always drop it.

> AFAICS no other driver wants a similar polling mode as the SRP initiator
> does for its send queue.

iser worked in this mode in the past. But we changed that.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-15  9:40   ` Sagi Grimberg
@ 2015-11-15 12:55         ` Christoph Hellwig
  0 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-15 12:55 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Sun, Nov 15, 2015 at 11:40:02AM +0200, Sagi Grimberg wrote:
> I doubt INT_MAX is useful as a budget in any use-case. It can easily
> hog the CPU. If the consumer is given access to poll a CQ, it must be
> able to provide some way to budget it. Why not expose a budget argument
> to the consumer?

Because in theory we could have a lot of sends completing before
we finally need to reap them.  I think that's more of a theoretical
than real issue.

My preference would be to simply kill this mode though.  Allocate an IU
to each block request in SRP and only use the free_tx list for task
management and AEN/req_limit calls.  Then we can use a single CQ
and mark the regular I/O requests as unsignalled.

AFAICS no other driver wants a similar polling mode as the SRP initiator
does for its send queue.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 13:46 ` [PATCH 2/9] IB: " Christoph Hellwig
@ 2015-11-15  9:40   ` Sagi Grimberg
       [not found]     ` <564852F2.5080602-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
       [not found]   ` <1447422410-20891-3-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  1 sibling, 1 reply; 61+ messages in thread
From: Sagi Grimberg @ 2015-11-15  9:40 UTC (permalink / raw)
  To: Christoph Hellwig, linux-rdma
  Cc: bart.vanassche, axboe, linux-scsi, linux-kernel



> +/**
> + * ib_process_direct_cq - process a CQ in caller context
> + * @cq:		CQ to process
> + *
> + * This function is used to process all outstanding CQ entries on a
> + * %IB_POLL_DIRECT CQ.  It does not offload CQ processing to a different
> + * context and does not ask from completion interrupts from the HCA.
> + */
> +void ib_process_cq_direct(struct ib_cq *cq)
> +{
> +	WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
> +
> +	__ib_process_cq(cq, INT_MAX);
> +}

I doubt INT_MAX is useful as a budget in any use-case. It can easily
hog the CPU. If the consumer is given access to poll a CQ, it must be
able to provide some way to budget it. Why not expose a budget argument
to the consumer?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 22:06           ` Jason Gunthorpe
@ 2015-11-14  7:13                 ` Christoph Hellwig
  0 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-14  7:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Bart Van Assche, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 13, 2015 at 03:06:36PM -0700, Jason Gunthorpe wrote:
> Looking at that thread and then at the patch a bit more..
> 
> +void ib_process_cq_direct(struct ib_cq *cq)
> [..]
> +	__ib_process_cq(cq, INT_MAX);
> 
> INT_MAX is not enough, it needs to loop.
> This is missing an ib_req_notify also.

No.  The direct case _never_ calls ib_req_notify.  It's for the case where
SRP polls the send CQ only from the same context it sends from,
without any interrupt notification at all.

> +static int __ib_process_cq(struct ib_cq *cq, int budget)
> +	while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {
> 
> Does an unnecessary ib_poll_cq call in common cases. I'd suggest
> changing the result to bool and doing:
> 
> // true return means the caller should attempt ib_req_notify_cq
> while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {
>  for (...)
>  if (n != IB_POLL_BATCH)
>    return true;
>  completed += n;
>  if (completed > budget)
>     return false;
> }
> return true;
> 
> And then change call site like:
> 
> static void ib_cq_poll_work(struct work_struct *work)
> {
>     if (__ib_process_cq(...))
>         if (ib_req_notify_cq(cq, IB_POLL_FLAGS) == 0)
> 	    return;
>     // Else we need to loop again.
>     queue_work(ib_comp_wq, &cq->work);
> }
> 
> Which avoids the rearm.
> 
> void ib_process_cq_direct(struct ib_cq *cq)
> {
>    while (1) {
>        if (__ib_process_cq(..) &&
>            ib_req_notify_cq(cq, IB_POLL_FLAGS) == 0)
>            return;
>    }
> }
> 
> Which adds the inf loop and rearm.
> 
> etc for softirq

For the workqueue and softirq cases this looks reasonable.  For the
direct case there is no rearming, though.

> Perhaps ib_req_notify_cq should be folded into __ib_process_cq, then
> it can trivially honour the budget on additional loops from
> IB_CQ_REPORT_MISSED_EVENTS.

Which also defeats this proposal.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 18:25       ` Jason Gunthorpe
@ 2015-11-14  7:08           ` Christoph Hellwig
  -1 siblings, 0 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-14  7:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 13, 2015 at 11:25:13AM -0700, Jason Gunthorpe wrote:
> For instance, like this, not fully draining the cq and then doing:
> 
> > +	completed = __ib_process_cq(cq, budget);
> > +	if (completed < budget) {
> > +		irq_poll_complete(&cq->iop);
> > +		if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
> 
> Doesn't seem entirely right? There is no point in calling
> ib_req_notify_cq if the code knows there is still stuff in the CQ and
> has already, independently, arranged for ib_poll_handler to be
> guaranteed called.

The code only calls ib_req_notify_cq if it knows we finished earlier than
our budget.

> > +	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> > +	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> > +	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> > +		queue_work(ib_comp_wq, &cq->work);
> 
> Same comment here..


Same here - we only requeue the work item if either we processed all of
our budget, or ib_req_notify_cq with IB_CQ_REPORT_MISSED_EVENTS told
us that we need to poll again.

> I understand several drivers are not using a hard irq context for the
> comp_handler call back. Is there any way to exploit that in this new
> API so we don't have to do so many context switches? Ie if the driver
> already is using a softirq when calling comp_handler can we somehow
> just rig ib_poll_handler directly and avoid the overhead? (Future)

Let's say this API makes it possible.  I still don't think moving the
whole budget and rearm logic into the LLD is necessarily a good idea
if we can avoid it.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 19:57           ` Bart Van Assche
  (?)
@ 2015-11-13 22:06           ` Jason Gunthorpe
       [not found]             ` <20151113220636.GA32133-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  -1 siblings, 1 reply; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-13 22:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-rdma, sagig, axboe, linux-scsi, linux-kernel

On Fri, Nov 13, 2015 at 11:57:56AM -0800, Bart Van Assche wrote:

> I think this is the conversation you are referring to: "About a shortcoming
> of the verbs API" (http://thread.gmane.org/gmane.linux.drivers.rdma/5028).
> That conversation occurred five years ago, which means that you have an
> excellent memory :-)

Heh, the whole thread is interesting, but this message is what I was
thinking of

http://thread.gmane.org/gmane.linux.drivers.rdma/5028

And it looks like this patch is OK relative to that discussion.

> It doesn't seem to me like Christoph wanted to support dynamic switching
> between the IB_POLL_DIRECT, IB_POLL_SOFTIRQ and IB_POLL_WORKQUEUE polling
> modes. I think this should have been mentioned in the patch description.

Indeed. Which is probably OK.

> The implementation of this patch makes it clear that it is essential that
> all polling is serialized. The WC array that is used for polling is embedded
> in the CQ and is not protected against concurrent access. This means that it
> is essential that _ib_process_cq() calls are serialized. I need some more
> time to verify whether such serialization is always guaranteed by this
> patch.

Yes, the two big design/review checks:
 - ib_process_cq is fully serialized/etc
 - All re-arm cases are done properly - rearm is only called when the
   CQ is empty and all cases where it is not empty guarantee that the
   polling loop happens again.

Looking at that thread and then at the patch a bit more..

+void ib_process_cq_direct(struct ib_cq *cq)
[..]
+	__ib_process_cq(cq, INT_MAX);

INT_MAX is not enough, it needs to loop.
This is missing an ib_req_notify also.

And this structure:

+static int __ib_process_cq(struct ib_cq *cq, int budget)
+	while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {

Does an unnecessary ib_poll_cq call in common cases. I'd suggest
changing the result to bool and doing:

// true return means the caller should attempt ib_req_notify_cq
while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {
 for (...)
 if (n != IB_POLL_BATCH)
   return true;
 completed += n;
 if (completed > budget)
    return false;
}
return true;

And then change call site like:

static void ib_cq_poll_work(struct work_struct *work)
{
    if (__ib_process_cq(...))
        if (ib_req_notify_cq(cq, IB_POLL_FLAGS) == 0)
	    return;
    // Else we need to loop again.
    queue_work(ib_comp_wq, &cq->work);
}

Which avoids the rearm.

void ib_process_cq_direct(struct ib_cq *cq)
{
   while (1) {
       if (__ib_process_cq(..) &&
           ib_req_notify_cq(cq, IB_POLL_FLAGS) == 0)
           return;
   }
}

Which adds the inf loop and rearm.

etc for softirq

Perhaps ib_req_notify_cq should be folded into __ib_process_cq, then
it can trivially honour the budget on additional loops from
IB_CQ_REPORT_MISSED_EVENTS.
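
For what it's worth, the control flow sketched above can be exercised in a userspace model, with a stub poll function and a one-shot notify that reports missed events. Everything here (the toy_* names, the exact budget comparison) is illustrative rather than kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_POLL_BATCH 16

struct toy_cq {
	int pending;             /* completions currently in the CQ */
	int arrives_after_rearm; /* completions that race in at rearm time */
};

/* Stand-in for ib_poll_cq(): hand out up to @max completions. */
static int toy_poll(struct toy_cq *cq, int max)
{
	int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/* The restructured __ib_process_cq(): true means "the CQ looked
 * empty, the caller should attempt to rearm". */
static bool toy_process_cq(struct toy_cq *cq, int budget, int *done)
{
	int n, completed = 0;

	while ((n = toy_poll(cq, TOY_POLL_BATCH)) > 0) {
		completed += n;
		if (n != TOY_POLL_BATCH) {
			*done += completed;
			return true;	/* short batch: CQ drained */
		}
		if (completed >= budget) {
			*done += completed;
			return false;	/* budget used up: poll again */
		}
	}
	*done += completed;
	return true;
}

/* Stands in for ib_req_notify_cq(cq, IB_POLL_FLAGS): a return > 0
 * reports events that slipped in before the rearm took effect. */
static int toy_req_notify(struct toy_cq *cq)
{
	if (cq->arrives_after_rearm > 0) {
		cq->pending += cq->arrives_after_rearm;
		cq->arrives_after_rearm = 0;
		return 1;
	}
	return 0;
}

/* The proposed ib_process_cq_direct(): loop until a rearm succeeds
 * with nothing missed, so no completion is ever lost. */
static int toy_process_direct(struct toy_cq *cq, int budget)
{
	int done = 0;

	for (;;) {
		if (toy_process_cq(cq, budget, &done) &&
		    toy_req_notify(cq) == 0)
			return done;
	}
}
```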

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 18:25       ` Jason Gunthorpe
@ 2015-11-13 19:57           ` Bart Van Assche
  -1 siblings, 0 replies; 61+ messages in thread
From: Bart Van Assche @ 2015-11-13 19:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On 11/13/2015 10:25 AM, Jason Gunthorpe wrote:
> On Fri, Nov 13, 2015 at 02:46:43PM +0100, Christoph Hellwig wrote:
>> This adds an abstraction that allows ULP to simply pass a completion
>> object and completion callback with each submitted WR and let the RDMA
>> core handle the nitty gritty details of how to handle completion
>> interrupts and poll the CQ.
>
> This looks pretty nice, I'd really like to look it over carefully
> after SC|15..
>
> I know Bart and others have attempted to have switching between event
> and polling driven operation, but there were problems resolving the
> races. Would be nice to review that conversation.. Do you remember the
> details Bart?

Hello Jason,

I think this is the conversation you are referring to: "About a 
shortcoming of the verbs API" 
(http://thread.gmane.org/gmane.linux.drivers.rdma/5028). That 
conversation occurred five years ago, which means that you have an 
excellent memory :-)

It doesn't seem to me like Christoph wanted to support dynamic switching 
between the IB_POLL_DIRECT, IB_POLL_SOFTIRQ and IB_POLL_WORKQUEUE 
polling modes. I think this should have been mentioned in the patch 
description.

The implementation of this patch makes it clear that it is essential 
that all polling is serialized. The WC array that is used for polling is 
embedded in the CQ and is not protected against concurrent access. This 
means that it is essential that _ib_process_cq() calls are serialized. I 
need some more time to verify whether such serialization is always 
guaranteed by this patch.

Bart.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 13:46 ` [PATCH 2/9] IB: " Christoph Hellwig
@ 2015-11-13 18:25       ` Jason Gunthorpe
       [not found]   ` <1447422410-20891-3-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  1 sibling, 0 replies; 61+ messages in thread
From: Jason Gunthorpe @ 2015-11-13 18:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb,
	bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ, axboe-b10kYP2dOMg,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 13, 2015 at 02:46:43PM +0100, Christoph Hellwig wrote:
> This adds an abstraction that allows ULP to simply pass a completion
> object and completion callback with each submitted WR and let the RDMA
> core handle the nitty gritty details of how to handle completion
> interrupts and poll the CQ.

This looks pretty nice, I'd really like to look it over carefully
after SC|15..

I know Bart and others have attempted to have switching between event
and polling driven operation, but there were problems resolving the
races. Would be nice to review that conversation.. Do you remember the
details Bart?

> +static int __ib_process_cq(struct ib_cq *cq, int budget)
> +{
> +	int i, n, completed = 0;
> +
> +	while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {
> +		completed += n;
> +		if (completed >= budget)
> +			break;

For instance, like this, not fully draining the cq and then doing:

> +	completed = __ib_process_cq(cq, budget);
> +	if (completed < budget) {
> +		irq_poll_complete(&cq->iop);
> +		if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {

Doesn't seem entirely right? There is no point in calling
ib_req_notify_cq if the code knows there is still stuff in the CQ and
has already, independently, arranged for ib_poll_handler to be
guaranteed called.

> +			if (!irq_poll_sched_prep(&cq->iop))
> +				irq_poll_sched(&cq->iop);

Which, it seems, is what this is doing.

Assuming irq_poll_sched is safe to call from a hard irq context, this
looks sane, at first glance.

> +	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
> +	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
> +	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
> +		queue_work(ib_comp_wq, &cq->work);

Same comment here..

> +static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
> +{
> +	queue_work(ib_comp_wq, &cq->work);

> +	switch (cq->poll_ctx) {
> +	case IB_POLL_DIRECT:
> +		cq->comp_handler = ib_cq_completion_direct;
> +		break;
> +	case IB_POLL_SOFTIRQ:
> +		cq->comp_handler = ib_cq_completion_softirq;
> +
> +		irq_poll_init(&cq->iop, IB_POLL_BUDGET_IRQ, ib_poll_handler);
> +		irq_poll_enable(&cq->iop);
> +		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
> +		break;

I understand several drivers are not using a hard irq context for the
comp_handler call back. Is there any way to exploit that in this new
API so we don't have to do so many context switches? Ie if the driver
already is using a softirq when calling comp_handler can we somehow
just rig ib_poll_handler directly and avoid the overhead? (Future)

At first glance this seems so much saner than what we have..

Jason

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 2/9] IB: add a proper completion queue abstraction
  2015-11-13 13:46 Christoph Hellwig
@ 2015-11-13 13:46 ` Christoph Hellwig
  2015-11-15  9:40   ` Sagi Grimberg
       [not found]   ` <1447422410-20891-3-git-send-email-hch-jcswGhMUV9g@public.gmane.org>
  0 siblings, 2 replies; 61+ messages in thread
From: Christoph Hellwig @ 2015-11-13 13:46 UTC (permalink / raw)
  To: linux-rdma; +Cc: sagig, bart.vanassche, axboe, linux-scsi, linux-kernel

This adds an abstraction that allows a ULP to simply pass a completion
object and completion callback with each submitted WR and lets the RDMA
core handle the nitty-gritty details of how to handle completion
interrupts and poll the CQ.

In detail, there is a new ib_cqe structure which just contains the
completion callback and which can be used to get at the containing
object using container_of.  It is pointed to by the WR and WC as an
alternative to the wr_id field, similar to how many ULPs already use
that field to store a pointer using casts.

A driver using the new completion callbacks allocates its CQs using
the new ib_alloc_cq API, which in addition to the number of CQEs and
the completion vector also takes a mode describing how to poll for
CQEs.  Three modes are available: direct for drivers that never take CQ
interrupts and just poll for them, softirq to poll from softirq context
using the to-be-renamed blk-iopoll infrastructure, which takes care of
rearming and budgeting, or a workqueue for consumers who want to be
called from user context.

Thanks a lot to Sagi Grimberg, who helped review the API, wrote
the current version of the workqueue code because my two previous
attempts sucked too much, and converted the iSER initiator to the new
API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/infiniband/Kconfig              |   1 +
 drivers/infiniband/core/Makefile        |   2 +-
 drivers/infiniband/core/cq.c            | 208 ++++++++++++++++++++++++++++++++
 drivers/infiniband/core/device.c        |  15 ++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |   2 +-
 drivers/infiniband/ulp/srp/ib_srp.c     |   6 +-
 include/rdma/ib_verbs.h                 |  38 +++++-
 7 files changed, 263 insertions(+), 9 deletions(-)
 create mode 100644 drivers/infiniband/core/cq.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index aa26f3c..282ec0b 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -5,6 +5,7 @@ menuconfig INFINIBAND
 	depends on NET
 	depends on INET
 	depends on m || IPV6 != m
+	select IRQ_POLL
 	---help---
 	  Core support for InfiniBand (IB).  Make sure to also select
 	  any protocols you wish to use as well as drivers for your
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..ae48d8740 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -8,7 +8,7 @@ obj-$(CONFIG_INFINIBAND_USER_MAD) +=	ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=	ib_uverbs.o ib_ucm.o \
 					$(user_access-y)
 
-ib_core-y :=			packer.o ud_header.o verbs.o sysfs.o \
+ib_core-y :=			packer.o ud_header.o verbs.o cq.o sysfs.o \
 				device.o fmr_pool.o cache.o netlink.o \
 				roce_gid_mgmt.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
new file mode 100644
index 0000000..d9eb796
--- /dev/null
+++ b/drivers/infiniband/core/cq.c
@@ -0,0 +1,208 @@
+/*
+ * Copyright (c) 2015 HGST, a Western Digital Company.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+#include <linux/module.h>
+#include <linux/err.h>
+#include <linux/slab.h>
+#include <rdma/ib_verbs.h>
+
+/* # of WCs to poll for with a single call to ib_poll_cq */
+#define IB_POLL_BATCH			16
+
+/* # of WCs to iterate over before yielding */
+#define IB_POLL_BUDGET_IRQ		256
+#define IB_POLL_BUDGET_WORKQUEUE	65536
+
+#define IB_POLL_FLAGS \
+	(IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS)
+
+static int __ib_process_cq(struct ib_cq *cq, int budget)
+{
+	int i, n, completed = 0;
+
+	while ((n = ib_poll_cq(cq, IB_POLL_BATCH, cq->wc)) > 0) {
+		for (i = 0; i < n; i++) {
+			struct ib_wc *wc = &cq->wc[i];
+
+			if (wc->wr_cqe)
+				wc->wr_cqe->done(cq, wc);
+			else
+				WARN_ON_ONCE(wc->status == IB_WC_SUCCESS);
+		}
+
+		completed += n;
+		if (completed >= budget)
+			break;
+	}
+
+	return completed;
+}
+
+/**
+ * ib_process_cq_direct - process a CQ in caller context
+ * @cq:		CQ to process
+ *
+ * This function is used to process all outstanding CQ entries on a
+ * %IB_POLL_DIRECT CQ.  It does not offload CQ processing to a different
+ * context and does not ask for completion interrupts from the HCA.
+ */
+void ib_process_cq_direct(struct ib_cq *cq)
+{
+	WARN_ON_ONCE(cq->poll_ctx != IB_POLL_DIRECT);
+
+	__ib_process_cq(cq, INT_MAX);
+}
+EXPORT_SYMBOL(ib_process_cq_direct);
+
+static void ib_cq_completion_direct(struct ib_cq *cq, void *private)
+{
+	WARN_ONCE(1, "got unsolicited completion for CQ 0x%p\n", cq);
+}
+
+static int ib_poll_handler(struct irq_poll *iop, int budget)
+{
+	struct ib_cq *cq = container_of(iop, struct ib_cq, iop);
+	int completed;
+
+	completed = __ib_process_cq(cq, budget);
+	if (completed < budget) {
+		irq_poll_complete(&cq->iop);
+		if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
+			if (!irq_poll_sched_prep(&cq->iop))
+				irq_poll_sched(&cq->iop);
+		}
+
+	}
+
+	return completed;
+}
+
+static void ib_cq_completion_softirq(struct ib_cq *cq, void *private)
+{
+	if (!irq_poll_sched_prep(&cq->iop))
+		irq_poll_sched(&cq->iop);
+}
+
+static void ib_cq_poll_work(struct work_struct *work)
+{
+	struct ib_cq *cq = container_of(work, struct ib_cq, work);
+	int completed;
+
+	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE);
+	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
+	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
+		queue_work(ib_comp_wq, &cq->work);
+}
+
+static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
+{
+	queue_work(ib_comp_wq, &cq->work);
+}
+
+/**
+ * ib_alloc_cq - allocate a completion queue
+ * @dev:		device to allocate the CQ for
+ * @private:		driver private data, accessible from cq->cq_context
+ * @nr_cqe:		number of CQEs to allocate
+ * @comp_vector:	HCA completion vector for this CQ
+ * @poll_ctx:		context to poll the CQ from.
+ *
+ * This is the proper interface to allocate a CQ for in-kernel users. A
+ * CQ allocated with this interface will automatically be polled from the
+ * specified context.  The ULP must use wr->wr_cqe instead of wr->wr_id
+ * to use this CQ abstraction.
+ */
+struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
+		int nr_cqe, int comp_vector, enum ib_poll_context poll_ctx)
+{
+	struct ib_cq_init_attr cq_attr = {
+		.cqe		= nr_cqe,
+		.comp_vector	= comp_vector,
+	};
+	struct ib_cq *cq;
+	int ret = -ENOMEM;
+
+	cq = dev->create_cq(dev, &cq_attr, NULL, NULL);
+	if (IS_ERR(cq))
+		return cq;
+
+	cq->device = dev;
+	cq->uobject = NULL;
+	cq->event_handler = NULL;
+	cq->cq_context = private;
+	cq->poll_ctx = poll_ctx;
+	atomic_set(&cq->usecnt, 0);
+
+	cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);
+	if (!cq->wc)
+		goto out_destroy_cq;
+
+	switch (cq->poll_ctx) {
+	case IB_POLL_DIRECT:
+		cq->comp_handler = ib_cq_completion_direct;
+		break;
+	case IB_POLL_SOFTIRQ:
+		cq->comp_handler = ib_cq_completion_softirq;
+
+		irq_poll_init(&cq->iop, IB_POLL_BUDGET_IRQ, ib_poll_handler);
+		irq_poll_enable(&cq->iop);
+		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+		break;
+	case IB_POLL_WORKQUEUE:
+		cq->comp_handler = ib_cq_completion_workqueue;
+		INIT_WORK(&cq->work, ib_cq_poll_work);
+		ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+		break;
+	default:
+		ret = -EINVAL;
+		goto out_free_wc;
+	}
+
+	return cq;
+
+out_free_wc:
+	kfree(cq->wc);
+out_destroy_cq:
+	cq->device->destroy_cq(cq);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(ib_alloc_cq);
+
+/**
+ * ib_free_cq - free a completion queue
+ * @cq:		completion queue to free.
+ */
+void ib_free_cq(struct ib_cq *cq)
+{
+	int ret;
+
+	if (WARN_ON_ONCE(atomic_read(&cq->usecnt)))
+		return;
+
+	switch (cq->poll_ctx) {
+	case IB_POLL_DIRECT:
+		break;
+	case IB_POLL_SOFTIRQ:
+		irq_poll_disable(&cq->iop);
+		break;
+	case IB_POLL_WORKQUEUE:
+		flush_work(&cq->work);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+	}
+
+	kfree(cq->wc);
+	ret = cq->device->destroy_cq(cq);
+	WARN_ON_ONCE(ret);
+}
+EXPORT_SYMBOL(ib_free_cq);
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 0315bd7..f0ac300 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -58,6 +58,7 @@ struct ib_client_data {
 	bool		  going_down;
 };
 
+struct workqueue_struct *ib_comp_wq;
 struct workqueue_struct *ib_wq;
 EXPORT_SYMBOL_GPL(ib_wq);
 
@@ -934,10 +935,18 @@ static int __init ib_core_init(void)
 	if (!ib_wq)
 		return -ENOMEM;
 
+	ib_comp_wq = alloc_workqueue("ib-comp-wq",
+			WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM,
+			WQ_UNBOUND_MAX_ACTIVE);
+	if (!ib_comp_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
 	ret = class_register(&ib_class);
 	if (ret) {
 		printk(KERN_WARNING "Couldn't create InfiniBand device class\n");
-		goto err;
+		goto err_comp;
 	}
 
 	ret = ibnl_init();
@@ -952,7 +961,8 @@ static int __init ib_core_init(void)
 
 err_sysfs:
 	class_unregister(&ib_class);
-
+err_comp:
+	destroy_workqueue(ib_comp_wq);
 err:
 	destroy_workqueue(ib_wq);
 	return ret;
@@ -963,6 +973,7 @@ static void __exit ib_core_cleanup(void)
 	ib_cache_cleanup();
 	ibnl_cleanup();
 	class_unregister(&ib_class);
+	destroy_workqueue(ib_comp_wq);
 	/* Make sure that any pending umem accounting work is done. */
 	destroy_workqueue(ib_wq);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 737d273..d315367 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -70,7 +70,6 @@ static struct ib_qp_attr ipoib_cm_err_attr = {
 #define IPOIB_CM_RX_DRAIN_WRID 0xffffffff
 
 static struct ib_send_wr ipoib_cm_rx_drain_wr = {
-	.wr_id = IPOIB_CM_RX_DRAIN_WRID,
 	.opcode = IB_WR_SEND,
 };
 
@@ -223,6 +222,7 @@ static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv *priv)
 	 * error" WC will be immediately generated for each WR we post.
 	 */
 	p = list_entry(priv->cm.rx_flush_list.next, typeof(*p), list);
+	ipoib_cm_rx_drain_wr.wr_id = IPOIB_CM_RX_DRAIN_WRID;
 	if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr))
 		ipoib_warn(priv, "failed to post drain wr\n");
 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 62b6cba..3027824 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -457,10 +457,11 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target)
 static void srp_destroy_qp(struct srp_rdma_ch *ch)
 {
 	static struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
-	static struct ib_recv_wr wr = { .wr_id = SRP_LAST_WR_ID };
+	static struct ib_recv_wr wr = { 0 };
 	struct ib_recv_wr *bad_wr;
 	int ret;
 
+	wr.wr_id = SRP_LAST_WR_ID;
 	/* Destroying a QP and reusing ch->done is only safe if not connected */
 	WARN_ON_ONCE(ch->connected);
 
@@ -1042,13 +1043,14 @@ static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey)
 	struct ib_send_wr *bad_wr;
 	struct ib_send_wr wr = {
 		.opcode		    = IB_WR_LOCAL_INV,
-		.wr_id		    = LOCAL_INV_WR_ID_MASK,
 		.next		    = NULL,
 		.num_sge	    = 0,
 		.send_flags	    = 0,
 		.ex.invalidate_rkey = rkey,
 	};
 
+	wr.wr_id = LOCAL_INV_WR_ID_MASK;
+
 	return ib_post_send(ch->qp, &wr, &bad_wr);
 }
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 45ce36e..e11e038 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -49,6 +49,7 @@
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
 #include <linux/socket.h>
+#include <linux/irq_poll.h>
 #include <uapi/linux/if_ether.h>
 
 #include <linux/atomic.h>
@@ -56,6 +57,7 @@
 #include <asm/uaccess.h>
 
 extern struct workqueue_struct *ib_wq;
+extern struct workqueue_struct *ib_comp_wq;
 
 union ib_gid {
 	u8	raw[16];
@@ -710,7 +712,10 @@ enum ib_wc_flags {
 };
 
 struct ib_wc {
-	u64			wr_id;
+	union {
+		u64		wr_id;
+		struct ib_cqe	*wr_cqe;
+	};
 	enum ib_wc_status	status;
 	enum ib_wc_opcode	opcode;
 	u32			vendor_err;
@@ -1014,6 +1019,10 @@ struct ib_sge {
 	u32	lkey;
 };
 
+struct ib_cqe {
+	void (*done)(struct ib_cq *cq, struct ib_wc *wc);
+};
+
 /**
  * struct ib_mw_bind_info - Parameters for a memory window bind operation.
  * @mr: A memory region to bind the memory window to.
@@ -1033,7 +1042,10 @@ struct ib_mw_bind_info {
 
 struct ib_send_wr {
 	struct ib_send_wr      *next;
-	u64			wr_id;
+	union {
+		u64		wr_id;
+		struct ib_cqe	*wr_cqe;
+	};
 	struct ib_sge	       *sg_list;
 	int			num_sge;
 	enum ib_wr_opcode	opcode;
@@ -1127,7 +1139,10 @@ static inline struct ib_sig_handover_wr *sig_handover_wr(struct ib_send_wr *wr)
 
 struct ib_recv_wr {
 	struct ib_recv_wr      *next;
-	u64			wr_id;
+	union {
+		u64		wr_id;
+		struct ib_cqe	*wr_cqe;
+	};
 	struct ib_sge	       *sg_list;
 	int			num_sge;
 };
@@ -1258,6 +1273,12 @@ struct ib_ah {
 
 typedef void (*ib_comp_handler)(struct ib_cq *cq, void *cq_context);
 
+enum ib_poll_context {
+	IB_POLL_DIRECT,		/* caller context, no hw completions */
+	IB_POLL_SOFTIRQ,	/* poll from softirq context */
+	IB_POLL_WORKQUEUE,	/* poll from workqueue */
+};
+
 struct ib_cq {
 	struct ib_device       *device;
 	struct ib_uobject      *uobject;
@@ -1266,6 +1287,12 @@ struct ib_cq {
 	void                   *cq_context;
 	int               	cqe;
 	atomic_t          	usecnt; /* count number of work queues */
+	enum ib_poll_context	poll_ctx;
+	struct ib_wc		*wc;
+	union {
+		struct irq_poll		iop;
+		struct work_struct	work;
+	};
 };
 
 struct ib_srq {
@@ -2447,6 +2474,11 @@ static inline int ib_post_recv(struct ib_qp *qp,
 	return qp->device->post_recv(qp, recv_wr, bad_recv_wr);
 }
 
+struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
+		int nr_cqe, int comp_vector, enum ib_poll_context poll_ctx);
+void ib_free_cq(struct ib_cq *cq);
+void ib_process_cq_direct(struct ib_cq *cq);
+
 /**
  * ib_create_cq - Creates a CQ on the specified device.
  * @device: The device on which to create the CQ.
-- 
1.9.1
