* Potential lost receive WCs (was "[PATCH WIP 38/43]")
@ 2015-07-24 20:26 Chuck Lever
       [not found] ` <7824831C-3CC5-49C4-9E0B-58129D0E7FFF-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2015-07-24 20:26 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma


>>>> During some other testing I found that when a completion upcall
>>>> returns to the provider leaving CQEs still on the completion queue,
>>>> there is a non-zero probability that a completion will be lost.
>>> 
>>> What does lost mean?
>> 
>> Lost means a WC in the CQ is skipped by ib_poll_cq().
>> 
>> In other words, I expected that during the next upcall,
>> ib_poll_cq() would return WCs that were not processed, starting
>> with the last one on the CQ when my upcall handler returned.
> 
> Yes, this is what it should do. I wouldn't expect a timely upcall, but
> none should be lost.
> 
>> I found this by intentionally having the completion handler
>> process only one or two WCs and then return.
>> 
>>> The CQ is edge triggered, so if you don't drain it you might not get
>>> another timely CQ callback (which is bad), but CQEs themselves should
>>> not be lost.
>> 
>> I’m not sure I fully understand this problem; it might
>> even be my misunderstanding of ib_poll_cq(). But forcing
>> the completion upcall handler to completely drain the CQ
>> during each upcall prevents the issue.
> 
> CQEs should never be lost.
> 
> The idea that you can completely drain the CQ during the upcall is
> inherently racey, so this cannot be the answer to whatever the problem
> is..

Hrm, ok. Completely draining the CQ is how the upcall handler
worked before commit 8301a2c047cc.


> Is there any chance this is still an artifact of the lazy SQE flow
> control? The RDMA buffer SQE recycling is solved by the sync
> invalidate, but workloads that don't use RDMA buffers (ie SEND only)
> will still run without proper flow control…

I can’t see how it can be related to the send queue these days.
The CQ is split. The behavior I observed was in the receive
completion path. All RECVs are signaled, and there are a fixed
and limited number of reply buffers that match the number of
RPC/RDMA credits.

Basically the RPC workflow stopped because an RPC reply never
arrived.

The send queue accounting problem would cause the client to
stop sending RPC requests before we hit our credit limit.


> If you are totally certain a CQ was dropped from ib_poll_cq, and that
> the SQ is not overflowing by strict accounting, then I'd say driver
> problem, but the odds of having an undetected driver problem like that
> at this point seem somehow small…

Under normal circumstances, most ULPs are prepared to deal with a
large number of WCs per upcall. IMO the issue would be difficult to
hit unless you have rigged the upcall handler to force the problem
to occur (poll once with a small ib_wc array size and then return
from the upcall handler).
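
For illustration only, here is a minimal sketch of that kind of rigged
receive upcall; RIGGED_BUDGET and handle_recv_wc() are made-up names
for this example, not the actual xprtrdma code:

#include <rdma/ib_verbs.h>

/* Sketch: poll a small, fixed budget once and return.  Any CQEs left
 * behind stay on the CQ, and because the CQ upcall is edge-triggered,
 * no further upcall arrives until a new CQE is added.
 */
#define RIGGED_BUDGET 3

static void rigged_recv_upcall(struct ib_cq *cq, void *cq_context)
{
	struct ib_wc wcs[RIGGED_BUDGET];
	int count, i;

	count = ib_poll_cq(cq, RIGGED_BUDGET, wcs);
	for (i = 0; i < count; i++)
		handle_recv_wc(&wcs[i]);	/* hypothetical per-WC handler */

	/* Returns without re-polling or re-arming the CQ. */
}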

I will have some time to experiment next week. Thanks for confirming
my understanding of ib_poll_cq().


--
Chuck Lever




* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found] ` <7824831C-3CC5-49C4-9E0B-58129D0E7FFF-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-24 20:46   ` Jason Gunthorpe
       [not found]     ` <20150724204604.GA28244-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2015-07-24 20:46 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma

On Fri, Jul 24, 2015 at 04:26:00PM -0400, Chuck Lever wrote:
> Basically the RPC workflow stopped because an RPC reply never
> arrived.

Oh, that is what I expect to see. Remember the cq upcall is edge
triggered, so if you leave stuff in the cq then you don't get another
upcall until another CQE is added. If adding another CQE is somehow
contingent on the CQE left behind then the scheme deadlocks.

The CQE is not lost because calling ib_poll_cq from outside the upcall
will return it.

To confirm a loss you need to see ib_poll_cq return no results and
confirm that an expected CQE is missing.

The driver is expected to avoid racing with the upcall and guarantee
new CQEs will trigger no matter how many CQEs are consumed by the ULP.

So, as Steve said, if the ULP leaves CQEs behind then it must do
something to guarantee that ib_poll_cq is eventually called to collect
them, or not care about forward progress on the CQ.

Does that make sense and explain what you saw?

If yes, I recommend revising the commit and comment language. CQEs are
not lost; only the upcall isn't happening.

Jason

* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found]     ` <20150724204604.GA28244-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-29 20:47       ` Chuck Lever
       [not found]         ` <E855E210-F640-4104-9B35-2A75DF1BF2E3-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2015-07-29 20:47 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma

Hi Jason-


On Jul 24, 2015, at 4:46 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Fri, Jul 24, 2015 at 04:26:00PM -0400, Chuck Lever wrote:
>> Basically the RPC workflow stopped because an RPC reply never
>> arrived.
> 
> Oh, that is what I expect to see. Remember the cq upcall is edge
> triggered, so if you leave stuff in the cq then you don't get another
> upcall until another CQE is added. If adding another CQE is somehow
> contingent on the CQE left behind then the scheme deadlocks.
> 
> The CQE is not lost because calling ib_poll_cq from outside the upcall
> will return it.
> 
> To confirm a loss you need to see ib_poll_cq return no results and
> confirm that an expected CQE is missing.

I tested this again, now with the patches that ensure invalidate
WRs are complete before allowing more RPCs to be dispatched. I
set the poll budget to three ib_wc's per receive upcall.

During a write-intensive workload, the RPC workflow pauses. After
a minute the RPC upper layer emits a retransmit for the missing
work, which generates a fresh server reply and RECV completion.

At that point I see a duplicate XID, which is a sign that the
original CQE was still on the CQ but no upcall was done.

The RPC workflow then resumes.


> The driver is expected to avoid racing with the upcall and guarantee
> new CQEs will trigger no matter how many CQEs are consumed by the ULP.
> 
> So, as Steve said, if the ULP leaves CQEs behind then it must do
> something to guarantee that ib_poll_cq is eventually called to collect
> them, or not care about forward progress on the CQ.
> 
> Does that make sense and explain what you saw?

It seems to, yes.

The design of the current upcall handler is based on the assumption
that the provider will call again immediately if there are still
CQEs to consume. Apparently this is true for some providers, and not
for others, and I misunderstood that when I put this together last
year.

The budgeting mechanism that I copied from another kernel ULP seems
inappropriate for xprtrdma. Perhaps it's unnecessary since sending
RPCs is flow controlled based on the reply traffic.


> If yes, I recommend revising the commit and comment language. CQEs are
> not lost; only the upcall isn't happening.

I would like to change the upcall handler to poll until ib_poll_cq
says the CQ is empty, but I don't understand this remark:

> The idea that you can completely drain the CQ during the upcall is
> inherently racey, so this cannot be the answer to whatever the problem
> is..

I thought IB_CQ_REPORT_MISSED_EVENTS was supposed to close the race
windows here. And Section 8.2.5 of draft-hilland-rddp-verbs
recommends dequeuing all existing CQEs.


--
Chuck Lever




* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found]         ` <E855E210-F640-4104-9B35-2A75DF1BF2E3-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-29 21:15           ` Jason Gunthorpe
       [not found]             ` <20150729211557.GA16284-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2015-07-29 21:15 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma

On Wed, Jul 29, 2015 at 04:47:59PM -0400, Chuck Lever wrote:

> Apparently this is true for some providers, and not for others, and
> I misunderstood that when I put this together last year.

Really? In kernel providers? Interesting, those are probably wrong...

> > The idea that you can completely drain the CQ during the upcall is
> > inherently racey, so this cannot be the answer to whatever the problem
> > is..

This comment was directed toward using a complete drain to cover up a
driver bug.

A full drain to guarantee ULP progress is OK and the driver must make
sure that case isn't racey.

Which is done via:

> I thought IB_CQ_REPORT_MISSED_EVENTS was supposed to close the race
> windows here.

Basically:
 * Don't call ib_req_notify_cq unless you think the CQ is empty
 * Don't expect an upcall until you call ib_req_notify_cq
 * Call ib_req_notify_cq last
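
Put together, those rules look roughly like the sketch below;
process_wc() is a placeholder for whatever the ULP does per
completion, not a verbs API:

#include <rdma/ib_verbs.h>

/* Sketch: drain completely, then re-arm last.  A positive return from
 * ib_req_notify_cq() with IB_CQ_REPORT_MISSED_EVENTS means completions
 * arrived between the final poll and the re-arm, so poll again instead
 * of waiting for an upcall that may never come.
 */
static void drain_and_rearm(struct ib_cq *cq)
{
	struct ib_wc wc;

	do {
		while (ib_poll_cq(cq, 1, &wc) > 0)
			process_wc(&wc);	/* hypothetical per-WC handler */
	} while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
				      IB_CQ_REPORT_MISSED_EVENTS) > 0);
}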

> And Section 8.2.5 of draft-hilland-rddp-verbs recommends dequeuing
> all existing CQEs.

The drivers we have that don't dequeue all the CQEs are doing
something like NAPI polling and have other mechanisms to guarantee
progress. Don't copy something like budget without copying the other
mechanisms :)

Jason

* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found]             ` <20150729211557.GA16284-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-07-29 21:19               ` Chuck Lever
       [not found]                 ` <DC5354A4-3EB4-46FF-AA34-9AE26DD25031-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Lever @ 2015-07-29 21:19 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma


On Jul 29, 2015, at 5:15 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Wed, Jul 29, 2015 at 04:47:59PM -0400, Chuck Lever wrote:
> 
>> Apparently this is true for some providers, and not for others, and
>> I misunderstood that when I put this together last year.
> 
> Really? In kernel providers? Interesting, those are probably wrong...
> 
>>> The idea that you can completely drain the CQ during the upcall is
>>> inherently racey, so this cannot be the answer to whatever the problem
>>> is..
> 
> This comment was directed toward using a complete drain to cover up a
> driver bug.
> 
> A full drain to guarantee ULP progress is OK and the driver must make
> sure that case isn't racey.
> 
> Which is done via:
> 
>> I thought IB_CQ_REPORT_MISSED_EVENTS was supposed to close the race
>> windows here.
> 
> Basically:
> * Don't call ib_req_notify_cq unless you think the CQ is empty
> * Don't expect an upcall until you call ib_req_notify_cq
> * Call ib_req_notify_cq last
> 
>> And Section 8.2.5 of draft-hilland-rddp-verbs recommends dequeuing
>> all existing CQEs.
> 
> The drivers we have that don't dequeue all the CQEs are doing
> something like NAPI polling and have other mechanisms to guarantee
> progress. Don't copy something like budget without copying the other
> mechanisms :)

OK, that makes total sense. Thanks for clarifying.


--
Chuck Lever




* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found]                 ` <DC5354A4-3EB4-46FF-AA34-9AE26DD25031-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2015-07-30  7:00                   ` Sagi Grimberg
       [not found]                     ` <55B9CB78.9040501-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2015-07-30  7:00 UTC (permalink / raw)
  To: Chuck Lever, Jason Gunthorpe; +Cc: linux-rdma


>> The drivers we have that don't dequeue all the CQEs are doing
>> something like NAPI polling and have other mechanisms to guarantee
>> progress. Don't copy something like budget without copying the other
>> mechanisms :)
>
> OK, that makes total sense. Thanks for clarifying.

IIRC NAPI runs in soft-IRQ context, which Chuck is trying to avoid.

Chuck, I think I was the one that commented on this. I observed a
situation in iser where the polling loop kept going continuously
without ever leaving the soft-IRQ context (high workload obviously).
In addition to the polling loop hogging the CPU, other CQs with the
same IRQ assignment were starved. So I suggested you should take care
of it in xprtrdma as well.

The correct approach is NAPI. There is an equivalent for storage
called blk_iopoll (block/blk-iopoll.c) which, despite its name, has
nothing specific to block devices (it also runs in soft-IRQ context).
I attempted to convert iser to use it, but I saw some unpredictable
latency jitter, so I stopped and haven't had a chance to pick it up
since.

I still think that draining the CQ without respecting a quota is
wrong, even if driverX has a glitch there.

Sagi.

* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found]                     ` <55B9CB78.9040501-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-07-30 14:51                       ` Chuck Lever
  2015-07-30 16:00                       ` Jason Gunthorpe
  1 sibling, 0 replies; 8+ messages in thread
From: Chuck Lever @ 2015-07-30 14:51 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Jason Gunthorpe, linux-rdma


On Jul 30, 2015, at 3:00 AM, Sagi Grimberg <sagig-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> 
>>> The drivers we have that don't dequeue all the CQEs are doing
>>> something like NAPI polling and have other mechanisms to guarantee
>>> progress. Don't copy something like budget without copying the other
>>> mechanisms :)
>> 
>> OK, that makes total sense. Thanks for clarifying.
> 
> IIRC NAPI runs in soft-IRQ context, which Chuck is trying to avoid.
> 
> Chuck, I think I was the one that commented on this. I observed a
> situation in iser where the polling loop kept going continuously
> without ever leaving the soft-IRQ context (high workload obviously).
> In addition to the polling loop hogging the CPU, other CQs with the
> same IRQ assignment were starved. So I suggested you should take care
> of it in xprtrdma as well.
> 
> The correct approach is NAPI. There is an equivalent for storage
> called blk_iopoll (block/blk-iopoll.c) which, despite its name, has
> nothing specific to block devices (it also runs in soft-IRQ context).
> I attempted to convert iser to use it, but I saw some unpredictable
> latency jitter, so I stopped and haven't had a chance to pick it up
> since.
> 
> I still think that draining the CQ without respecting a quota is
> wrong, even if driverX has a glitch there.

The iWARP and IBTA specs disagree: they both recommend clearing
existing CQEs when handling a completion upcall. Thus the API is
designed with the expectation that consumers do not impose a poll
budget.

Any solution to the starvation problem, including quota + NAPI,
involves deferring receive work. xprtrdma already defers work.

Our completion handlers are lightweight. The bulk of receive
handling is done in softIRQ in a tasklet that handles each RPC
reply in a loop. It's more likely the tasklet loop, rather than
completion handling, is going to result in starvation.

The only issue we've seen so far is that the reply tasklet can hog
one CPU because it is single-threaded across all transport
connections. Thus it is more effective for us to replace the
tasklet with a work queue where each RPC reply can be globally
scheduled and does not interfere with other work being done
by softIRQ.

In other words, the starvation issue seen in xprtrdma is not
in the receive handler, so fixing it there is likely to be
ineffective.


--
Chuck Lever




* Re: Potential lost receive WCs (was "[PATCH WIP 38/43]")
       [not found]                     ` <55B9CB78.9040501-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-07-30 14:51                       ` Chuck Lever
@ 2015-07-30 16:00                       ` Jason Gunthorpe
  1 sibling, 0 replies; 8+ messages in thread
From: Jason Gunthorpe @ 2015-07-30 16:00 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Chuck Lever, linux-rdma

On Thu, Jul 30, 2015 at 10:00:08AM +0300, Sagi Grimberg wrote:

> I still think that draining the CQ without respecting a quota is
> wrong, even if driverX has a glitch there.

Sure, but you can't just return from the CQ upcall after polling only
a budget's worth of completions and expect to be called again in the
future. That is absolutely wrong.

It is very difficult to mix and match processing in the CQ upcall and
in another context.

So, either you drain the whole thing in the CQ upcall, or delegate to
another context and process it there, possibly sleeping during
processing when the budget is hit.

Overall, the process is the same: drain entirely before calling
ib_req_notify_cq, and don't expect any CQ upcalls until
ib_req_notify_cq is called.
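
For illustration, a sketch of the delegation variant under those same
rules; struct example_cq, POLL_BUDGET, and process_wc() are assumptions
made up for the example, not any existing driver's code:

#include <linux/sched.h>
#include <linux/workqueue.h>
#include <rdma/ib_verbs.h>

#define POLL_BUDGET 16		/* illustrative value */

struct example_cq {
	struct ib_cq		*cq;
	struct work_struct	work;	/* INIT_WORK(&work, example_cq_work) at setup */
};

/* The upcall itself only schedules process-context work. */
static void example_cq_upcall(struct ib_cq *cq, void *cq_context)
{
	struct example_cq *ecq = cq_context;

	queue_work(system_wq, &ecq->work);
}

static void example_cq_work(struct work_struct *work)
{
	struct example_cq *ecq = container_of(work, struct example_cq, work);
	struct ib_wc wc;
	int polled = 0;

	do {
		while (ib_poll_cq(ecq->cq, 1, &wc) > 0) {
			process_wc(&wc);	/* hypothetical per-WC handler */
			if (++polled >= POLL_BUDGET) {
				cond_resched();	/* process context: OK to yield */
				polled = 0;
			}
		}
		/* The CQ looks empty: re-arm last, and loop again if
		 * completions arrived while re-arming. */
	} while (ib_req_notify_cq(ecq->cq, IB_CQ_NEXT_COMP |
					   IB_CQ_REPORT_MISSED_EVENTS) > 0);
}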

Jason
