* [PATCH] ipoib: Fix lockup of the tx queue
@ 2010-03-03 12:27 Eli Cohen
       [not found] ` <20100303122752.GA29784-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Cohen @ 2010-03-03 12:27 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux RDMA list, ewg

The ipoib UD QP reports send completions to priv->send_cq, which is generally
unarmed; it only gets armed when the number of outstanding send requests
(i.e. those for which a completion has not been polled yet) reaches the size of
the tx queue. This arming (done using ib_req_notify_cq()) is done only in the
send path for the UD QP. However, when sending CM packets, the net queue may be
stopped for the same reason, but no measures are taken to recover the UD path
from a lockup.

Consider this scenario: a host sends a high rate of both CM and UD packets.
Suppose also that the tx queue length is N. If at some time the number of
outstanding UD packets is more than N/2 and the overall number of outstanding
packets is N-1, and CM now sends a packet, making the number of outstanding
packets equal to N, the tx queue will be stopped. When all the CM packets
complete, the number of outstanding packets will still be higher than N/2, so
the tx queue will not be re-enabled.

Fix this by calling ib_req_notify_cq() when the queue is stopped in the CM
path.

Signed-off-by: Eli Cohen <eli-VPRAkNaXOzVS1MOuV/RT9w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 30bdf42..f8302c2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -752,6 +752,8 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 		if (++priv->tx_outstanding == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
 				  tx->qp->qp_num);
+			if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+				ipoib_warn(priv, "request notify on send CQ failed\n");
 			netif_stop_queue(dev);
 		}
 	}
-- 
1.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
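
The scenario in the patch description can be checked with a small userspace
model of the shared tx accounting. This is only a sketch under stated
assumptions, not the driver code: struct tx_model, cm_send(),
scenario_locks_up() and every other name below are hypothetical, CM
completions are assumed to be handled as they arrive, and UD completions are
assumed to be reaped only when the send CQ has been armed (because a stopped
queue means the UD send path no longer runs). The wake threshold of N/2 is
taken from the description above. The model assumes N >= 4.

```c
#include <stdbool.h>

/* Toy model of the tx accounting shared by the UD and CM paths.
 * All names are hypothetical; this is not the ipoib driver code. */
struct tx_model {
	int sendq_size;     /* N: size of the tx queue */
	int ud_outstanding; /* UD sends whose completion was not polled yet */
	int cm_outstanding; /* CM sends whose completion was not handled yet */
	bool queue_stopped; /* stands in for netif_stop_queue()/wake */
	bool send_cq_armed; /* stands in for ib_req_notify_cq() on send_cq */
};

static int tx_outstanding(const struct tx_model *m)
{
	return m->ud_outstanding + m->cm_outstanding;
}

/* Wake the queue once outstanding sends are no longer above N/2. */
static void maybe_wake(struct tx_model *m)
{
	if (m->queue_stopped && tx_outstanding(m) <= m->sendq_size / 2)
		m->queue_stopped = false;
}

/* CM send path: arm_on_stop == false models the code before the patch,
 * arm_on_stop == true models the code with the patch applied. */
static void cm_send(struct tx_model *m, bool arm_on_stop)
{
	m->cm_outstanding++;
	if (tx_outstanding(m) == m->sendq_size) {
		if (arm_on_stop)
			m->send_cq_armed = true;
		m->queue_stopped = true;
	}
}

/* Assumption: CM completions are always handled as they arrive. */
static void cm_complete_all(struct tx_model *m)
{
	while (m->cm_outstanding > 0) {
		m->cm_outstanding--;
		maybe_wake(m);
	}
}

/* Assumption: with the queue stopped, UD completions are reaped only
 * if the send CQ was armed and therefore raises an event. */
static void ud_complete_all(struct tx_model *m)
{
	if (!m->send_cq_armed)
		return;
	m->send_cq_armed = false;
	while (m->ud_outstanding > 0) {
		m->ud_outstanding--;
		maybe_wake(m);
	}
}

/* Start with more than N/2 UD sends and N-1 sends overall, let CM send
 * one more packet, then complete everything that can complete.
 * Returns true if the queue is still stopped at the end. */
bool scenario_locks_up(int n, bool arm_on_stop)
{
	struct tx_model m = {
		.sendq_size = n,
		.ud_outstanding = n / 2 + 1,
		.cm_outstanding = n - 1 - (n / 2 + 1),
		.queue_stopped = false,
		.send_cq_armed = false,
	};

	cm_send(&m, arm_on_stop);
	cm_complete_all(&m);
	ud_complete_all(&m);
	return m.queue_stopped;
}
```

In this model, scenario_locks_up(8, false) returns true (the queue stays
stopped with 5 UD sends outstanding), while scenario_locks_up(8, true)
returns false because arming the CQ at stop time lets the UD completions
drain and wake the queue.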


* Re: [PATCH] ipoib: Fix lockup of the tx queue
       [not found] ` <20100303122752.GA29784-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
@ 2010-03-11 21:38   ` Roland Dreier
       [not found]     ` <adaiq92czdt.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Roland Dreier @ 2010-03-11 21:38 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Linux RDMA list, ewg

good debugging, applied thanks.

I do worry (as Moni mentioned) that this doesn't explain why you would
get send failures in this case, but the patch itself is well-explained
and looks "obviously correct" so I think we should apply it.
-- 
Roland Dreier  <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

* Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
       [not found]     ` <adaiq92czdt.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-03-11 21:41       ` Ralph Campbell
       [not found]         ` <1268343670.2255.44.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  2010-03-14  6:52       ` Eli Cohen
  1 sibling, 1 reply; 10+ messages in thread
From: Ralph Campbell @ 2010-03-11 21:41 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Eli Cohen, Linux RDMA list, ewg

On Thu, 2010-03-11 at 13:38 -0800, Roland Dreier wrote:
> good debugging, applied thanks.
> 
> I do worry (as Moni mentioned) that this doesn't explain why you would
> get send failures in this case, but the patch itself is well-explained
> and looks "obviously correct" so I think we should apply it.

Well, after more testing it seems there may still be a problem.
I haven't isolated it yet though. I could definitely use help
reviewing the code changes.


* Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
       [not found]         ` <1268343670.2255.44.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-03-11 21:45           ` Ralph Campbell
       [not found]             ` <1268343937.2255.46.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Ralph Campbell @ 2010-03-11 21:45 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux RDMA list, Eli Cohen, ewg

Sorry, I was referring to my patch not Eli's.

On Thu, 2010-03-11 at 13:41 -0800, Ralph Campbell wrote:
> On Thu, 2010-03-11 at 13:38 -0800, Roland Dreier wrote:
> > good debugging, applied thanks.
> > 
> > I do worry (as Moni mentioned) that this doesn't explain why you would
> > get send failures in this case, but the patch itself is well-explained
> > and looks "obviously correct" so I think we should apply it.
> 
> Well, after more testing it seems there may still be a problem.
> I haven't isolated it yet though. I could definitely use help
> reviewing the code changes.
> 
> _______________________________________________
> ewg mailing list
> ewg-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5@public.gmane.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg



* Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
       [not found]             ` <1268343937.2255.46.camel-/vjeY7uYZjrPXfVEPVhPGq6RkeBMCJyt@public.gmane.org>
@ 2010-03-11 21:52               ` Roland Dreier
       [not found]                 ` <ada3a06cyqw.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Roland Dreier @ 2010-03-11 21:52 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: Linux RDMA list, Eli Cohen, ewg

 > Sorry, I was referring to my patch not Eli's.

Heh, I never would have said your patch was "obvious".
I skimmed yours once but I do want to read it more carefully.

Did you ever say what test case you are using to provoke the problem you're fixing?
-- 
Roland Dreier  <rolandd-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org>
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html

* Re: [ewg] [PATCH] ipoib: Fix lockup of the tx queue
       [not found]                 ` <ada3a06cyqw.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
@ 2010-03-11 22:04                   ` Ralph Campbell
  0 siblings, 0 replies; 10+ messages in thread
From: Ralph Campbell @ 2010-03-11 22:04 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Linux RDMA list, Eli Cohen, ewg

On Thu, 2010-03-11 at 13:52 -0800, Roland Dreier wrote:
> > Sorry, I was referring to my patch not Eli's.
> 
> Heh, I never would have said your patch was "obvious".
> I skimmed yours once but I do want to read it more carefully.
> 
> Did you ever say what test case you are using to provoke the problem you're fixing?

I think I did but it is just UDP stress tests in general.
Throwing in some link failures and switching between connected
and datagram modes helps too. netperf, qperf, etc. should work.
Anything which causes the connected mode QP to fail should
exercise the fix too.


* Re: [PATCH] ipoib: Fix lockup of the tx queue
       [not found]     ` <adaiq92czdt.fsf-BjVyx320WGW9gfZ95n9DRSW4+XlvGpQz@public.gmane.org>
  2010-03-11 21:41       ` [ewg] " Ralph Campbell
@ 2010-03-14  6:52       ` Eli Cohen
       [not found]         ` <20100314065238.GA23263-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Eli Cohen @ 2010-03-14  6:52 UTC (permalink / raw)
  To: Roland Dreier, jjengla-Re5JQEeQqe8AvxtiuMwx3w
  Cc: Eli Cohen, Linux RDMA list, ewg

On Thu, Mar 11, 2010 at 01:38:38PM -0800, Roland Dreier wrote:
> 
> I do worry (as Moni mentioned) that this doesn't explain why you would
> get send failures in this case, but the patch itself is well-explained
> and looks "obviously correct" so I think we should apply it.

It could be a problem in the hardware driver.
Josh, can you tell what kind of hardware you were using?

* Re: [PATCH] ipoib: Fix lockup of the tx queue
       [not found]         ` <20100314065238.GA23263-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
@ 2010-03-15 16:25           ` Josh England
       [not found]             ` <a72123c41003150925t173815e6lc1e94c999be45357-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Josh England @ 2010-03-15 16:25 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Roland Dreier, Eli Cohen, Linux RDMA list, ewg

Everything has MT264328 ConnectX cards using the mlx4_ib driver.
Boot/file servers are using an HP OEM 2.7.000 firmware.  Compute nodes
have cards using Sun OEM 2.6.200 FW.

-JE

On Sat, Mar 13, 2010 at 10:52 PM, Eli Cohen <eli-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On Thu, Mar 11, 2010 at 01:38:38PM -0800, Roland Dreier wrote:
>>
>> I do worry (as Moni mentioned) that this doesn't explain why you would
>> get send failures in this case, but the patch itself is well-explained
>> and looks "obviously correct" so I think we should apply it.
>
> It could be a problem in the hardware driver.
> Josh, can you tell what kind of hardware you were using?
>

* Re: [PATCH] ipoib: Fix lockup of the tx queue
       [not found]             ` <a72123c41003150925t173815e6lc1e94c999be45357-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-03-16  6:33               ` Eli Cohen
       [not found]                 ` <20100316063323.GA1887-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Eli Cohen @ 2010-03-16  6:33 UTC (permalink / raw)
  To: Josh England; +Cc: Roland Dreier, Eli Cohen, Linux RDMA list, ewg

On Mon, Mar 15, 2010 at 08:25:51AM -0800, Josh England wrote:
> Everything has MT264328 ConnectX cards using the mlx4_ib driver.
> Boot/file servers are using an HP OEM 2.7.000 firmware.  Compute nodes
> have cards using Sun OEM 2.6.200 FW.
> 

You probably mean MT26428? Anyway, do you still see the post send
failed messages? If you do, could you apply this patch so we'll have
better insight into the reason?
http://patchwork.kernel.org/patch/83593/

* Re: [PATCH] ipoib: Fix lockup of the tx queue
       [not found]                 ` <20100316063323.GA1887-8YAHvHwT2UEvbXDkjdHOrw/a8Rv0c6iv@public.gmane.org>
@ 2010-03-16 16:27                   ` Josh England
  0 siblings, 0 replies; 10+ messages in thread
From: Josh England @ 2010-03-16 16:27 UTC (permalink / raw)
  To: Eli Cohen; +Cc: Roland Dreier, Eli Cohen, Linux RDMA list, ewg

On Mon, Mar 15, 2010 at 11:33 PM, Eli Cohen <eli-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On Mon, Mar 15, 2010 at 08:25:51AM -0800, Josh England wrote:
>> Everything has MT264328 ConnectX cards using the mlx4_ib driver.
>> Boot/file servers are using an HP OEM 2.7.000 firmware.  Compute nodes
>> have cards using Sun OEM 2.6.200 FW.
>>
>
> You probably mean MT26428?

Yeah...threw an extra digit in there...

> Anyway, do you still see the post send
> failed messages? If you do, could you apply this patch so we'll have
> better insight into the reason?
> http://patchwork.kernel.org/patch/83593/

I'll throw the patch in and try to get some datagram-mode testing in
soon.  I haven't gone back since the CM code fix.

-JE