* RQ overflow seen running isert traffic
@ 2016-09-27  7:01 Potnuri Bharat Teja
       [not found] ` <20160927070157.GA13140-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
  2016-10-05  6:14 ` Sagi Grimberg
  0 siblings, 2 replies; 24+ messages in thread
From: Potnuri Bharat Teja @ 2016-09-27  7:01 UTC (permalink / raw)
  To: sagi; +Cc: swise, target-devel, nab, linux-rdma

Hi Sagi,
I've been trying to understand the isert functionality with respect to
RDMA Receive Queue sizing and Queue full handling. Here is the problem
I see with iw_cxgb4:

After running a few minutes of iSER traffic with iw_cxgb4, I am seeing
post-receive failures returning -ENOMEM because the receive queue is full.
In the case of iw_cxgb4 the RQ size is 130, with qp attribute max_recv_wr = 129
passed down by isert to iw_cxgb4. isert decides on max_recv_wr as 129 based
on (ISERT_QP_MAX_RECV_DTOS = ISCSI_DEF_XMIT_CMDS_MAX = 128) + 1.
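
For reference, a minimal standalone sketch of the sizing arithmetic just described
(illustrative only, not driver source; the macro names follow the isert code quoted
later in this thread):

/* Recv queue sizing as described above: 128 commands' worth of recv
 * descriptors plus one extra slot gives max_recv_wr = 129; iw_cxgb4 then
 * sizes its RQ one entry larger (130, see Steve's clarification below). */
#include <stdio.h>

#define ISCSI_DEF_XMIT_CMDS_MAX 128
#define ISERT_QP_MAX_RECV_DTOS  ISCSI_DEF_XMIT_CMDS_MAX

int main(void)
{
        int max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;   /* 129, passed down to iw_cxgb4 */
        int rq_size = max_recv_wr + 1;                  /* 130, iw_cxgb4's internal extra entry */

        printf("max_recv_wr=%d rq_size=%d\n", max_recv_wr, rq_size);
        return 0;
}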

My debug suggests that at some point isert tries to post more than
129 receive WRs into the RQ and fails because the queue is already full. From
the code, most of the recv WRs are posted only after a receive completion,
but a few datain operations (isert_put_datain()) are done independent of
receive completions. In fact, the last WR that failed to post into the RQ came from
isert_put_datain() through target_complete_ok_work(). CQ stats at the
time of failure show the CQ polled to empty.

How could I check for the possible conditions leading to this issue?
I am happy to provide any further details.
Thanks for your time.

Bharat.


* RE: RQ overflow seen running isert traffic
       [not found] ` <20160927070157.GA13140-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2016-09-29 14:12   ` Steve Wise
  0 siblings, 0 replies; 24+ messages in thread
From: Steve Wise @ 2016-09-29 14:12 UTC (permalink / raw)
  To: 'Potnuri Bharat Teja', sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: target-devel-u79uwXL29TY76Z2rM5mHXA, nab-IzHhD5pYlfBP7FQvKIMDCQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> Hi Sagi,
> I've been trying to understand the isert functionality with respect to
> RDMA Receive Queue sizing and Queue full handling. Here is the problem
> is see with iw_cxgb4:
> 
> After running few minutes of iSER traffic with iw_cxgb4, I am seeing
> post receive failures due to receive queue full returning -ENOMEM.
> In case of iw_cxgb4 the RQ size is 130 with qp attribute max_recv_wr = 129,
> passed down by isert to iw_cxgb4.isert decides on max_recv_wr as 129 based
> on (ISERT_QP_MAX_RECV_DTOS = ISCSI_DEF_XMIT_CMDS_MAX = 128) + 1.
> 

Just to clarify:  The RQ size of 130 in iw_cxgb4 has to do with how the SW
and HW share the queue state.  It will always be at least 1 more than the
application's size.  So the RQ depth is 129, and the failing post_recv() is being
issued while the RQ is already full, which should never happen...

> My debug suggests that at some point isert tries to post more than
> 129 receive WRs into the RQ and fails as the queue is full already. From
> the code most of the recv wr are posted only after a recieve completion,
> but few datain operations(isert_put_datain()) are done independent of
> receive completions. In fact the last WR failed to post in to RQ is from
> isert_put_datain() through target_complete_ok_work(). CQ stats at the
> time of failure shows the cq polled to empty.
> 
> How could I check for the possible conditions leading to this issue?
> I am happy to provide you any details further.
> Thanks for your time.
>
> Bharat.



* Re: RQ overflow seen running isert traffic
  2016-09-27  7:01 RQ overflow seen running isert traffic Potnuri Bharat Teja
       [not found] ` <20160927070157.GA13140-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2016-10-05  6:14 ` Sagi Grimberg
  2016-10-17 11:16   ` Potnuri Bharat Teja
  1 sibling, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2016-10-05  6:14 UTC (permalink / raw)
  To: Potnuri Bharat Teja; +Cc: swise, target-devel, nab, linux-rdma


> Hi Sagi,

Hey Bharat,

Sorry for the late response, it's the holiday
season in Israel...

> I've been trying to understand the isert functionality with respect to
> RDMA Receive Queue sizing and Queue full handling. Here is the problem
> is see with iw_cxgb4:
>
> After running few minutes of iSER traffic with iw_cxgb4, I am seeing
> post receive failures due to receive queue full returning -ENOMEM.
> In case of iw_cxgb4 the RQ size is 130 with qp attribute max_recv_wr = 129,
> passed down by isert to iw_cxgb4.isert decides on max_recv_wr as 129 based
> on (ISERT_QP_MAX_RECV_DTOS = ISCSI_DEF_XMIT_CMDS_MAX = 128) + 1.

That's correct.

>
> My debug suggests that at some point isert tries to post more than
> 129 receive WRs into the RQ and fails as the queue is full already. From
> the code most of the recv wr are posted only after a recieve completion,
> but few datain operations(isert_put_datain()) are done independent of
> receive completions.

Interesting. I suspect the reason this issue hasn't come up before is that
the devices I used to test with allocate the send/recv queues in
the next power of 2 (which would be 256), which was enough to hide
this, I guess...

We repost the recv buffer under the following conditions:
1. We are queueing data + response (datain) or just response (dataout)
and we are done with the recv buffer.
2. We got an unsolicited dataout.

Can you please turn off unsolicited dataouts and see if this
still happens? (InitialR2T=Yes)

> In fact the last WR failed to post in to RQ is from
> isert_put_datain() through target_complete_ok_work(). CQ stats at the
> time of failure shows the cq polled to empty.

That is strange, each scsi command should trigger iscsit_queue_data_in
just once. Can you provide evidence of a command that triggers it more
than once?

Another possible reason is that we somehow get to put_data_in and
put_response for the same command (which we should never do because
we handle the response in put_data_in).

Thanks for reporting.
Sagi.


* Re: RQ overflow seen running isert traffic
  2016-10-05  6:14 ` Sagi Grimberg
@ 2016-10-17 11:16   ` Potnuri Bharat Teja
  2016-10-17 18:29     ` Steve Wise
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2016-10-17 11:16 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: SWise OGC, target-devel, nab, linux-rdma

On Wednesday, October 10/05/16, 2016 at 11:44:12 +0530, Sagi Grimberg wrote:
> 
> > Hi Sagi,
> 
> Hey Baharat,
> 
> Sorry for the late response, its the holiday
> season in Israel...
> 
> > I've been trying to understand the isert functionality with respect to
> > RDMA Receive Queue sizing and Queue full handling. Here is the problem
> > is see with iw_cxgb4:
> >
> > After running few minutes of iSER traffic with iw_cxgb4, I am seeing
> > post receive failures due to receive queue full returning -ENOMEM.
> > In case of iw_cxgb4 the RQ size is 130 with qp attribute max_recv_wr = 129,
> > passed down by isert to iw_cxgb4.isert decides on max_recv_wr as 129 based
> > on (ISERT_QP_MAX_RECV_DTOS = ISCSI_DEF_XMIT_CMDS_MAX = 128) + 1.
> 
> That's correct.

Hi Sagi,
Sorry for the late reply, I had to recheck my findings before replying
back to you.

My interpretation of the queue-full issue was not complete; I got carried
away by the receive queue and missed the SQ-full failure among
the debug logs.

Here is what was happening:
The SQ fills up first and the post fails with ENOMEM; because of this the
command is queued to the queue-full list, which schedules it for posting at a
later time, and this repeated retrying eventually causes the RQ to fill up as
well: every further try of the SQ post will fail anyway, while an extra RQ WR
is posted as part of datain(), leading the RQ to become full too.

This happened a bit earlier in my case, since I had corrected
isert_put_datain() to return an error to LIO.

Here is the failure log on a fresh 4.8 kernel:
isert: isert_rdma_rw_ctx_post: Cmd: ffff882ec8c96e60 failed to post RDMA
res <===here is the post send failure due to ENOMEM.
ABORT_TASK: Found referenced iSCSI task_tag: 33
ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33
isert: isert_post_recv: ib_post_recv() failed with ret: -22
isert: isert_post_response: ib_post_recv failed with -22
isert: isert_post_recv: ib_post_recv() failed with ret: -22
isert: isert_post_response: ib_post_recv failed with -22
isert: isert_post_recv: ib_post_recv() failed with ret: -22
isert: isert_post_response: ib_post_recv failed with -22
isert: isert_post_recv: ib_post_recv() failed with ret: -22
isert: isert_post_response: ib_post_recv failed with -22
isert: isert_post_recv: ib_post_recv() failed with ret: -22
isert: isert_post_response: ib_post_recv failed with -22
cxgb4 0000:84:00.4: AE qpid 1026 opcode 10 status 0x1 type 1 len 0x0
wrid.hi 0x0 wrid.lo 0x4c01
isert: isert_qp_event_callback: QP access error (3): conn
ffff8817ddc46000
iSCSI Login timeout on Network Portal 10.40.40.198:3260
INFO: task iscsi_np:14744 blocked for more than 120 seconds.

Here are the QP and CQ stats before my assert for queue full:
qhp->attr.sq_num_entries 523
qhp->attr.rq_num_entries 129
qhp->wq.sq.qid 1026
qhp->wq.rq.qid 1027
qhp->wq.sq.in_use 523 <=== SQ to the brim
qhp->wq.sq.size 524 <=== SQ size
qhp->wq.sq.cidx 391
qhp->wq.sq.pidx 390
qhp->wq.sq.wq_pidx 202
qhp->wq.sq.wq_pidx_inc 0
qhp->wq.sq.flush_cidx 391
qhp->wq.rq.in_use 128
qhp->wq.rq.size 130
qhp->wq.rq.cidx 112
qhp->wq.rq.pidx 110
qhp->wq.rq.wq_pidx 240
qhp->wq.rq.wq_pidx_inc 0
qhp->wq.flushed 0
chp->cq.cqid 1024
chp->cq.size 6335
chp->cq.cidx 4978
chp->cq.sw_cidx 4126
chp->cq.sw_pidx 4126
chp->cq.sw_in_use 0
chp->cq.cidx_inc 0

As an experiment I tried increasing ISCSI_DEF_XMIT_CMDS_MAX to 256
instead of 128, which in the case of iWARP creates an SQ of size 1548, and the
issue is not seen.
I suspect the SQ is either not sized properly in the case of iWARP, or is
factored incorrectly in the rdma API for iSER IO.

I am digging for the root cause.

Thanks for your time,
Bharat.

> 
> >
> > My debug suggests that at some point isert tries to post more than
> > 129 receive WRs into the RQ and fails as the queue is full already. From
> > the code most of the recv wr are posted only after a recieve completion,
> > but few datain operations(isert_put_datain()) are done independent of
> > receive completions.
> 
> Interesting. I suspect that this issue haven't come up is that
> the devices I used to test with allocate the send/recv queues in
> the next power of 2 (which would be 256) which was enough to hide
> this I guess...
> 
> We repost the recv buffer under the following conditions:
> 1. We are queueing data + response (datain) or just response (dataout)
> and we are done with the recv buffer.
> 2. We got a unsolicited dataout.
> 
> Can you please turn off unsolicited dataouts and see if this
> still happen? (InitialR2T=Yes)
> 
> > In fact the last WR failed to post in to RQ is from
> > isert_put_datain() through target_complete_ok_work(). CQ stats at the
> > time of failure shows the cq polled to empty.
> 
> That is strange, each scsi command should trigger iscsit_queue_data_in
> just once. Can you provide evidence of a command that triggers it more
> than once?
> 
> Another possible reason is that we somehow get to put_data_in and
> put_response for the same command (which we should never do because
> we handle the response in put_data_in).
> 
> Thanks for reporting.
> Sagi.


* RE: RQ overflow seen running isert traffic
  2016-10-17 11:16   ` Potnuri Bharat Teja
@ 2016-10-17 18:29     ` Steve Wise
  2016-10-18  8:04       ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Steve Wise @ 2016-10-17 18:29 UTC (permalink / raw)
  To: 'Potnuri Bharat Teja', 'Sagi Grimberg'
  Cc: target-devel, nab, linux-rdma

> On Wednesday, October 10/05/16, 2016 at 11:44:12 +0530, Sagi Grimberg wrote:
> >
> > > Hi Sagi,
> >
> > Hey Baharat,
> >
> > Sorry for the late response, its the holiday
> > season in Israel...
> >
> > > I've been trying to understand the isert functionality with respect to
> > > RDMA Receive Queue sizing and Queue full handling. Here is the problem
> > > is see with iw_cxgb4:
> > >
> > > After running few minutes of iSER traffic with iw_cxgb4, I am seeing
> > > post receive failures due to receive queue full returning -ENOMEM.
> > > In case of iw_cxgb4 the RQ size is 130 with qp attribute max_recv_wr =
129,
> > > passed down by isert to iw_cxgb4.isert decides on max_recv_wr as 129 based
> > > on (ISERT_QP_MAX_RECV_DTOS = ISCSI_DEF_XMIT_CMDS_MAX = 128) + 1.
> >
> > That's correct.
> 
> Hi Sagi,
> Sorry for the late reply, I had to recheck my findings before I reply
> you back.
> 
> My interpretation of the queue full issue was not complete, got carried
> away by the recieve queue and missed the SQ failure due to full among
> the debug logs.
> 
> Here is what was happening:
> It was the SQ full first and fails to get posted with ENOMEM, due to this
> the command is queued to queue full list this will schedule it to post at
> later instance and this repeated try for posting will cause the rq to get
> full. For everyfurther try sq post will any way fail and an extra rq wr is
> posted as a part of datain() leading it to be full too.
> 
> This happened a bit earlier in my case, since I corrected the the
> iser_put_datain() to return error to LIO.
> 
> Here is the failure log on a fresh 4.8 kernel:
> isert: isert_rdma_rw_ctx_post: Cmd: ffff882ec8c96e60 failed to post RDMA
> res <===here is the post send failure due to ENOMEM.
> ABORT_TASK: Found referenced iSCSI task_tag: 33
> ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 33
> isert: isert_post_recv: ib_post_recv() failed with ret: -22
> isert: isert_post_response: ib_post_recv failed with -22
> isert: isert_post_recv: ib_post_recv() failed with ret: -22
> isert: isert_post_response: ib_post_recv failed with -22
> isert: isert_post_recv: ib_post_recv() failed with ret: -22
> isert: isert_post_response: ib_post_recv failed with -22
> isert: isert_post_recv: ib_post_recv() failed with ret: -22
> isert: isert_post_response: ib_post_recv failed with -22
> isert: isert_post_recv: ib_post_recv() failed with ret: -22
> isert: isert_post_response: ib_post_recv failed with -22
> cxgb4 0000:84:00.4: AE qpid 1026 opcode 10 status 0x1 type 1 len 0x0
> wrid.hi 0x0 wrid.lo 0x4c01
> isert: isert_qp_event_callback: QP access error (3): conn
> ffff8817ddc46000
> iSCSI Login timeout on Network Portal 10.40.40.198:3260
> INFO: task iscsi_np:14744 blocked for more than 120 seconds.
> 
> Here are the QP and CQ stats before my assert for queue full:
> qhp->attr.sq_num_entries 523
> qhp->attr.rq_num_entries 129
> qhp->wq.sq.qid 1026
> qhp->wq.rq.qid 1027
> qhp->wq.sq.in_use 523 <=== SQ to the brim
> qhp->wq.sq.size 524 <=== SQ size
> qhp->wq.sq.cidx 391
> qhp->wq.sq.pidx 390
> qhp->wq.sq.wq_pidx 202
> qhp->wq.sq.wq_pidx_inc 0
> qhp->wq.sq.flush_cidx 391
> qhp->wq.rq.in_use 128
> qhp->wq.rq.size 130
> qhp->wq.rq.cidx 112
> qhp->wq.rq.pidx 110
> qhp->wq.rq.wq_pidx 240
> qhp->wq.rq.wq_pidx_inc 0
> qhp->wq.flushed 0
> chp->cq.cqid 1024
> chp->cq.size 6335
> chp->cq.cidx 4978
> chp->cq.sw_cidx 4126
> chp->cq.sw_pidx 4126
> chp->cq.sw_in_use 0
> chp->cq.cidx_inc 0
> 
> As an experiment I tried increasing the ISCSI_DEF_XMIT_CMDS_MAX to 256
> instead of 128, which incase of iwarp creates SQ with size 1548, the
> issue is not seen.
> I doubt if the SQ is not sized properly incase of iWARP Or factored
> incorrectly in the rdma api for iser IOP.
> 
> I am digging for root cause.
> 
> Thanks for your time,
> Bharat.
> 

Hey Sagi, I'm looking at isert_create_qp() and it appears to not be correctly
sizing the SQ:

...
#define ISERT_QP_MAX_REQ_DTOS   (ISCSI_DEF_XMIT_CMDS_MAX +    \
                                ISERT_MAX_TX_MISC_PDUS  + \
                                ISERT_MAX_RX_MISC_PDUS)
...
        attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
        attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
...

I think the above snippet assumes a DTO consumes exactly one WR/WQE in the SQ.  But
the DTO can be broken into multiple WRs to handle REG_MRs, or multiple WRITE or
READ WRs due to limits on local and target sge depths, etc.  Yes?  Or am
I all wet?  Or perhaps isert doesn't require the SQ to be the max possible
because it flow-controls the DTO submissions?

Stevo


* Re: RQ overflow seen running isert traffic
  2016-10-17 18:29     ` Steve Wise
@ 2016-10-18  8:04       ` Sagi Grimberg
  2016-10-18 11:28         ` SQ " Potnuri Bharat Teja
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2016-10-18  8:04 UTC (permalink / raw)
  To: Steve Wise, 'Potnuri Bharat Teja'
  Cc: target-devel-u79uwXL29TY76Z2rM5mHXA, nab-IzHhD5pYlfBP7FQvKIMDCQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hey Steve and Bharat,

> Hey Sagi, I'm looking at isert_create_qp() and it appears to not be correctly
> sizing the SQ:
> ...
> #define ISERT_QP_MAX_REQ_DTOS   (ISCSI_DEF_XMIT_CMDS_MAX +    \
>                                 ISERT_MAX_TX_MISC_PDUS  + \
>                                 ISERT_MAX_RX_MISC_PDUS)
> ...
>         attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
>         attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
> ...
>
> I think above snipit assumes a DTO consumes exactly one WR/WQE in the SQ.  But
> the DTO can be broken into multiple WRs to handle REG_MRs, multiple WRITE or
> READ WRs due to limits on local sge depths target sge depths, etc.  Yes?  Or am
> I all wet?  Or perhaps isert doesn't require the SQ to be the max possible
> because it flow controls the DTO submissions?

I think you are correct.

On my test devices, I didn't see that multiple WRs had any effect
because:
1. My test devices usually round up to the next power of 2 (256).
2. Workloads that involved multiple rdma operations never stressed the
system enough to get the queues full.

Now, in iWARP, for non-immediate writes we'll need more than a single
WR per IO (I think the SQ size is expanded with the new rdma RW API,
which implicitly increases it via attr.cap.max_rdma_ctxs).

But I do agree that we need to take into account that each IO needs
at least 2 WRs (one for rdma and one for send).

So a temp bandage would be:
--
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index fc791efe3a10..81afb95aeea9 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -54,8 +54,14 @@

  #define ISERT_MIN_POSTED_RX    (ISCSI_DEF_XMIT_CMDS_MAX >> 2)

-#define ISERT_QP_MAX_REQ_DTOS  (ISCSI_DEF_XMIT_CMDS_MAX +    \
-                               ISERT_MAX_TX_MISC_PDUS  + \
+/*
+ * Max QP send work requests consist of:
+ * - RDMA + SEND for each iscsi IO
+ * - iscsi misc TX pdus
+ * - iscsi misc RX response pdus
+ */
+#define ISERT_QP_MAX_REQ_DTOS  ((ISCSI_DEF_XMIT_CMDS_MAX * 2 ) +       \
+                               ISERT_MAX_TX_MISC_PDUS          +       \
                                 ISERT_MAX_RX_MISC_PDUS)

  #define ISER_RX_PAD_SIZE       (ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
--

But we do need to track the SQ overflow and queue a retransmit work when
we don't have enough available SQ slots..
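
A minimal sketch of the kind of SQ slot accounting this would need (plain C
illustration under assumed semantics, not isert code): reserve slots before
posting, return them from the send completion handler, and defer the command
for retry when no slots are left.

#include <stdatomic.h>
#include <stdbool.h>

struct sq_account {
        atomic_int avail;               /* free SQ WQE slots */
};

/* Reserve 'wrs' slots before calling ib_post_send(); on false the caller
 * queues the command for a later retry instead of overflowing the SQ. */
bool sq_reserve(struct sq_account *sq, int wrs)
{
        int old = atomic_load(&sq->avail);

        do {
                if (old < wrs)
                        return false;
        } while (!atomic_compare_exchange_weak(&sq->avail, &old, old - wrs));
        return true;
}

/* Called from the send completion handler: slots become available again,
 * which is also the natural point to re-drive any deferred commands. */
void sq_release(struct sq_account *sq, int wrs)
{
        atomic_fetch_add(&sq->avail, wrs);
}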

Thoughts?


* Re: SQ overflow seen running isert traffic
  2016-10-18  8:04       ` Sagi Grimberg
@ 2016-10-18 11:28         ` Potnuri Bharat Teja
  2016-10-18 13:17           ` Sagi Grimberg
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2016-10-18 11:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Steve Wise, target-devel, nab, linux-rdma

On Tuesday, October 10/18/16, 2016 at 11:04:00 +0300, Sagi Grimberg wrote:
> Hey Steve and Baharat,
> 
> >Hey Sagi, I'm looking at isert_create_qp() and it appears to not be correctly
> >sizing the SQ:
> >...
> >#define ISERT_QP_MAX_REQ_DTOS   (ISCSI_DEF_XMIT_CMDS_MAX +    \
> >                                ISERT_MAX_TX_MISC_PDUS  + \
> >                                ISERT_MAX_RX_MISC_PDUS)
> >...
> >        attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS + 1;
> >        attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
> >...
> >
> >I think above snipit assumes a DTO consumes exactly one WR/WQE in the SQ.  But
> >the DTO can be broken into multiple WRs to handle REG_MRs, multiple WRITE or
> >READ WRs due to limits on local sge depths target sge depths, etc.  Yes?  Or am
> >I all wet?  Or perhaps isert doesn't require the SQ to be the max possible
> >because it flow controls the DTO submissions?
> 
> I think you are correct.
> 
> On my test devices, I didn't see that multiple WRs has had any effect
> becuase:
> 1. My test devices usually give next power of 2 (256)
> 2. workloads that involved multiple rdma operations never stressed the
> system enough to get the queues full.
> 
> Now, in iWARP for non-immediate writes we'll need more than a single
> wr per IO (I think the SQ size is expanded with the new rdma RW API
> which implicitly increases with attr.cap.max_rdma_ctxs).
Yes, the rdma RW API factors in attr.cap.max_rdma_ctxs based on the attr flags.
> 
> But I do agree that we need to take into account that each IO needs
> at least 2 WRs (one for rdma and one for send).
> 
> So a temp bandage would be:
> --
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.h
> b/drivers/infiniband/ulp/isert/ib_isert.h
> index fc791efe3a10..81afb95aeea9 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.h
> +++ b/drivers/infiniband/ulp/isert/ib_isert.h
> @@ -54,8 +54,14 @@
> 
>  #define ISERT_MIN_POSTED_RX    (ISCSI_DEF_XMIT_CMDS_MAX >> 2)
> 
> -#define ISERT_QP_MAX_REQ_DTOS  (ISCSI_DEF_XMIT_CMDS_MAX +    \
> -                               ISERT_MAX_TX_MISC_PDUS  + \
> +/*
> + * Max QP send work requests consist of:
> + * - RDMA + SEND for each iscsi IO
> + * - iscsi misc TX pdus
> + * - iscsi misc RX response pdus
> + */
> +#define ISERT_QP_MAX_REQ_DTOS  ((ISCSI_DEF_XMIT_CMDS_MAX * 2 ) +       \
> +                               ISERT_MAX_TX_MISC_PDUS          +       \
>                                 ISERT_MAX_RX_MISC_PDUS)
> 
>  #define ISER_RX_PAD_SIZE       (ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
> --
> 
I tried out this change and it works fine with iWARP; I don't see SQ
overflow. Apparently we have made the SQ too big to overflow. I am going
to let it run with higher workloads for a longer time, to see if it holds good.
> But we do need to track the SQ overflow and queue a retransmit work when
> we don't have enough available SQ slots..
Agreed, iscsi-target (LIO in our case) expects a failure to be returned by
the overlying modules, or that may only be for TCP, as its handling is
different from that of iSER.
It queues the WR and schedules it for repost in case
of post failures with ENOMEM/EAGAIN. I shall send the necessary change if
it is needed.

Thanks,
Bharat.
> 
> Thoughts?


* Re: SQ overflow seen running isert traffic
  2016-10-18 11:28         ` SQ " Potnuri Bharat Teja
@ 2016-10-18 13:17           ` Sagi Grimberg
       [not found]             ` <ed7ebb39-be81-00b3-ef23-3f4c0e3afbb1-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2016-10-18 13:17 UTC (permalink / raw)
  To: Potnuri Bharat Teja; +Cc: Steve Wise, target-devel, nab, linux-rdma


> I tried out this change and it works fine with iwarp. I dont see SQ
> overflow. Apparently we have increased the sq too big to overflow. I am going
> to let it run with higher workloads for longer time, to see if it holds good.

Actually on second thought, this patch is overkill. Effectively we
now set:

MAX_CMD=266
and max_rdma_ctx=128, so together we take 394, which seems too much.

If we go by the scheme of 1 rdma + 1 send for each IO we need:
- 128 sends
- 128 rdmas
- 10 miscs

so this gives 266.

Perhaps this is due to the fact that iWARP needs to register memory for
rdma reads as well? (and also rdma writes > 128k for chelsio HW right?)

What is the workload you are running? with immediatedata enabled you
should issue reg+rdma_read+send only for writes > 8k.

Does this happen when you run only reads for example?

I guess it's time to get the SQ accounting into shape...


* RE: SQ overflow seen running isert traffic
       [not found]             ` <ed7ebb39-be81-00b3-ef23-3f4c0e3afbb1-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-10-18 14:34               ` Steve Wise
  2016-10-18 16:13                 ` Jason Gunthorpe
  2016-10-31  3:40                 ` Nicholas A. Bellinger
  0 siblings, 2 replies; 24+ messages in thread
From: Steve Wise @ 2016-10-18 14:34 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Potnuri Bharat Teja'
  Cc: target-devel-u79uwXL29TY76Z2rM5mHXA, nab-IzHhD5pYlfBP7FQvKIMDCQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

> 
> > I tried out this change and it works fine with iwarp. I dont see SQ
> > overflow. Apparently we have increased the sq too big to overflow. I am
going
> > to let it run with higher workloads for longer time, to see if it holds
good.
> 
> Actually on second thought, this patch is an overkill. Effectively we
> now set:
> 
> MAX_CMD=266
> and max_rdma_ctx=128 so together we take 394 which seems to too much.
> 
> If we go by the scheme of 1 rdma + 1 send for each IO we need:
> - 128 sends
> - 128 rdmas
> - 10 miscs
> 
> so this gives 266.
> 
> Perhaps this is due to the fact that iWARP needs to register memory for
> rdma reads as well? (and also rdma writes > 128k for chelsio HW right?)
>

iWARP definitely needs to register memory for the target of reads, due to the
REMOTE_WRITE requirement of the protocol.  The source of a write doesn't need
to register memory, but the SGE depth can cause multiple WRITE WRs to be
required to service the IO.  And in theory there should be some threshold where
it might be better performance-wise to do a memory register + 1 WRITE vs X
WRITEs.    

As you mentioned, the RW API should account for this, but perhaps it is still
off some.  Bharat, have a look into the RDMA-RW API and let us see if we can
figure out if the additional SQ depth it adds is sufficient.
 
> What is the workload you are running? with immediatedata enabled you
> should issue reg+rdma_read+send only for writes > 8k.
> 
> Does this happen when you run only reads for example?
> 
> I guess its time to get the sq accounting into shape...

So to sum up - 2 issues:

1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
overflows.  So the SQ sizing needs more investigation.

2) if the SQ is full, then the iSER/target code is supposed to resubmit.  And
apparently that isn't working.

Steve.



* Re: SQ overflow seen running isert traffic
  2016-10-18 14:34               ` Steve Wise
@ 2016-10-18 16:13                 ` Jason Gunthorpe
  2016-10-18 19:03                   ` Steve Wise
  2016-10-20  8:34                   ` Sagi Grimberg
  2016-10-31  3:40                 ` Nicholas A. Bellinger
  1 sibling, 2 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2016-10-18 16:13 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Sagi Grimberg', 'Potnuri Bharat Teja',
	target-devel, nab, linux-rdma

On Tue, Oct 18, 2016 at 09:34:33AM -0500, Steve Wise wrote:

> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
> overflows.  So the SQ sizing needs more investigation.

NFS had this sort of problem - in that case it was because the code
was assuming that an RQ completion implied SQ space - that is not
legal; only direct completions from SQ WCs can guide available space
in the SQ...

Jason


* RE: SQ overflow seen running isert traffic
  2016-10-18 16:13                 ` Jason Gunthorpe
@ 2016-10-18 19:03                   ` Steve Wise
  2016-10-20  8:34                   ` Sagi Grimberg
  1 sibling, 0 replies; 24+ messages in thread
From: Steve Wise @ 2016-10-18 19:03 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Sagi Grimberg', 'Potnuri Bharat Teja',
	target-devel, nab, linux-rdma

> On Tue, Oct 18, 2016 at 09:34:33AM -0500, Steve Wise wrote:
> 
> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
> > overflows.  So the SQ sizing needs more investigation.
> 
> NFS had this sort of problem - in that case it was because the code
> was assuming that a RQ completion implied SQ space - that is not
> legal, only direct completions from SQ WCs can guide available space
> in the SQ..
> 

Agreed.  And this could be the case Bharat is hitting. (I wonder if there is a
clever way we can tell?)  I don't see in the code that isert flow-controls SQ WR
submission based on SQ completions.

It would be nice to have some core API, maybe in rdma-rw, that could handle this
by queuing and processing when the SQ has resources...

Steve.


* Re: SQ overflow seen running isert traffic
  2016-10-18 16:13                 ` Jason Gunthorpe
  2016-10-18 19:03                   ` Steve Wise
@ 2016-10-20  8:34                   ` Sagi Grimberg
       [not found]                     ` <f7a4b395-1786-3c7a-7639-195e830db5ad-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  1 sibling, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2016-10-20  8:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Steve Wise
  Cc: 'Potnuri Bharat Teja', target-devel, nab, linux-rdma

Hey Jason,

>> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
>> overflows.  So the SQ sizing needs more investigation.
>
> NFS had this sort of problem - in that case it was because the code
> was assuming that a RQ completion implied SQ space - that is not
> legal, only direct completions from SQ WCs can guide available space
> in the SQ..

It's not the same problem. iser-target does not tie SQ and RQ spaces.
The origin here is the difference between IB/RoCE and iWARP and the
Chelsio HW, which makes it hard to predict the correct SQ size.

iWARP needs extra registration for rdma reads and the chelsio device
seems to be limited in the number of pages per registration so this
configuration will need a larger send queue.

Another problem is that we don't have a correct retry flow.

Hopefully we can address that in the RW API which is designed to hide
these details from the ULP...


* Re: SQ overflow seen running isert traffic
  2016-10-18 14:34               ` Steve Wise
  2016-10-18 16:13                 ` Jason Gunthorpe
@ 2016-10-31  3:40                 ` Nicholas A. Bellinger
  2016-11-02 17:03                   ` Steve Wise
       [not found]                   ` <1477885208.27946.8.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
  1 sibling, 2 replies; 24+ messages in thread
From: Nicholas A. Bellinger @ 2016-10-31  3:40 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Sagi Grimberg', 'Potnuri Bharat Teja',
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Steve, Potnuri, & Co,

On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
> > 
> > > I tried out this change and it works fine with iwarp. I dont see SQ
> > > overflow. Apparently we have increased the sq too big to overflow. I am
> going
> > > to let it run with higher workloads for longer time, to see if it holds
> good.
> > 
> > Actually on second thought, this patch is an overkill. Effectively we
> > now set:
> > 
> > MAX_CMD=266
> > and max_rdma_ctx=128 so together we take 394 which seems to too much.
> > 
> > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > - 128 sends
> > - 128 rdmas
> > - 10 miscs
> > 
> > so this gives 266.
> > 
> > Perhaps this is due to the fact that iWARP needs to register memory for
> > rdma reads as well? (and also rdma writes > 128k for chelsio HW right?)
> >
> 
> iWARP definitely needs to register memory for the target of reads, due to
> REMOTE_WRITE requirement for the protocol.  The source of a write doesn't need
> to register memory, but the SGE depth can cause multiple WRITE WRs to be
> required to service the IO.  And in theory there should be some threshold where
> it might be better performance-wise to do a memory register + 1 WRITE vs X
> WRITEs.    
> 
> As you mentioned, the RW API should account for this, but perhaps it is still
> off some.  Bharat, have a look into the RDMA-RW API and let us see if we can
> figure out if the additional SQ depth it adds is sufficient.
>  
> > What is the workload you are running? with immediatedata enabled you
> > should issue reg+rdma_read+send only for writes > 8k.
> > 
> > Does this happen when you run only reads for example?
> > 
> > I guess its time to get the sq accounting into shape...
> 
> So to sum up - 2 issues:
> 
> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
> overflows.  So the SQ sizing needs more investigation.
> 
> 2) if the SQ is full, then the iSER/target code is supposed to resubmit.  And
> apparently that isn't working.
> 

For #2, target-core expects -ENOMEM or -EAGAIN return from fabric driver
callbacks to signal internal queue-full retry logic.  Otherwise, the
extra se_cmd->cmd_kref response SCF_ACK_KREF is leaked until session
shutdown and/or reinstatement occurs.

AFAICT, Potnuri's earlier hung task with v4.8.y + ABORT_TASK is likely
the earlier v4.1+ regression:

https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e

That said, there is room for improvement in target-core queue-full error
signaling, and iscsi-target/iser-target callback error propagation.  

Sending out a series shortly to address these particular items.
Please have a look.



* RE: SQ overflow seen running isert traffic
  2016-10-31  3:40                 ` Nicholas A. Bellinger
@ 2016-11-02 17:03                   ` Steve Wise
       [not found]                   ` <1477885208.27946.8.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
  1 sibling, 0 replies; 24+ messages in thread
From: Steve Wise @ 2016-11-02 17:03 UTC (permalink / raw)
  To: 'Nicholas A. Bellinger'
  Cc: 'Sagi Grimberg', 'Potnuri Bharat Teja',
	target-devel, linux-rdma

> > So to sum up - 2 issues:
> >
> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
> > overflows.  So the SQ sizing needs more investigation.
> >
> > 2) if the SQ is full, then the iSER/target code is supposed to resubmit.  And
> > apparently that isn't working.
> >
> 
> For #2, target-core expects -ENOMEM or -EAGAIN return from fabric driver
> callbacks to signal internal queue-full retry logic.  Otherwise, the
> extra se_cmd->cmd_kref response SCF_ACK_KREF is leaked until session
> shutdown and/or reinstatement occurs.
> 
> AFAICT, Potunri's earlier hung task with v4.8.y + ABORT_TASK is likely
> the earlier v4.1+ regression:
> 
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afd
> b1e3e
> 
> That said, there is room for improvement in target-core queue-full error
> signaling, and iscsi-target/iser-target callback error propagation.
> 
> Sending out a series shortly to address these particular items.
> Please have a look.
> 

Hey Nicholas,  Bharat is out until next week.  He'll try this all out next week and report back.

Thanks!

Steve.


* Re: SQ overflow seen running isert traffic
       [not found]                   ` <1477885208.27946.8.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
@ 2016-11-08 10:06                     ` Potnuri Bharat Teja
  2017-03-20 10:15                       ` Potnuri Bharat Teja
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2016-11-08 10:06 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: SWise OGC, 'Sagi Grimberg',
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Monday, October 10/31/16, 2016 at 09:10:08 +0530, Nicholas A. Bellinger wrote:
> Hi Steve, Potnuri, & Co,
> 
> On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
> > > 
> > > > I tried out this change and it works fine with iwarp. I dont see SQ
> > > > overflow. Apparently we have increased the sq too big to overflow. I am
> > going
> > > > to let it run with higher workloads for longer time, to see if it holds
> > good.
> > > 
> > > Actually on second thought, this patch is an overkill. Effectively we
> > > now set:
> > > 
> > > MAX_CMD=266
> > > and max_rdma_ctx=128 so together we take 394 which seems to too much.
> > > 
> > > If we go by the scheme of 1 rdma + 1 send for each IO we need:
> > > - 128 sends
> > > - 128 rdmas
> > > - 10 miscs
> > > 
> > > so this gives 266.
> > > 
> > > Perhaps this is due to the fact that iWARP needs to register memory for
> > > rdma reads as well? (and also rdma writes > 128k for chelsio HW right?)
> > >
> > 
> > iWARP definitely needs to register memory for the target of reads, due to
> > REMOTE_WRITE requirement for the protocol.  The source of a write doesn't need
> > to register memory, but the SGE depth can cause multiple WRITE WRs to be
> > required to service the IO.  And in theory there should be some threshold where
> > it might be better performance-wise to do a memory register + 1 WRITE vs X
> > WRITEs.    
> > 
> > As you mentioned, the RW API should account for this, but perhaps it is still
> > off some.  Bharat, have a look into the RDMA-RW API and let us see if we can
> > figure out if the additional SQ depth it adds is sufficient.
> >  
> > > What is the workload you are running? with immediatedata enabled you
> > > should issue reg+rdma_read+send only for writes > 8k.
> > > 
> > > Does this happen when you run only reads for example?
> > > 
> > > I guess its time to get the sq accounting into shape...
> > 
> > So to sum up - 2 issues:
> > 
> > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're seeing SQ
> > overflows.  So the SQ sizing needs more investigation.
> > 
> > 2) if the SQ is full, then the iSER/target code is supposed to resubmit.  And
> > apparently that isn't working.
> > 
> 
> For #2, target-core expects -ENOMEM or -EAGAIN return from fabric driver
> callbacks to signal internal queue-full retry logic.  Otherwise, the
> extra se_cmd->cmd_kref response SCF_ACK_KREF is leaked until session
> shutdown and/or reinstatement occurs.
> 
> AFAICT, Potunri's earlier hung task with v4.8.y + ABORT_TASK is likely
> the earlier v4.1+ regression:
> 
> https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
> 
> That said, there is room for improvement in target-core queue-full error
> signaling, and iscsi-target/iser-target callback error propagation.  
> 
> Sending out a series shortly to address these particular items.
> Please have a look.
>
Thanks for the changes Nicholas.
Testing them now.
Thanks,
Bharat.


* Re: SQ overflow seen running isert traffic
  2016-11-08 10:06                     ` Potnuri Bharat Teja
@ 2017-03-20 10:15                       ` Potnuri Bharat Teja
  2017-03-21  6:32                         ` Nicholas A. Bellinger
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2017-03-20 10:15 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: SWise OGC, 'Sagi Grimberg', target-devel, linux-rdma

On Tuesday, November 11/08/16, 2016 at 15:36:18 +0530, Potnuri Bharat Teja wrote:
>    On Monday, October 10/31/16, 2016 at 09:10:08 +0530, Nicholas A. Bellinger
>    wrote:
Hi Nicholas,
Tested the following 3 patches for the SQ overflow:
https://www.spinics.net/lists/target-devel/msg13571.html
They, along with an additional change added on top of them, seem to hold good (aborts are
seen intermittently, as shown at the end of this mail).
I have been testing the changes using fio with IO sizes ranging from 1KB to 32MB and the
test is running fine (despite the few intermittent aborts).

Here is what I see with the 3 patches alone applied:

-> In isert_put_datain() and isert_post_response() a corresponding recv WR is posted before
posting a send, and hence for every send post failure a recv has already been posted into a
tightly packed RQ, causing it to overflow.

Below are the logs suggesting it:

[  189.944714] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
[  189.944718] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
[  189.944721] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
[  189.944724] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
[  189.944729] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
[  189.944731] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944731] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944732] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944733] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944734] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944735] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944736] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944736] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944737] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944738] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944739] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944740] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944740] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944741] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944746] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
[  189.944747] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  189.944748] isert: isert_put_datain: ib_post_recv failed with -12
[  189.944749] isert: isert_post_recv: ib_post_recv() failed with ret: -12
[  251.543847] ABORT_TASK: Found referenced iSCSI task_tag: 68
[  251.543858] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 68
[  251.544071] ABORT_TASK: Found referenced iSCSI task_tag: 46
[  251.544076] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 46
[  251.544300] ABORT_TASK: Found referenced iSCSI task_tag: 51
[  251.544307] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 51
[  251.544528] ABORT_TASK: Found referenced iSCSI task_tag: 117
[  251.544533] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 117
[  282.778402] Unable to locate ITT: 0x00000062 on CID: 0
[  282.778404] Unable to locate RefTaskTag: 0x00000062 on CID: 0.
[  282.778407] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
[  282.778436] ABORT_TASK: Found referenced iSCSI task_tag: 37
[  282.778439] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 37
[  282.778519] ABORT_TASK: Found referenced iSCSI task_tag: 65
[  282.778522] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 65
[  282.778559] Unable to locate ITT: 0x00000019 on CID: 0
[  282.778560] Unable to locate RefTaskTag: 0x00000019 on CID: 0.
[  282.778563] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
[  293.004632] ABORT_TASK: Found referenced iSCSI task_tag: 22
[  293.004643] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 22
[  293.004808] TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:37fc2b6a5331
[  310.265993] iSCSI Login timeout on Network Portal 10.0.2.2:3260
[  493.567143] INFO: task iscsi_np:3998 blocked for more than 120 seconds.
[  493.567149]       Tainted: G           OE   4.9.0-For-DSGL-Debug+ #1
[  493.567151] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  493.567153] iscsi_np        D    0  3998      2 0x00000084
[  493.567159]  0000000000000000 ffff883028a88b40 ffff88302693f7c0 ffff88302de3e300
[  493.567163]  ffff88303e057a40 ffffc900008abc00 ffffffff815e415b ffffc900008abca0
[  493.567166]  ffffffff81af4180 ffff883028a88b40 7fffffffffffffff ffff882f7426db68
[  493.567170] Call Trace:
[  493.567180]  [<ffffffff815e415b>] ? __schedule+0x23b/0x6c0
[  493.567184]  [<ffffffff815e4616>] schedule+0x36/0x80
[  493.567187]  [<ffffffff815e71c3>] schedule_timeout+0x1f3/0x390
[  493.567194]  [<ffffffff812c1f2e>] ? number+0x2ce/0x300
[  493.567197]  [<ffffffff812c42be>] ? vsnprintf+0x32e/0x4b0
[  493.567199]  [<ffffffff815e56a4>] wait_for_completion+0xb4/0xf0
[  493.567201]  [<ffffffff8108a6b0>] ? wake_up_q+0x80/0x80
[  493.567216]  [<ffffffffa05284d1>] iscsit_stop_session+0x1b1/0x1c0 [iscsi_target_mod]
[  493.567220]  [<ffffffffa0519db2>] iscsi_check_for_session_reinstatement+0x1e2/0x270 [iscsi_target_mod]
[  493.567225]  [<ffffffffa051c8c0>] iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod]
[  493.567230]  [<ffffffffa051c9ee>] iscsi_target_do_login+0x11e/0x610 [iscsi_target_mod]
[  493.567235]  [<ffffffffa051dbbe>] iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod]
[  493.567239]  [<ffffffffa051b933>] iscsi_target_login_thread+0xa73/0xfc0 [iscsi_target_mod]
[  493.567244]  [<ffffffffa051aec0>] ? iscsi_target_login_sess_out+0x250/0x250 [iscsi_target_mod]
[  493.567246]  [<ffffffff810660c4>] ? SyS_exit_group+0x14/0x20
[  493.567249]  [<ffffffff8107ee5a>] kthread+0xca/0xe0
[  493.567250]  [<ffffffff8107ed90>] ? kthread_park+0x60/0x60
[  493.567252]  [<ffffffff815e8752>] ret_from_fork+0x22/0x30
[  493.567254] INFO: task iscsi_trx:4149 blocked for more than 120 seconds.
[  493.567255]       Tainted: G           OE   4.9.0-For-DSGL-Debug+ #1
[  493.567256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  493.567256] iscsi_trx       D    0  4149      2 0x00000084
[  493.567258]  0000000000000000 ffff882f5ef70340 ffff882f76253040 ffff88302de40340
[  493.567259]  ffff88303e097a40 ffffc90000f93c90 ffffffff815e415b ffffffff815e56c2
[  493.567261]  0000000000000001 ffff882f5ef70340 7fffffffffffffff ffff882d233cc9b0
[  493.567262] Call Trace:
[  493.567264]  [<ffffffff815e415b>] ? __schedule+0x23b/0x6c0
[  493.567265]  [<ffffffff815e56c2>] ? wait_for_completion+0xd2/0xf0
[  493.567267]  [<ffffffff815e4616>] schedule+0x36/0x80
[  493.567268]  [<ffffffff815e71c3>] schedule_timeout+0x1f3/0x390
[  493.567281]  [<ffffffffa118e910>] ? ib_sg_to_pages+0x1a0/0x1a0 [ib_core]
[  493.567285]  [<ffffffffa118e910>] ? ib_sg_to_pages+0x1a0/0x1a0 [ib_core]
[  493.567286]  [<ffffffff815e56a4>] wait_for_completion+0xb4/0xf0
[  493.567287]  [<ffffffff8108a6b0>] ? wake_up_q+0x80/0x80
[  493.567302]  [<ffffffffa04d5399>] target_wait_for_sess_cmds+0x49/0x1a0 [target_core_mod]
[  493.567306]  [<ffffffffa0471644>] isert_wait_conn+0x1a4/0x2d0 [ib_isert]
[  493.567311]  [<ffffffffa05289ab>] iscsit_close_connection+0x14b/0x820 [iscsi_target_mod]
[  493.567315]  [<ffffffffa05172ef>] iscsit_take_action_for_connection_exit+0x7f/0x100 [iscsi_target_mod]
[  493.567320]  [<ffffffffa0528265>] iscsi_target_rx_thread+0x95/0xa0 [iscsi_target_mod]
[  493.567324]  [<ffffffffa05281d0>] ? iscsi_target_tx_thread+0x1b0/0x1b0 [iscsi_target_mod]
[  493.567325]  [<ffffffff8107ee5a>] kthread+0xca/0xe0
[  493.567327]  [<ffffffff8107ed90>] ? kthread_park+0x60/0x60
[  493.567328]  [<ffffffff815e8752>] ret_from_fork+0x22/0x30

-> In order to address this I tried the following (experimental) change on top of your 3 patches:

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 501a0ad124a7..0aead5faa2a2 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -1817,16 +1817,16 @@ isert_post_response(struct isert_conn *isert_conn, struct isert_cmd *isert_cmd)
        struct ib_send_wr *wr_failed;
        int ret;

-       ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
+       ret = ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
+                          &wr_failed);
        if (ret) {
-               isert_err("ib_post_recv failed with %d\n", ret);
+               isert_err("ib_post_send failed with %d\n", ret);
                return ret;
        }

-       ret = ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
-                          &wr_failed);
+       ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
        if (ret) {
-               isert_err("ib_post_send failed with %d\n", ret);
+               isert_err("ib_post_recv failed with %d\n", ret);
                return ret;
        }
        return ret;
@@ -2166,6 +2166,13 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
        if (isert_prot_cmd(isert_conn, se_cmd)) {
                isert_cmd->tx_desc.tx_cqe.done = isert_rdma_write_done;
                cqe = &isert_cmd->tx_desc.tx_cqe;
+
+               rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
+               if (rc) {
+                       isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ rc: %d\n",
+                                  isert_cmd, rc);
+                       return rc;
+               }
        } else {
                /*
                 * Build isert_conn->tx_desc for iSCSI response PDU and attach
@@ -2178,18 +2185,21 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
                isert_init_send_wr(isert_conn, isert_cmd,
                                   &isert_cmd->tx_desc.send_wr);

+               chain_wr = &isert_cmd->tx_desc.send_wr;
+
+               rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
+               if (rc) {
+                       isert_err("Cmd: %p posted RDMA_WRITE for iSER Data READ rc: %d\n",
+                                  isert_cmd, rc);
+                       return rc;
+               }
+
                rc = isert_post_recv(isert_conn, isert_cmd->rx_desc);
                if (rc) {
                        isert_err("ib_post_recv failed with %d\n", rc);
                        return rc;
                }
-
-               chain_wr = &isert_cmd->tx_desc.send_wr;
        }
-       rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
-       isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ rc: %d\n",
-                  isert_cmd, rc);
-
        return rc;
 }

-> With these changes it works fine, but I still see a few aborts intermittently, as follows:

[10674.434643] isert: isert_rdma_rw_ctx_post: Cmd: ffff882cbb961138 failed to post RDMA res
[10674.434690] isert: isert_rdma_rw_ctx_post: Cmd: ffff882cbb9434c0 failed to post RDMA res
[10715.186372] ABORT_TASK: Found referenced iSCSI task_tag: 58
[10715.186417] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 58

-> Apart from the aborts and post-failure logs filling dmesg, the LUNs are in state "running" and
fio continues to run.
[root@bharat ~]# iscsiadm -m session -P3 | grep "Attached scsi disk"
                        Attached scsi disk sdj          State: running
                        Attached scsi disk sdc          State: running
                        Attached scsi disk sdd          State: running
                        Attached scsi disk sde          State: running
                        Attached scsi disk sdf          State: running
                        Attached scsi disk sdg          State: running
                        Attached scsi disk sdh          State: running
                        Attached scsi disk sdi          State: running

Thanks,
Bharat.
>    > Hi Steve, Potnuri, & Co,
>    >
>    > On Tue, 2016-10-18 at 09:34 -0500, Steve Wise wrote:
>    > > >
>    > > > > I tried out this change and it works fine with iwarp. I dont see
>    SQ
>    > > > > overflow. Apparently we have increased the sq too big to overflow.
>    I am
>    > > going
>    > > > > to let it run with higher workloads for longer time, to see if it
>    holds
>    > > good.
>    > > >
>    > > > Actually on second thought, this patch is an overkill. Effectively
>    we
>    > > > now set:
>    > > >
>    > > > MAX_CMD=266
>    > > > and max_rdma_ctx=128 so together we take 394 which seems to too
>    much.
>    > > >
>    > > > If we go by the scheme of 1 rdma + 1 send for each IO we need:
>    > > > - 128 sends
>    > > > - 128 rdmas
>    > > > - 10 miscs
>    > > >
>    > > > so this gives 266.
>    > > >
>    > > > Perhaps this is due to the fact that iWARP needs to register memory
>    for
>    > > > rdma reads as well? (and also rdma writes > 128k for chelsio HW
>    right?)
>    > > >
>    > >
>    > > iWARP definitely needs to register memory for the target of reads, due
>    to
>    > > REMOTE_WRITE requirement for the protocol.  The source of a write
>    doesn't need
>    > > to register memory, but the SGE depth can cause multiple WRITE WRs to
>    be
>    > > required to service the IO.  And in theory there should be some
>    threshold where
>    > > it might be better performance-wise to do a memory register + 1 WRITE
>    vs X
>    > > WRITEs.   
>    > >
>    > > As you mentioned, the RW API should account for this, but perhaps it
>    is still
>    > > off some.  Bharat, have a look into the RDMA-RW API and let us see if
>    we can
>    > > figure out if the additional SQ depth it adds is sufficient.
>    > > 
>    > > > What is the workload you are running? with immediatedata enabled you
>    > > > should issue reg+rdma_read+send only for writes > 8k.
>    > > >
>    > > > Does this happen when you run only reads for example?
>    > > >
>    > > > I guess its time to get the sq accounting into shape...
>    > >
>    > > So to sum up - 2 issues:
>    > >
>    > > 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
>    seeing SQ
>    > > overflows.  So the SQ sizing needs more investigation.
>    > >
>    > > 2) if the SQ is full, then the iSER/target code is supposed to
>    resubmit.  And
>    > > apparently that isn't working.
>    > >
>    >
>    > For #2, target-core expects -ENOMEM or -EAGAIN return from fabric driver
>    > callbacks to signal internal queue-full retry logic.  Otherwise, the
>    > extra se_cmd->cmd_kref response SCF_ACK_KREF is leaked until session
>    > shutdown and/or reinstatement occurs.
>    >
>    > AFAICT, Potunri's earlier hung task with v4.8.y + ABORT_TASK is likely
>    > the earlier v4.1+ regression:
>    >
>    >
>    [1]https://github.com/torvalds/linux/commit/527268df31e57cf2b6d417198717c6d6afdb1e3e
>    >
>    > That said, there is room for improvement in target-core queue-full error
>    > signaling, and iscsi-target/iser-target callback error propagation. 
>    >
>    > Sending out a series shortly to address these particular items.
>    > Please have a look.
>    >
>    Thanks for the changes Nicholas.
>    Testing them now.

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: SQ overflow seen running isert traffic
       [not found]                     ` <f7a4b395-1786-3c7a-7639-195e830db5ad-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-03-20 13:05                       ` Potnuri Bharat Teja
  2017-03-20 15:04                         ` Steve Wise
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2017-03-20 13:05 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, SWise OGC, target-devel-u79uwXL29TY76Z2rM5mHXA,
	nab-IzHhD5pYlfBP7FQvKIMDCQ, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thursday, October 10/20/16, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
>    Hey Jason,
> 
>    >> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
>    seeing SQ
>    >> overflows.  So the SQ sizing needs more investigation.
>    >
>    > NFS had this sort of problem - in that case it was because the code
>    > was assuming that a RQ completion implied SQ space - that is not
>    > legal, only direct completions from SQ WCs can guide available space
>    > in the SQ..
> 
>    Its not the same problem. iser-target does not tie SQ and RQ spaces.
>    The origin here is the difference between IB/RoCE and iWARP and the
>    chelsio HW that makes it hard to predict the SQ correct size.
> 
>    iWARP needs extra registration for rdma reads and the chelsio device
>    seems to be limited in the number of pages per registration so this
>    configuration will need a larger send queue.
> 
>    Another problem is that we don't have a correct retry flow.
> 
>    Hopefully we can address that in the RW API which is designed to hide
>    these details from the ULP...
Hi Sagi,
Here is what our further analysis of SQ dump at the time of overflow says:

The RDMA read/write API is creating long chains (32 WRs) to handle large iSCSI 
READs. For writing the iSCSI default block size of 512KB of data, iw_cxgb4 
advertises a max of 4 SGEs per WRITE WR (4 pages ~ 16KB), so a single IO needs 
a chain of 32 WRITE WRs (another possible factor is that they are all 
unsignalled WRs and are completed only after the next signalled WR). But 
apparently rdma_rw_init_qp() assumes that any given IO will take only 1 WRITE 
WR to convey the data.

This is evidently incorrect: rdma_rw_init_qp() needs to size the queue based 
on the device's max_sge for read and write and on the sg_tablesize the RDMA 
read/write API is used with, e.g. ISCSI_ISER_MAX_SG_TABLESIZE on the initiator 
side. If the above analysis is correct, please suggest how this could be fixed.
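
To make the suggestion concrete, below is a rough, untested sketch of the kind 
of sizing I have in mind (the rdma_rw_wrs_per_ctx() helper is invented, and how 
rw.c would learn the ULP's sg_tablesize is an open question; 
ISCSI_ISER_MAX_SG_TABLESIZE is used only as an example value):

/*
 * Sketch only: account for the worst case where one IO is fragmented into
 * several WRITE WRs because of a small device max_sge.
 */
static u32 rdma_rw_wrs_per_ctx(struct ib_device *dev)
{
	u32 max_sge = dev->attrs.max_sge;

	/* e.g. 512KB IO = 128 pages, max_sge = 4  ->  32 WRITE WRs */
	return DIV_ROUND_UP(ISCSI_ISER_MAX_SG_TABLESIZE, max_sge);
}

void rdma_rw_init_qp(struct ib_device *dev, struct ib_qp_init_attr *attr)
{
	u32 factor = rdma_rw_wrs_per_ctx(dev);	/* today this is 1 WR per ctx */

	if (rdma_rw_can_use_mr(dev, attr->port_num))
		factor += 2;	/* REG_MR + LOCAL_INV */

	attr->cap.max_send_wr += factor * attr->cap.max_rdma_ctxs;
	attr->cap.max_send_wr = min_t(u32, attr->cap.max_send_wr,
				      dev->attrs.max_qp_wr);
}
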

Further, forcing MRs for RDMA WRITEs via the rdma_rw_force_mr=1 module 
parameter of ib_core avoids the SQ overflow: a single REG_MR is registered and 
that MR is used for a single WRITE WR. So an rdma-rw IO chain of, say, 32 
WRITE WRs becomes just 3 WRs: REG_MR + WRITE + INV_MR, since 
max_fast_reg_page_list_len of iw_cxgb4 is 128 pages (512KB is exactly 
128 x 4KB pages, so one registration covers the whole IO).

(By default force_mr is not set, and per rdma_rw_io_needs_mr() iw_cxgb4 then 
uses an MR only for RDMA READs.)
From this, is there any possibility that we could use an MR when the WRITE WR 
chain exceeds a certain length?

Thanks for your time!

-Bharat.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: SQ overflow seen running isert traffic
  2017-03-20 13:05                       ` Potnuri Bharat Teja
@ 2017-03-20 15:04                         ` Steve Wise
  0 siblings, 0 replies; 24+ messages in thread
From: Steve Wise @ 2017-03-20 15:04 UTC (permalink / raw)
  To: 'Potnuri Bharat Teja', 'Sagi Grimberg'
  Cc: 'Jason Gunthorpe',
	target-devel, nab, linux-rdma, Christoph Hellwig

+Christoph

> 
> On Thursday, October 10/20/16, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> >    Hey Jason,
> >
> >    >> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> >    seeing SQ
> >    >> overflows.  So the SQ sizing needs more investigation.
> >    >
> >    > NFS had this sort of problem - in that case it was because the code
> >    > was assuming that a RQ completion implied SQ space - that is not
> >    > legal, only direct completions from SQ WCs can guide available space
> >    > in the SQ..
> >
> >    Its not the same problem. iser-target does not tie SQ and RQ spaces.
> >    The origin here is the difference between IB/RoCE and iWARP and the
> >    chelsio HW that makes it hard to predict the SQ correct size.
> >
> >    iWARP needs extra registration for rdma reads and the chelsio device
> >    seems to be limited in the number of pages per registration so this
> >    configuration will need a larger send queue.
> >
> >    Another problem is that we don't have a correct retry flow.
> >
> >    Hopefully we can address that in the RW API which is designed to hide
> >    these details from the ULP...
> Hi Sagi,
> Here is what our further analysis of SQ dump at the time of overflow says:
> 
> RDMA read/write API is creating long chains (32 WRs) to handle large ISCSI
> READs. For Writing iscsi default block size of 512KB data, iw_cxgb4's max
> number of sge advertised is 4 page ~ 16KB for write, needs WR chain of 32 WRs
> (another possible factor is they all are unsignalled WRs and are completed
> only after next signalled WR) But apparantly rdma_rw_init_qp() assumes that
> any given IO will take only 1 WRITE WR to convey the data.
> 
> This evidently is incorrect and rdma_rw_init_qp() needs to factor and size
> the queue based on max_sge of device for write and read and the sg_tablesize
> for which rdma read/write is used for, like ISCSI_ISER_MAX_SG_TABLESIZE of
> initiator. If above analysis is correct, please suggest how could this be
fixed?
> 
> Further, using MRs for rdma WRITE by using rdma_wr_force_mr = 1 module
> parameter of ib_core avoids SQ overflow by registering a single REG_MR and
> using that MR for a single WRITE WR. So a rdma-rw IO chain of say 32 WRITE
> WRs, becomes just 3 WRS:  REG_MR + WRITE + INV_MR as
> max_fast_reg_page_list_len of iw_cxgb4 is 128 page.
> 
> (By default force_mr is not set and iw_cxgb4 could only use MR for rdma
> READs only as per rdma_rw_io_needs_mr() if force_mr isnt set)
> >From this is there any possibility that we could use MR if the write WR
> chain exceeds a certain number?
> 
> Thanks for your time!
> 

I think it is time to resolve this XXX comment in rw.c for
rdma_rw_io_needs_mr():

/*
 * Check if the device will use memory registration for this RW operation.
 * We currently always use memory registrations for iWarp RDMA READs, and
 * have a debug option to force usage of MRs.
 * XXX: In the future we can hopefully fine tune this based on HCA driver
 * input.
 */

Regardless of whether the HCA driver provides input, I think a 30+ WR RDMA 
WRITE chain isn't as efficient as 1 REG_MR + 1 WRITE + 1 INV_MR.  Is it 
unreasonable to just add some threshold in rw.c?  Also, I think 
rdma_rw_init_qp() does need some tweaks: it needs to take into account the max 
SGE depth, the max REG_MR depth, and the max SQ depth device 
attributes/capabilities when sizing the SQ.  However, if that computed depth 
exceeds the device max, then the SQ will not be big enough to avoid potential 
overflows, and I believe ULPs should _always_ flow control their outgoing WRs 
based on the SQ depth regardless.  And perhaps rdma-rw should even avoid 
overly deep SQs, just because that tends to inhibit scalability, e.g. allowing 
lots of shallow QPs vs. consuming all the device resources with very deep 
QPs... 
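
To make the threshold idea concrete, here's a rough, untested sketch of what I 
mean for rdma_rw_io_needs_mr() (the threshold of 8 is arbitrary and would need 
real measurements before anything like this could be proposed):

static inline bool rdma_rw_io_needs_mr(struct ib_device *dev, u8 port_num,
		enum dma_data_direction dir, int dma_nents)
{
	if (rdma_protocol_iwarp(dev, port_num) && dir == DMA_FROM_DEVICE)
		return true;
	if (unlikely(rdma_rw_force_mr))
		return true;
	/* hypothetical: long WRITE/READ chains fall back to one REG_MR */
	if (dev->attrs.max_sge &&
	    DIV_ROUND_UP(dma_nents, dev->attrs.max_sge) > 8)
		return true;
	return false;
}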

Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: SQ overflow seen running isert traffic
  2017-03-20 10:15                       ` Potnuri Bharat Teja
@ 2017-03-21  6:32                         ` Nicholas A. Bellinger
  2017-03-21  7:51                           ` Potnuri Bharat Teja
  0 siblings, 1 reply; 24+ messages in thread
From: Nicholas A. Bellinger @ 2017-03-21  6:32 UTC (permalink / raw)
  To: Potnuri Bharat Teja
  Cc: SWise OGC, 'Sagi Grimberg', target-devel, linux-rdma

Hi Bharat & Co,

On Mon, 2017-03-20 at 15:45 +0530, Potnuri Bharat Teja wrote:
> On Tuesday, November 11/08/16, 2016 at 15:36:18 +0530, Potnuri Bharat Teja wrote:
> >    On Monday, October 10/31/16, 2016 at 09:10:08 +0530, Nicholas A. Bellinger
> >    wrote:
> Hi Nicholas,
> Tested the following 3 patches for SQ overflow.
> https://www.spinics.net/lists/target-devel/msg13571.html

Thanks for following up on this.

I was just thinking about sending these patches out again, as separate
from iser-target bits in patch #3 (and your incremental below) they do
address a couple of long-standing target-core issues wrt queue-full
handling.

> They along with an additional change added on top of those seems to hold good.( Aborts seen 
> intermittently as shown at the end of mail)
> I have been testing the changes using fio with IO sizes ranging from 1KB to 32MB and test is 
> running fine (despite few Intermittent Aborts).
> 

At what I/O size do the post RDMA_WRITE failures for iw_cxgb4 and
associated TMR ABORT_TASK begin..?

> Here is what I see with the 3 patches alone applied:
> 
> -> In isert_put_datain() and isert_post_response() a corresponding recv WR is posted before
> posting a send and hence for every post failure a recv is already posted into a tightly packed
> RQ, causing it to overflow.

Just for me to understand, the intermittent TMR ABORT_TASKs are caused
by the repeated failure to post RDMA_WRITE WRs for a large ISER Data-In
payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
underlying hardware capabilities.

Moving the recv posts after the send post for RDMA_WRITEs helps to
reduce the severity of the issue with iw_cxgb4, but doesn't completely
eliminate the issue under load. 

> 
> Below are the logs suggesting it:
> 
> [  189.944714] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
> [  189.944718] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
> [  189.944721] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
> [  189.944724] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
> [  189.944729] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
> [  189.944731] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944731] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944732] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944733] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944734] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944735] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944736] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944736] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944737] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944738] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944739] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944740] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944740] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944741] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944746] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0 failed to post RDMA res
> [  189.944747] isert: isert_post_recv: ib_post_recv() failed with ret: -12
> [  189.944748] isert: isert_put_datain: ib_post_recv failed with -12
> [  189.944749] isert: isert_post_recv: ib_post_recv() failed with ret: -12

Mmmm, why does the retry of isert_put_datain() stop here..?

Does this delay between the last isert_put_datain() retry and subsequent
TMR ABORT_TASKs happen with your incremental patch applied too..?

> [  251.543847] ABORT_TASK: Found referenced iSCSI task_tag: 68
> [  251.543858] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 68
> [  251.544071] ABORT_TASK: Found referenced iSCSI task_tag: 46
> [  251.544076] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 46
> [  251.544300] ABORT_TASK: Found referenced iSCSI task_tag: 51
> [  251.544307] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 51
> [  251.544528] ABORT_TASK: Found referenced iSCSI task_tag: 117
> [  251.544533] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 117
> [  282.778402] Unable to locate ITT: 0x00000062 on CID: 0
> [  282.778404] Unable to locate RefTaskTag: 0x00000062 on CID: 0.
> [  282.778407] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
> [  282.778436] ABORT_TASK: Found referenced iSCSI task_tag: 37
> [  282.778439] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 37
> [  282.778519] ABORT_TASK: Found referenced iSCSI task_tag: 65
> [  282.778522] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 65
> [  282.778559] Unable to locate ITT: 0x00000019 on CID: 0
> [  282.778560] Unable to locate RefTaskTag: 0x00000019 on CID: 0.
> [  282.778563] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 0
> [  293.004632] ABORT_TASK: Found referenced iSCSI task_tag: 22
> [  293.004643] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 22
> [  293.004808] TMR Opcode TARGET_WARM_RESET authorization failed for Initiator Node: iqn.1994-05.com.redhat:37fc2b6a5331
> [  310.265993] iSCSI Login timeout on Network Portal 10.0.2.2:3260
> [  493.567143] INFO: task iscsi_np:3998 blocked for more than 120 seconds.
> [  493.567149]       Tainted: G           OE   4.9.0-For-DSGL-Debug+ #1
> [  493.567151] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  493.567153] iscsi_np        D    0  3998      2 0x00000084
> [  493.567159]  0000000000000000 ffff883028a88b40 ffff88302693f7c0 ffff88302de3e300
> [  493.567163]  ffff88303e057a40 ffffc900008abc00 ffffffff815e415b ffffc900008abca0
> [  493.567166]  ffffffff81af4180 ffff883028a88b40 7fffffffffffffff ffff882f7426db68
> [  493.567170] Call Trace:
> [  493.567180]  [<ffffffff815e415b>] ? __schedule+0x23b/0x6c0
> [  493.567184]  [<ffffffff815e4616>] schedule+0x36/0x80
> [  493.567187]  [<ffffffff815e71c3>] schedule_timeout+0x1f3/0x390
> [  493.567194]  [<ffffffff812c1f2e>] ? number+0x2ce/0x300
> [  493.567197]  [<ffffffff812c42be>] ? vsnprintf+0x32e/0x4b0
> [  493.567199]  [<ffffffff815e56a4>] wait_for_completion+0xb4/0xf0
> [  493.567201]  [<ffffffff8108a6b0>] ? wake_up_q+0x80/0x80
> [  493.567216]  [<ffffffffa05284d1>] iscsit_stop_session+0x1b1/0x1c0 [iscsi_target_mod]
> [  493.567220]  [<ffffffffa0519db2>] iscsi_check_for_session_reinstatement+0x1e2/0x270 [iscsi_target_mod]
> [  493.567225]  [<ffffffffa051c8c0>] iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod]
> [  493.567230]  [<ffffffffa051c9ee>] iscsi_target_do_login+0x11e/0x610 [iscsi_target_mod]
> [  493.567235]  [<ffffffffa051dbbe>] iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod]
> [  493.567239]  [<ffffffffa051b933>] iscsi_target_login_thread+0xa73/0xfc0 [iscsi_target_mod]
> [  493.567244]  [<ffffffffa051aec0>] ? iscsi_target_login_sess_out+0x250/0x250 [iscsi_target_mod]
> [  493.567246]  [<ffffffff810660c4>] ? SyS_exit_group+0x14/0x20
> [  493.567249]  [<ffffffff8107ee5a>] kthread+0xca/0xe0
> [  493.567250]  [<ffffffff8107ed90>] ? kthread_park+0x60/0x60
> [  493.567252]  [<ffffffff815e8752>] ret_from_fork+0x22/0x30
> [  493.567254] INFO: task iscsi_trx:4149 blocked for more than 120 seconds.
> [  493.567255]       Tainted: G           OE   4.9.0-For-DSGL-Debug+ #1
> [  493.567256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  493.567256] iscsi_trx       D    0  4149      2 0x00000084
> [  493.567258]  0000000000000000 ffff882f5ef70340 ffff882f76253040 ffff88302de40340
> [  493.567259]  ffff88303e097a40 ffffc90000f93c90 ffffffff815e415b ffffffff815e56c2
> [  493.567261]  0000000000000001 ffff882f5ef70340 7fffffffffffffff ffff882d233cc9b0
> [  493.567262] Call Trace:
> [  493.567264]  [<ffffffff815e415b>] ? __schedule+0x23b/0x6c0
> [  493.567265]  [<ffffffff815e56c2>] ? wait_for_completion+0xd2/0xf0
> [  493.567267]  [<ffffffff815e4616>] schedule+0x36/0x80
> [  493.567268]  [<ffffffff815e71c3>] schedule_timeout+0x1f3/0x390
> [  493.567281]  [<ffffffffa118e910>] ? ib_sg_to_pages+0x1a0/0x1a0 [ib_core]
> [  493.567285]  [<ffffffffa118e910>] ? ib_sg_to_pages+0x1a0/0x1a0 [ib_core]
> [  493.567286]  [<ffffffff815e56a4>] wait_for_completion+0xb4/0xf0
> [  493.567287]  [<ffffffff8108a6b0>] ? wake_up_q+0x80/0x80
> [  493.567302]  [<ffffffffa04d5399>] target_wait_for_sess_cmds+0x49/0x1a0 [target_core_mod]
> [  493.567306]  [<ffffffffa0471644>] isert_wait_conn+0x1a4/0x2d0 [ib_isert]
> [  493.567311]  [<ffffffffa05289ab>] iscsit_close_connection+0x14b/0x820 [iscsi_target_mod]
> [  493.567315]  [<ffffffffa05172ef>] iscsit_take_action_for_connection_exit+0x7f/0x100 [iscsi_target_mod]
> [  493.567320]  [<ffffffffa0528265>] iscsi_target_rx_thread+0x95/0xa0 [iscsi_target_mod]
> [  493.567324]  [<ffffffffa05281d0>] ? iscsi_target_tx_thread+0x1b0/0x1b0 [iscsi_target_mod]
> [  493.567325]  [<ffffffff8107ee5a>] kthread+0xca/0xe0
> [  493.567327]  [<ffffffff8107ed90>] ? kthread_park+0x60/0x60
> [  493.567328]  [<ffffffff815e8752>] ret_from_fork+0x22/0x30
> 

OK, AFAICT these hung tasks basically mean iser-target never gave
target-core back the outstanding se_cmd it was attempting to retry, even
up to the point in which session reinstatement was eventually triggered
by iser-initiator.

So the reason your patch to swap the post_recv -> post_send order to post_send
-> post_recv makes a difference is that it allows enough of a trickle of
RDMA_WRITEs to make it through that iser-initiator doesn't attempt to
escalate recovery and doesn't attempt session reinstatement.

Have you been able to confirm with your incremental patch below that
once TMR ABORT_TASKs start to occur due to isert_rdma_rw_ctx_post
failures, that any subsequent session reinstatement events for unrelated
reasons (or explicit logout..?) doesn't run into the same issue..?

> -> In order to address this I tried the following (experimental) change on top of your 3 patches:
> 
> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
> index 501a0ad124a7..0aead5faa2a2 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.c
> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> @@ -1817,16 +1817,16 @@ isert_post_response(struct isert_conn *isert_conn, struct isert_cmd *isert_cmd)
>         struct ib_send_wr *wr_failed;
>         int ret;
> 
> -       ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
> +       ret = ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
> +                          &wr_failed);
>         if (ret) {
> -               isert_err("ib_post_recv failed with %d\n", ret);
> +               isert_err("ib_post_send failed with %d\n", ret);
>                 return ret;
>         }
> 
> -       ret = ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
> -                          &wr_failed);
> +       ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
>         if (ret) {
> -               isert_err("ib_post_send failed with %d\n", ret);
> +               isert_err("ib_post_recv failed with %d\n", ret);
>                 return ret;
>         }
>         return ret;
> @@ -2166,6 +2166,13 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>         if (isert_prot_cmd(isert_conn, se_cmd)) {
>                 isert_cmd->tx_desc.tx_cqe.done = isert_rdma_write_done;
>                 cqe = &isert_cmd->tx_desc.tx_cqe;
> +
> +               rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
> +               if (rc) {
> +                       isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ rc: %d\n",
> +                                  isert_cmd, rc);
> +                       return rc;
> +               }
>         } else {
>                 /*
>                  * Build isert_conn->tx_desc for iSCSI response PDU and attach
> @@ -2178,18 +2185,21 @@ isert_put_datain(struct iscsi_conn *conn, struct iscsi_cmd *cmd)
>                 isert_init_send_wr(isert_conn, isert_cmd,
>                                    &isert_cmd->tx_desc.send_wr);
> 
> +               chain_wr = &isert_cmd->tx_desc.send_wr;
> +
> +               rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
> +               if (rc) {
> +                       isert_err("Cmd: %p posted RDMA_WRITE for iSER Data READ rc: %d\n",
> +                                  isert_cmd, rc);
> +                       return rc;
> +               }
> +
>                 rc = isert_post_recv(isert_conn, isert_cmd->rx_desc);
>                 if (rc) {
>                         isert_err("ib_post_recv failed with %d\n", rc);
>                         return rc;
>                 }
> -
> -               chain_wr = &isert_cmd->tx_desc.send_wr;
>         }
> -       rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe, chain_wr);
> -       isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ rc: %d\n",
> -                  isert_cmd, rc);
> -
>         return rc;
>  }
> 

Unsure of the implications for other RNICs + iser-target here..

I'll have to defer to Sagi judgment on this one.  ;)

> -> With these changes it works fine, But I still see few Aborts intermittently as follows:
> 
> [10674.434643] isert: isert_rdma_rw_ctx_post: Cmd: ffff882cbb961138 failed to post RDMA res
> [10674.434690] isert: isert_rdma_rw_ctx_post: Cmd: ffff882cbb9434c0 failed to post RDMA res
> [10715.186372] ABORT_TASK: Found referenced iSCSI task_tag: 58
> [10715.186417] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 58
> 

OK, so there is still a delay between the isert_rdma_rw_ctx_post
failures, and receiving TMR ABORT_TASKs.

Is target-core -> iser-target aggressively retrying the posting of
RDMA_WRITEs via isert_put_datain() during this entire time..?

> -> Apart from the Aborts and post failure logs filling dmesg, luns are in state "running" and 
> fio countinues to run.
> [root@bharat ~]# iscsiadm -m session -P3 | grep "Attached scsi disk"
>                         Attached scsi disk sdj          State: running
>                         Attached scsi disk sdc          State: running
>                         Attached scsi disk sdd          State: running
>                         Attached scsi disk sde          State: running
>                         Attached scsi disk sdf          State: running
>                         Attached scsi disk sdg          State: running
>                         Attached scsi disk sdh          State: running
>                         Attached scsi disk sdi          State: running
> 

Ok, so it looks like you've got the second order issues under control
for target-core + iser-target, once the first order issue of RDMA R/W
API vs. iw_cxgb4 WR resource mis-sizing continuously triggers failed
calls to isert_put_datain() -> isert_rdma_rw_ctx_post() ->
rdma_rw_ctx_post().

That said, I'm going to go ahead and queue patch #1 and #2 from the
original series to target-pending/for-next as they do address issues
separate from iser-target scenario here, and await on Sagi's feedback
for patch #3 and your incremental above.

Thanks again.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: SQ overflow seen running isert traffic
  2017-03-21  6:32                         ` Nicholas A. Bellinger
@ 2017-03-21  7:51                           ` Potnuri Bharat Teja
       [not found]                             ` <20170321075131.GA11565-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2017-03-21  7:51 UTC (permalink / raw)
  To: Nicholas A. Bellinger
  Cc: SWise OGC, 'Sagi Grimberg', target-devel, linux-rdma

On Tuesday, March 03/21/17, 2017 at 12:02:20 +0530, Nicholas A. Bellinger wrote:
>    Hi Bharat & Co,
> 
>    On Mon, 2017-03-20 at 15:45 +0530, Potnuri Bharat Teja wrote:
>    > On Tuesday, November 11/08/16, 2016 at 15:36:18 +0530, Potnuri Bharat
>    Teja wrote:
>    > >    On Monday, October 10/31/16, 2016 at 09:10:08 +0530, Nicholas A.
>    Bellinger
>    > >    wrote:
>    > Hi Nicholas,
>    > Tested the following 3 patches for SQ overflow.
>    > [1]https://www.spinics.net/lists/target-devel/msg13571.html
> 
>    Thanks for following up on this.
> 
>    I was just thinking about sending these patches out again, as separate
>    from iser-target bits in patch #3 (and your incremental below) they do
>    address a couple of long-standing target-core issues wrt queue-full
>    handling.
> 
>    > They along with an additional change added on top of those seems to hold
>    good.( Aborts seen
>    > intermittently as shown at the end of mail)
>    > I have been testing the changes using fio with IO sizes ranging from 1KB
>    to 32MB and test is
>    > running fine (despite few Intermittent Aborts).
>    >
> 
>    At what I/O size do the post RDMA_WRITE failures for iw_cxgb4 and
>    associated TMR ABORT_TASK begin..?
Hi Nicholas,
I see them from 2MB onwards.
> 
>    > Here is what I see with the 3 patches alone applied:
>    >
>    > -> In isert_put_datain() and isert_post_response() a corresponding recv
>    WR is posted before
>    > posting a send and hence for every post failure a recv is already posted
>    into a tightly packed
>    > RQ, causing it to overflow.
> 
>    Just for me to understand, the intermittent TMR ABORT_TASKs are caused
>    by the repeated failure to post RDMA_WRITE WRs for a large ISER Data-In
>    payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
>    underlying hardware capabilities.
Yes.
> 
>    Moving the recv posts after the send post for RDMA_WRITEs helps to
>    reduce the severity of the issue with iw_cxgb4, but doesn't completely
>    eliminate the issue under load.
Moving the recv posts only comes into effect along with your changes.
> 
>    >
>    > Below are the logs suggesting it:
>    >
>    > [  189.944714] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0
>    failed to post RDMA res
>    > [  189.944718] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0
>    failed to post RDMA res
>    > [  189.944721] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0
>    failed to post RDMA res
>    > [  189.944724] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0
>    failed to post RDMA res
>    > [  189.944729] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0
>    failed to post RDMA res
>    > [  189.944731] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944731] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944732] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944733] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944734] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944735] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944736] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944736] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944737] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944738] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944739] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944740] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944740] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944741] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944746] isert: isert_rdma_rw_ctx_post: Cmd: ffff882d233e83d0
>    failed to post RDMA res
>    > [  189.944747] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
>    > [  189.944748] isert: isert_put_datain: ib_post_recv failed with -12
>    > [  189.944749] isert: isert_post_recv: ib_post_recv() failed with ret:
>    -12
> 
>    Mmmm, why does the retry of isert_put_datain() stop here..?
It is because the isert command keeps looping: the post fails with either an 
SQ or an RQ failure and the repost is retried. Eventually the command times 
out on the initiator, which then aborts it.
>
>    Does this delay between the last isert_put_datain() retry and subsequent
>    TMR ABORT_TASKs happen with your incremental patch applied too..?
With my changes applied, I see a few aborts with large IO sizes like 
32MB, 16MB etc. I don't see the login timeout failure with the incremental patch.
> 
>    > [  251.543847] ABORT_TASK: Found referenced iSCSI task_tag: 68
>    > [  251.543858] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 68
>    > [  251.544071] ABORT_TASK: Found referenced iSCSI task_tag: 46
>    > [  251.544076] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 46
>    > [  251.544300] ABORT_TASK: Found referenced iSCSI task_tag: 51
>    > [  251.544307] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 51
>    > [  251.544528] ABORT_TASK: Found referenced iSCSI task_tag: 117
>    > [  251.544533] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>    117
>    > [  282.778402] Unable to locate ITT: 0x00000062 on CID: 0
>    > [  282.778404] Unable to locate RefTaskTag: 0x00000062 on CID: 0.
>    > [  282.778407] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag:
>    0
>    > [  282.778436] ABORT_TASK: Found referenced iSCSI task_tag: 37
>    > [  282.778439] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 37
>    > [  282.778519] ABORT_TASK: Found referenced iSCSI task_tag: 65
>    > [  282.778522] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 65
>    > [  282.778559] Unable to locate ITT: 0x00000019 on CID: 0
>    > [  282.778560] Unable to locate RefTaskTag: 0x00000019 on CID: 0.
>    > [  282.778563] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag:
>    0
>    > [  293.004632] ABORT_TASK: Found referenced iSCSI task_tag: 22
>    > [  293.004643] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 22
>    > [  293.004808] TMR Opcode TARGET_WARM_RESET authorization failed for
>    Initiator Node: iqn.1994-05.com.redhat:37fc2b6a5331
>    > [  310.265993] iSCSI Login timeout on Network Portal 10.0.2.2:3260
>    > [  493.567143] INFO: task iscsi_np:3998 blocked for more than 120
>    seconds.
>    > [  493.567149]       Tainted: G           OE   4.9.0-For-DSGL-Debug+ #1
>    > [  493.567151] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>    disables this message.
>    > [  493.567153] iscsi_np        D    0  3998      2 0x00000084
>    > [  493.567159]  0000000000000000 ffff883028a88b40 ffff88302693f7c0
>    ffff88302de3e300
>    > [  493.567163]  ffff88303e057a40 ffffc900008abc00 ffffffff815e415b
>    ffffc900008abca0
>    > [  493.567166]  ffffffff81af4180 ffff883028a88b40 7fffffffffffffff
>    ffff882f7426db68
>    > [  493.567170] Call Trace:
>    > [  493.567180]  [<ffffffff815e415b>] ? __schedule+0x23b/0x6c0
>    > [  493.567184]  [<ffffffff815e4616>] schedule+0x36/0x80
>    > [  493.567187]  [<ffffffff815e71c3>] schedule_timeout+0x1f3/0x390
>    > [  493.567194]  [<ffffffff812c1f2e>] ? number+0x2ce/0x300
>    > [  493.567197]  [<ffffffff812c42be>] ? vsnprintf+0x32e/0x4b0
>    > [  493.567199]  [<ffffffff815e56a4>] wait_for_completion+0xb4/0xf0
>    > [  493.567201]  [<ffffffff8108a6b0>] ? wake_up_q+0x80/0x80
>    > [  493.567216]  [<ffffffffa05284d1>] iscsit_stop_session+0x1b1/0x1c0
>    [iscsi_target_mod]
>    > [  493.567220]  [<ffffffffa0519db2>]
>    iscsi_check_for_session_reinstatement+0x1e2/0x270 [iscsi_target_mod]
>    > [  493.567225]  [<ffffffffa051c8c0>]
>    iscsi_target_check_for_existing_instances+0x30/0x40 [iscsi_target_mod]
>    > [  493.567230]  [<ffffffffa051c9ee>] iscsi_target_do_login+0x11e/0x610
>    [iscsi_target_mod]
>    > [  493.567235]  [<ffffffffa051dbbe>]
>    iscsi_target_start_negotiation+0x4e/0xa0 [iscsi_target_mod]
>    > [  493.567239]  [<ffffffffa051b933>]
>    iscsi_target_login_thread+0xa73/0xfc0 [iscsi_target_mod]
>    > [  493.567244]  [<ffffffffa051aec0>] ?
>    iscsi_target_login_sess_out+0x250/0x250 [iscsi_target_mod]
>    > [  493.567246]  [<ffffffff810660c4>] ? SyS_exit_group+0x14/0x20
>    > [  493.567249]  [<ffffffff8107ee5a>] kthread+0xca/0xe0
>    > [  493.567250]  [<ffffffff8107ed90>] ? kthread_park+0x60/0x60
>    > [  493.567252]  [<ffffffff815e8752>] ret_from_fork+0x22/0x30
>    > [  493.567254] INFO: task iscsi_trx:4149 blocked for more than 120
>    seconds.
>    > [  493.567255]       Tainted: G           OE   4.9.0-For-DSGL-Debug+ #1
>    > [  493.567256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>    disables this message.
>    > [  493.567256] iscsi_trx       D    0  4149      2 0x00000084
>    > [  493.567258]  0000000000000000 ffff882f5ef70340 ffff882f76253040
>    ffff88302de40340
>    > [  493.567259]  ffff88303e097a40 ffffc90000f93c90 ffffffff815e415b
>    ffffffff815e56c2
>    > [  493.567261]  0000000000000001 ffff882f5ef70340 7fffffffffffffff
>    ffff882d233cc9b0
>    > [  493.567262] Call Trace:
>    > [  493.567264]  [<ffffffff815e415b>] ? __schedule+0x23b/0x6c0
>    > [  493.567265]  [<ffffffff815e56c2>] ? wait_for_completion+0xd2/0xf0
>    > [  493.567267]  [<ffffffff815e4616>] schedule+0x36/0x80
>    > [  493.567268]  [<ffffffff815e71c3>] schedule_timeout+0x1f3/0x390
>    > [  493.567281]  [<ffffffffa118e910>] ? ib_sg_to_pages+0x1a0/0x1a0
>    [ib_core]
>    > [  493.567285]  [<ffffffffa118e910>] ? ib_sg_to_pages+0x1a0/0x1a0
>    [ib_core]
>    > [  493.567286]  [<ffffffff815e56a4>] wait_for_completion+0xb4/0xf0
>    > [  493.567287]  [<ffffffff8108a6b0>] ? wake_up_q+0x80/0x80
>    > [  493.567302]  [<ffffffffa04d5399>]
>    target_wait_for_sess_cmds+0x49/0x1a0 [target_core_mod]
>    > [  493.567306]  [<ffffffffa0471644>] isert_wait_conn+0x1a4/0x2d0
>    [ib_isert]
>    > [  493.567311]  [<ffffffffa05289ab>] iscsit_close_connection+0x14b/0x820
>    [iscsi_target_mod]
>    > [  493.567315]  [<ffffffffa05172ef>]
>    iscsit_take_action_for_connection_exit+0x7f/0x100 [iscsi_target_mod]
>    > [  493.567320]  [<ffffffffa0528265>] iscsi_target_rx_thread+0x95/0xa0
>    [iscsi_target_mod]
>    > [  493.567324]  [<ffffffffa05281d0>] ?
>    iscsi_target_tx_thread+0x1b0/0x1b0 [iscsi_target_mod]
>    > [  493.567325]  [<ffffffff8107ee5a>] kthread+0xca/0xe0
>    > [  493.567327]  [<ffffffff8107ed90>] ? kthread_park+0x60/0x60
>    > [  493.567328]  [<ffffffff815e8752>] ret_from_fork+0x22/0x30
>    >
> 
>    OK, AFAICT these hung tasks basically mean iser-target never gave
>    target-core back the outstanding se_cmd it was attempting to retry, even
>    up to the point in which session reinstatement was eventually triggered
>    by iser-initiator.
Yes.
> 
>    So the reason why your patch to swap post_recv -> post_send to post_send
>    -> post_recv makes a difference is because it allows enough trickle of
>    RDMA_WRITEs to make it through, where iser-initiator doesn't attempt to
>    escalate recovery and doesn't attempt session reinstatement.
I don't know exactly whether the above comes into play, but the actual reason I 
swapped the RQ and SQ posting is that, unlike the SQ, the RQ is filled with WRs 
to the brim during initialisation itself. From then on we post one RQ WR for 
every RQ completion, which keeps it almost full at any point in time.

Now in our scenario the SQ is miscalculated and too small for a few adapters, 
so it fills up gradually as the IO starts. Once the SQ is full, with your 
patches isert queues the command and tries to repost it. In isert functions 
such as isert_post_response() and isert_put_datain(), the post send is done 
after the post recv. So for the first post send failure in, say, 
isert_put_datain(), the corresponding recv is already posted; then, on queuing 
the command and retrying the post, an extra recv is posted again, which fills 
up the RQ as well.

By swapping post recv and post send as in my incremental patch, we don't post 
that extra recv, and we post the recv only on a successful post send.
Therefore I think this incremental patch is necessary. 
> 
>    Have you been able to confirm with your incremental patch below that
>    once TMR ABORT_TASKs start to occur due to isert_rdma_rw_ctx_post
>    failures, that any subsequent session reinstatement events for unrelated
>    reasons (or explicit logout..?) doesn't run into the same issue..?
Yes, it has tested out fine. I've left it running since yesterday and IO is 
still running without any login timeouts.
[root@bharat ~]# iscsiadm -m session -P3 | grep "Attached scsi disk"
                        Attached scsi disk sdj          State: running
                        Attached scsi disk sdc          State: running
                        Attached scsi disk sdd          State: running
                        Attached scsi disk sde          State: running
                        Attached scsi disk sdf          State: running
                        Attached scsi disk sdg          State: running
                        Attached scsi disk sdh          State: running
                        Attached scsi disk sdi          State: running
[root@bharat ~]# uptime
 13:12:04 up 1 day,  1:22,  3 users,  load average: 8.11, 8.30, 7.26

> 
>    > -> In order to address this I tried the following (experimental) change
>    on top of your 3 patches:
>    >
>    > diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
>    b/drivers/infiniband/ulp/isert/ib_isert.c
>    > index 501a0ad124a7..0aead5faa2a2 100644
>    > --- a/drivers/infiniband/ulp/isert/ib_isert.c
>    > +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>    > @@ -1817,16 +1817,16 @@ isert_post_response(struct isert_conn
>    *isert_conn, struct isert_cmd *isert_cmd)
>    >         struct ib_send_wr *wr_failed;
>    >         int ret;
>    >
>    > -       ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
>    > +       ret = ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
>    > +                          &wr_failed);
>    >         if (ret) {
>    > -               isert_err("ib_post_recv failed with %d\n", ret);
>    > +               isert_err("ib_post_send failed with %d\n", ret);
>    >                 return ret;
>    >         }
>    >
>    > -       ret = ib_post_send(isert_conn->qp, &isert_cmd->tx_desc.send_wr,
>    > -                          &wr_failed);
>    > +       ret = isert_post_recv(isert_conn, isert_cmd->rx_desc);
>    >         if (ret) {
>    > -               isert_err("ib_post_send failed with %d\n", ret);
>    > +               isert_err("ib_post_recv failed with %d\n", ret);
>    >                 return ret;
>    >         }
>    >         return ret;
>    > @@ -2166,6 +2166,13 @@ isert_put_datain(struct iscsi_conn *conn, struct
>    iscsi_cmd *cmd)
>    >         if (isert_prot_cmd(isert_conn, se_cmd)) {
>    >                 isert_cmd->tx_desc.tx_cqe.done = isert_rdma_write_done;
>    >                 cqe = &isert_cmd->tx_desc.tx_cqe;
>    > +
>    > +               rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe,
>    chain_wr);
>    > +               if (rc) {
>    > +                       isert_dbg("Cmd: %p posted RDMA_WRITE for iSER
>    Data READ rc: %d\n",
>    > +                                  isert_cmd, rc);
>    > +                       return rc;
>    > +               }
>    >         } else {
>    >                 /*
>    >                  * Build isert_conn->tx_desc for iSCSI response PDU and
>    attach
>    > @@ -2178,18 +2185,21 @@ isert_put_datain(struct iscsi_conn *conn, struct
>    iscsi_cmd *cmd)
>    >                 isert_init_send_wr(isert_conn, isert_cmd,
>    >                                    &isert_cmd->tx_desc.send_wr);
>    >
>    > +               chain_wr = &isert_cmd->tx_desc.send_wr;
>    > +
>    > +               rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe,
>    chain_wr);
>    > +               if (rc) {
>    > +                       isert_err("Cmd: %p posted RDMA_WRITE for iSER
>    Data READ rc: %d\n",
>    > +                                  isert_cmd, rc);
>    > +                       return rc;
>    > +               }
>    > +
>    >                 rc = isert_post_recv(isert_conn, isert_cmd->rx_desc);
>    >                 if (rc) {
>    >                         isert_err("ib_post_recv failed with %d\n", rc);
>    >                         return rc;
>    >                 }
>    > -
>    > -               chain_wr = &isert_cmd->tx_desc.send_wr;
>    >         }
>    > -       rc = isert_rdma_rw_ctx_post(isert_cmd, isert_conn, cqe,
>    chain_wr);
>    > -       isert_dbg("Cmd: %p posted RDMA_WRITE for iSER Data READ rc:
>    %d\n",
>    > -                  isert_cmd, rc);
>    > -
>    >         return rc;
>    >  }
>    >
> 
>    Unsure of the implications for other RNICs + iser-target here..
I believe it is needed. Let's wait for Sagi's opinion.
> 
>    I'll have to defer to Sagi judgment on this one.  ;)
> 
>    > -> With these changes it works fine, But I still see few Aborts
>    intermittently as follows:
>    >
>    > [10674.434643] isert: isert_rdma_rw_ctx_post: Cmd: ffff882cbb961138
>    failed to post RDMA res
>    > [10674.434690] isert: isert_rdma_rw_ctx_post: Cmd: ffff882cbb9434c0
>    failed to post RDMA res
>    > [10715.186372] ABORT_TASK: Found referenced iSCSI task_tag: 58
>    > [10715.186417] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 58
>    >
> 
>    OK, so there is still a delay between the isert_rdma_rw_ctx_post
>    failures, and receiving TMR ABORT_TASKs.
> 
>    Is target-core -> iser-target aggressively retrying the posting of
>    RDMA_WRITEs via isert_put_datain() during this entire time..?
> 
Yes, at higher IO sizes it happens.

I believe that in addition to this queue-full handling there should be a flow 
control mechanism to slow down the posts on ENOMEM/post failures.

I see something like that happening on the initiator:

iscsi_iser.h:
/* the max TX (send) WR supported by the iSER QP is defined by                 *
 * max_send_wr = T * (1 + D) + C ; D is how many inflight dataouts we expect   *
 * to have at max for SCSI command. The tx posting & completion handling code  *
 * supports -EAGAIN scheme where tx is suspended till the QP has room for more *
 * send WR. D=8 comes from 64K/8K                                              */

"where tx is suspended til the QP has room for more send WR"
I am not sure if that is done on target side.
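
Purely to illustrate what I mean by flow control on the target side (the 
post_count/sq_depth fields and the helper below do not exist in isert today, 
they are just a sketch):

static int isert_post_send_flow_controlled(struct isert_conn *isert_conn,
					   struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	/* defer instead of overflowing the SQ; send completions decrement the count */
	if (atomic_inc_return(&isert_conn->post_count) > isert_conn->sq_depth) {
		atomic_dec(&isert_conn->post_count);
		return -EAGAIN;	/* let target-core queue-full logic retry */
	}

	return ib_post_send(isert_conn->qp, wr, &bad_wr);
}
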
>    > -> Apart from the Aborts and post failure logs filling dmesg, luns are
>    in state "running" and
>    > fio countinues to run.
>    > [root@bharat ~]# iscsiadm -m session -P3 | grep "Attached scsi disk"
>    >                         Attached scsi disk sdj          State: running
>    >                         Attached scsi disk sdc          State: running
>    >                         Attached scsi disk sdd          State: running
>    >                         Attached scsi disk sde          State: running
>    >                         Attached scsi disk sdf          State: running
>    >                         Attached scsi disk sdg          State: running
>    >                         Attached scsi disk sdh          State: running
>    >                         Attached scsi disk sdi          State: running
>    >
> 
>    Ok, so it looks like you've got the second order issues under control
>    for target-core + iser-target, once the first order issue of RDMA R/W
>    API vs. iw_cxgb4 WR resource mis-sizing continuously triggers failed
>    calls to isert_put_datain() -> isert_rdma_rw_ctx_post()
>    rdma_rw_ctx_post().
Yes, Thank you for the patches :).
> 
>    That said, I'm going to go ahead and queue patch #1 and #2 from the
>    original series to target-pending/for-next as they do address issues
>    separate from iser-target scenario here, and await on Sagi's feedback
>    for patch #3 and your incremental above.
> 
>    Thanks again.
> 
> References
> 
>    Visible links
>    1. https://www.spinics.net/lists/target-devel/msg13571.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: SQ overflow seen running isert traffic
       [not found]                             ` <20170321075131.GA11565-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2017-03-21 13:52                               ` Sagi Grimberg
       [not found]                                 ` <945e2947-f67a-4202-cd27-d4631fe10f68-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2017-03-21 13:52 UTC (permalink / raw)
  To: Potnuri Bharat Teja, Nicholas A. Bellinger
  Cc: SWise OGC, target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Bharat and Nic,

Apologies for the late reply,

> Hi Nicholas,
> I see them from 2MB onwards.
>>
>>    > Here is what I see with the 3 patches alone applied:
>>    >
>>    > -> In isert_put_datain() and isert_post_response() a corresponding recv
>>    WR is posted before
>>    > posting a send and hence for every post failure a recv is already posted
>>    into a tightly packed
>>    > RQ, causing it to overflow.
>>
>>    Just for me to understand, the intermittent TMR ABORT_TASKs are caused
>>    by the repeated failure to post RDMA_WRITE WRs for a large ISER Data-In
>>    payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
>>    underlying hardware capabilities.
> Yes.
>>
>>    Moving the recv posts after the send post for RDMA_WRITEs helps to
>>    reduce the severity of the issue with iw_cxgb4, but doesn't completely
>>    eliminate the issue under load.
> Moving recv posts only comes in to effect along with your changes.

...

>>    So the reason why your patch to swap post_recv -> post_send to post_send
>>    -> post_recv makes a difference is because it allows enough trickle of
>>    RDMA_WRITEs to make it through, where iser-initiator doesn't attempt to
>>    escalate recovery and doesn't attempt session reinstatement.
> I dont exactly know if above thing comes into play but the actual reason I did
> swap posting RQ and SQ is, unlike SQ, RQ is posted with WRs to the brim during
> the intialisation itself. From thereon we post a RQ WR for every RQ completion
> That makes it almost full at any point of time.
>
> Now in our scenario, SQ is miscalulated and too small for few adapters and so
> filled gradually as the IO starts. Once SQ is full, according to your patches
> isert queues it and tries to repost the command again. Here in iser functions
> like isert_post_response(), isert_put_datain() post send is done after post recv.
> For the first post send failure in say isert_put_datain(), the corresponding
> post recv is already posted, then on queuing the command and trying reposting
> an extra recv is again posted which fills up the RQ also.
>
>  By swapping post recv and send as in my incermental patch, we dont post that
> extra recv, and post recv only on successful post send.
> Therfore I think this incremental patch is necessary.

Reversing the order of recv and send posting will cause problems
in stress IO workloads (especially for iWARP). The problem with sending
a reply before reposting the recv buffer is that the initiator can
immediately send a new request while we don't have a recv buffer waiting
for it, which will cause an RNR-NAK. This *will* cause performance drops
and jitter for sure.

How about we just track the rx_desc to know if we already posted it as
a start (untested as I don't have access to RDMA HW this week):
--
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 9b33c0c97468..fcbed35e95a8 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -817,6 +817,7 @@ isert_post_recvm(struct isert_conn *isert_conn, u32 count)
 		rx_wr->sg_list = &rx_desc->rx_sg;
 		rx_wr->num_sge = 1;
 		rx_wr->next = rx_wr + 1;
+		rx_desc->in_use = false;
 	}
 	rx_wr--;
 	rx_wr->next = NULL; /* mark end of work requests list */
@@ -835,6 +836,15 @@ isert_post_recv(struct isert_conn *isert_conn, struct iser_rx_desc *rx_desc)
 	struct ib_recv_wr *rx_wr_failed, rx_wr;
 	int ret;
 
+	if (!rx_desc->in_use) {
+		/*
+		 * if the descriptor is not in-use we already reposted it
+		 * for recv, so just silently return
+		 */
+		return 0;
+	}
+
+	rx_desc->in_use = false;
 	rx_wr.wr_cqe = &rx_desc->rx_cqe;
 	rx_wr.sg_list = &rx_desc->rx_sg;
 	rx_wr.num_sge = 1;
@@ -1397,6 +1407,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
 		return;
 	}
 
+	rx_desc->in_use = true;
+
 	ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
 			ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
 
diff --git a/drivers/infiniband/ulp/isert/ib_isert.h b/drivers/infiniband/ulp/isert/ib_isert.h
index c02ada57d7f5..87d994de8c91 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.h
+++ b/drivers/infiniband/ulp/isert/ib_isert.h
@@ -60,7 +60,7 @@
 
 #define ISER_RX_PAD_SIZE	(ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
 		(ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct ib_sge) + \
-		 sizeof(struct ib_cqe)))
+		 sizeof(struct ib_cqe) + sizeof(bool)))
 
 #define ISCSI_ISER_SG_TABLESIZE		256
 
@@ -85,6 +85,7 @@ struct iser_rx_desc {
 	u64		dma_addr;
 	struct ib_sge	rx_sg;
 	struct ib_cqe	rx_cqe;
+	bool		in_use;
 	char		pad[ISER_RX_PAD_SIZE];
 } __packed;
--

We have a lot of room for cleanups in isert... I'll need to
make some time to get it going...

I'll be waiting to hear from you if it makes your issue go away.

Cheers,
Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [SPAMMY (7.002)]Re: SQ overflow seen running isert traffic
       [not found]                                 ` <945e2947-f67a-4202-cd27-d4631fe10f68-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-03-21 15:25                                   ` Potnuri Bharat Teja
       [not found]                                     ` <20170321152506.GA32655-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Potnuri Bharat Teja @ 2017-03-21 15:25 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Nicholas A. Bellinger, SWise OGC,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tuesday, March 03/21/17, 2017 at 19:22:30 +0530, Sagi Grimberg wrote:
Hi Sagi,
The patch works fine!
>    Hi Baharat and Nic,
> 
>    Apologies for the late reply,
> 
>    > Hi Nicholas,
>    > I see them from 2MB onwards.
>    >>
>    >>    > Here is what I see with the 3 patches alone applied:
>    >>    >
>    >>    > -> In isert_put_datain() and isert_post_response() a corresponding
>    recv
>    >>    WR is posted before
>    >>    > posting a send and hence for every post failure a recv is already
>    posted
>    >>    into a tightly packed
>    >>    > RQ, causing it to overflow.
>    >>
>    >>    Just for me to understand, the intermittent TMR ABORT_TASKs are
>    caused
>    >>    by the repeated failure to post RDMA_WRITE WRs for a large ISER
>    Data-In
>    >>    payload, due to mis-sizing of needed WRs from RDMA R/W API vs.
>    >>    underlying hardware capabilities.
>    > Yes.
>    >>
>    >>    Moving the recv posts after the send post for RDMA_WRITEs helps to
>    >>    reduce the severity of the issue with iw_cxgb4, but doesn't
>    completely
>    >>    eliminate the issue under load.
>    > Moving recv posts only comes in to effect along with your changes.
> 
>    ...
> 
>    >>    So the reason why your patch to swap post_recv -> post_send to
>    post_send
>    >>    -> post_recv makes a difference is because it allows enough trickle
>    of
>    >>    RDMA_WRITEs to make it through, where iser-initiator doesn't attempt
>    to
>    >>    escalate recovery and doesn't attempt session reinstatement.
>    > I dont exactly know if above thing comes into play but the actual reason
>    I did
>    > swap posting RQ and SQ is, unlike SQ, RQ is posted with WRs to the brim
>    during
>    > the intialisation itself. From thereon we post a RQ WR for every RQ
>    completion
>    > That makes it almost full at any point of time.
>    >
>    > Now in our scenario, SQ is miscalulated and too small for few adapters
>    and so
>    > filled gradually as the IO starts. Once SQ is full, according to your
>    patches
>    > isert queues it and tries to repost the command again. Here in iser
>    functions
>    > like isert_post_response(), isert_put_datain() post send is done after
>    post recv.
>    > For the first post send failure in say isert_put_datain(), the
>    corresponding
>    > post recv is already posted, then on queuing the command and trying
>    reposting
>    > an extra recv is again posted which fills up the RQ also.
>    >
>    >  By swapping post recv and send as in my incermental patch, we dont post
>    that
>    > extra recv, and post recv only on successful post send.
>    > Therfore I think this incremental patch is necessary.
> 
>    Reversing the order to recv and send posting will cause problems
>    in stress IO workloads (especially for iWARP). The problem of sending
>    a reply before reposting the recv buffer is that the initiator can send
>    immediately a new request and we don't have a recv buffer waiting for
>    it, which will cause RNR-NAK. This *will* cause performance drops and
>    jitters for sure.
Totally agree with you.
> 
>    How about we just track the rx_desc to know if we already posted it as
>    a start (untested as I don't have access to RDMA HW this week):
>    --
>    diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
>    b/drivers/infiniband/ulp/isert/ib_isert.c
>    index 9b33c0c97468..fcbed35e95a8 100644
>    --- a/drivers/infiniband/ulp/isert/ib_isert.c
>    +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>    @@ -817,6 +817,7 @@ isert_post_recvm(struct isert_conn *isert_conn, u32
>    count)
>                     rx_wr->sg_list = &rx_desc->rx_sg;
>                     rx_wr->num_sge = 1;
>                     rx_wr->next = rx_wr + 1;
>    +               rx_desc->in_use = false;
>             }
>             rx_wr--;
>             rx_wr->next = NULL; /* mark end of work requests list */
>    @@ -835,6 +836,15 @@ isert_post_recv(struct isert_conn *isert_conn,
>    struct iser_rx_desc *rx_desc)
>             struct ib_recv_wr *rx_wr_failed, rx_wr;
>             int ret;
> 
>    +       if (!rx_desc->in_use) {
>    +               /*
>    +                * if the descriptor is not in-use we already reposted it
>    +                * for recv, so just silently return
>    +                */
>    +               return 0;
>    +       }
>    +
>    +       rx_desc->in_use = false;
>             rx_wr.wr_cqe = &rx_desc->rx_cqe;
>             rx_wr.sg_list = &rx_desc->rx_sg;
>             rx_wr.num_sge = 1;
>    @@ -1397,6 +1407,8 @@ isert_recv_done(struct ib_cq *cq, struct ib_wc *wc)
>                     return;
>             }
> 
>    +       rx_desc->in_use = true;
>    +
>             ib_dma_sync_single_for_cpu(ib_dev, rx_desc->dma_addr,
>                             ISER_RX_PAYLOAD_SIZE, DMA_FROM_DEVICE);
> 
>    diff --git a/drivers/infiniband/ulp/isert/ib_isert.h
>    b/drivers/infiniband/ulp/isert/ib_isert.h
>    index c02ada57d7f5..87d994de8c91 100644
>    --- a/drivers/infiniband/ulp/isert/ib_isert.h
>    +++ b/drivers/infiniband/ulp/isert/ib_isert.h
>    @@ -60,7 +60,7 @@
> 
>      #define ISER_RX_PAD_SIZE       (ISCSI_DEF_MAX_RECV_SEG_LEN + 4096 - \
>                     (ISER_RX_PAYLOAD_SIZE + sizeof(u64) + sizeof(struct
>    ib_sge) + \
>    -                sizeof(struct ib_cqe)))
>    +                sizeof(struct ib_cqe) + sizeof(bool)))
> 
>      #define ISCSI_ISER_SG_TABLESIZE                256
> 
>    @@ -85,6 +85,7 @@ struct iser_rx_desc {
>             u64             dma_addr;
>             struct ib_sge   rx_sg;
>             struct ib_cqe   rx_cqe;
>    +       bool            in_use;
>             char            pad[ISER_RX_PAD_SIZE];
>      } __packed;
>    --
> 
>    We have a lot of room for cleanups in isert... I'll need to
>    make some time to get it going...
> 
>    I'll be waiting to hear from you if it makes your issue go away.
Test runs fine and it solved the issue.
Thanks for the patch!
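For anyone hitting the same overflow, my reading of how the flag flows (a
rough sketch based on the patch above, not the literal code paths):

	/* isert_recv_done(): a receive completion hands the descriptor back
	 * to us; it now needs to be reposted exactly once. */
	rx_desc->in_use = true;

	/* isert_post_recv(): both the datain path (isert_put_datain()) and
	 * the response path may ask to repost the same descriptor; only the
	 * first caller actually posts it, the second returns silently. */
	if (!rx_desc->in_use)
		return 0;		/* already reposted */
	rx_desc->in_use = false;
	ret = ib_post_recv(isert_conn->qp, &rx_wr, &rx_wr_failed);

So the RQ never gets more than one outstanding recv WR per descriptor, which
matches why the RQ overflow goes away in my runs.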
> 
>    Cheers,
>    Sagi.
Thanks,
Bharat.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [SPAMMY (7.002)]Re: SQ overflow seen running isert traffic
       [not found]                                     ` <20170321152506.GA32655-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
@ 2017-03-21 16:38                                       ` Sagi Grimberg
       [not found]                                         ` <4dab6b43-20d3-86f0-765a-be0851e9f4a0-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Sagi Grimberg @ 2017-03-21 16:38 UTC (permalink / raw)
  To: Potnuri Bharat Teja
  Cc: Nicholas A. Bellinger, SWise OGC,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


> Test runs fine and it solved the issue.
> Thanks for the patch!

Cool, can I add your "Tested-by" tag on a formal patch then?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [SPAMMY (7.002)]Re: SQ overflow seen running isert traffic
       [not found]                                         ` <4dab6b43-20d3-86f0-765a-be0851e9f4a0-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2017-03-21 17:50                                           ` Potnuri Bharat Teja
  0 siblings, 0 replies; 24+ messages in thread
From: Potnuri Bharat Teja @ 2017-03-21 17:50 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Nicholas A. Bellinger, SWise OGC,
	target-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tuesday, March 03/21/17, 2017 at 22:08:04 +0530, Sagi Grimberg wrote:
>    > Test runs fine and it solved the issue.
>    > Thanks for the patch!
> 
>    Cool, can I add your "Tested-by" tag on a formal patch then?
Sure, please add.
Tested-by: Potnuri Bharat Teja <bharat-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Thanks.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-03-21 17:50 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-27  7:01 RQ overflow seen running isert traffic Potnuri Bharat Teja
     [not found] ` <20160927070157.GA13140-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
2016-09-29 14:12   ` Steve Wise
2016-10-05  6:14 ` Sagi Grimberg
2016-10-17 11:16   ` Potnuri Bharat Teja
2016-10-17 18:29     ` Steve Wise
2016-10-18  8:04       ` Sagi Grimberg
2016-10-18 11:28         ` SQ " Potnuri Bharat Teja
2016-10-18 13:17           ` Sagi Grimberg
     [not found]             ` <ed7ebb39-be81-00b3-ef23-3f4c0e3afbb1-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-10-18 14:34               ` Steve Wise
2016-10-18 16:13                 ` Jason Gunthorpe
2016-10-18 19:03                   ` Steve Wise
2016-10-20  8:34                   ` Sagi Grimberg
     [not found]                     ` <f7a4b395-1786-3c7a-7639-195e830db5ad-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-03-20 13:05                       ` Potnuri Bharat Teja
2017-03-20 15:04                         ` Steve Wise
2016-10-31  3:40                 ` Nicholas A. Bellinger
2016-11-02 17:03                   ` Steve Wise
     [not found]                   ` <1477885208.27946.8.camel-XoQW25Eq2zviZyQQd+hFbcojREIfoBdhmpATvIKMPHk@public.gmane.org>
2016-11-08 10:06                     ` Potnuri Bharat Teja
2017-03-20 10:15                       ` Potnuri Bharat Teja
2017-03-21  6:32                         ` Nicholas A. Bellinger
2017-03-21  7:51                           ` Potnuri Bharat Teja
     [not found]                             ` <20170321075131.GA11565-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
2017-03-21 13:52                               ` Sagi Grimberg
     [not found]                                 ` <945e2947-f67a-4202-cd27-d4631fe10f68-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-03-21 15:25                                   ` [SPAMMY (7.002)]Re: " Potnuri Bharat Teja
     [not found]                                     ` <20170321152506.GA32655-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
2017-03-21 16:38                                       ` Sagi Grimberg
     [not found]                                         ` <4dab6b43-20d3-86f0-765a-be0851e9f4a0-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-03-21 17:50                                           ` Potnuri Bharat Teja
