From: "Steve Wise" <swise@opengridcomputing.com>
To: 'Potnuri Bharat Teja' <bharat@chelsio.com>,
	'Sagi Grimberg' <sagi@grimberg.me>
Cc: 'Jason Gunthorpe' <jgunthorpe@obsidianresearch.com>,
	target-devel@vger.kernel.org, nab@linux-iscsi.org,
	linux-rdma@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: RE: SQ overflow seen running isert traffic
Date: Mon, 20 Mar 2017 10:04:06 -0500
Message-ID: <0dc201d2a18b$385caf10$a9160d30$@opengridcomputing.com>
In-Reply-To: <20170320130545.GB11699@chelsio.com>

+Christoph

> 
> On Thursday, October 10/20/16, 2016 at 14:04:34 +0530, Sagi Grimberg wrote:
> >    Hey Jason,
> >
> >    >> 1) we believe the iSER + RW API correctly sizes the SQ, yet we're
> >    seeing SQ
> >    >> overflows.  So the SQ sizing needs more investigation.
> >    >
> >    > NFS had this sort of problem - in that case it was because the code
> >    > was assuming that a RQ completion implied SQ space - that is not
> >    > legal, only direct completions from SQ WCs can guide available space
> >    > in the SQ..
> >
> >    Its not the same problem. iser-target does not tie SQ and RQ spaces.
> >    The origin here is the difference between IB/RoCE and iWARP and the
> >    chelsio HW that makes it hard to predict the SQ correct size.
> >
> >    iWARP needs extra registration for rdma reads and the chelsio device
> >    seems to be limited in the number of pages per registration so this
> >    configuration will need a larger send queue.
> >
> >    Another problem is that we don't have a correct retry flow.
> >
> >    Hopefully we can address that in the RW API which is designed to hide
> >    these details from the ULP...
> Hi Sagi,
> Here is what our further analysis of SQ dump at the time of overflow says:
> 
> RDMA read/write API is creating long chains (32 WRs) to handle large ISCSI
> READs. For Writing iscsi default block size of 512KB data, iw_cxgb4's max
> number of sge advertised is 4 page ~ 16KB for write, needs WR chain of 32 WRs
> (another possible factor is they all are unsignalled WRs and are completed
> only after next signalled WR). But apparently rdma_rw_init_qp() assumes that
> any given IO will take only 1 WRITE WR to convey the data.
> 
> This evidently is incorrect and rdma_rw_init_qp() needs to factor and size
> the queue based on max_sge of device for write and read and the sg_tablesize
> for which rdma read/write is used for, like ISCSI_ISER_MAX_SG_TABLESIZE of
> initiator. If above analysis is correct, please suggest how could this be
> fixed?
> 
> Further, using MRs for rdma WRITE by using rdma_wr_force_mr = 1 module
> parameter of ib_core avoids SQ overflow by registering a single REG_MR and
> using that MR for a single WRITE WR. So a rdma-rw IO chain of say 32 WRITE
> WRs, becomes just 3 WRS:  REG_MR + WRITE + INV_MR as
> max_fast_reg_page_list_len of iw_cxgb4 is 128 page.
> 
> (By default force_mr is not set and iw_cxgb4 could only use MR for rdma
> READs only as per rdma_rw_io_needs_mr() if force_mr isn't set)
> From this is there any possibility that we could use MR if the write WR
> chain exceeds a certain number?
> 
> Thanks for your time!
> 
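
To put numbers on the analysis quoted above, here is a small userspace sketch
of the WR-count arithmetic. It is not code from rw.c: the helper names
(div_round_up, wrs_plain_write, wrs_with_mr) are made up for illustration, and
a 4KB page size is assumed from the "4 page ~ 16KB" figure.

```c
/* Hypothetical helper (not a kernel API): ceiling division, d must be > 0. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/*
 * SQ entries needed when the payload goes out as plain RDMA WRITEs,
 * each WR carrying at most max_sge pages.
 */
static unsigned int wrs_plain_write(unsigned int io_bytes,
				    unsigned int page_size,
				    unsigned int max_sge)
{
	return div_round_up(io_bytes, page_size * max_sge);
}

/*
 * SQ entries needed when every chunk of up to max_mr_pages pages is sent
 * as REG_MR + WRITE + INV_MR (3 WRs per chunk).
 */
static unsigned int wrs_with_mr(unsigned int io_bytes,
				unsigned int page_size,
				unsigned int max_mr_pages)
{
	unsigned int pages = div_round_up(io_bytes, page_size);

	return 3 * div_round_up(pages, max_mr_pages);
}
```

For a 512KB IO with 4KB pages, wrs_plain_write() with max_sge = 4 gives 32 WRs,
while wrs_with_mr() with a 128-page registration limit gives 3, which matches
the numbers reported above.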

I think it is time to resolve this XXX comment in rw.c for
rdma_rw_io_needs_mr():

/*
 * Check if the device will use memory registration for this RW operation.
 * We currently always use memory registrations for iWarp RDMA READs, and
 * have a debug option to force usage of MRs.
 * XXX: In the future we can hopefully fine tune this based on HCA driver
 * input.
 */

Regardless of whether the HCA driver provides input, I think a chain of 30+ RDMA
WRITE WRs isn't as efficient as 1 REG_MR + 1 WRITE + 1 INV_MR.  Is it
unreasonable to just add some threshold in rw.c?  Also, I think
rdma_rw_init_qp() does need some tweaks: it needs to take into account the max
sge depth, the max REG_MR depth, and the max SQ depth device
attributes/capabilities when sizing the SQ.  However, if that computed depth
exceeds the device max, then the SQ will not be big enough to avoid potential
overflow, and I believe ULPs should _always_ flow control their outgoing WRs
based on the SQ depth regardless.  And perhaps rdma-rw should even avoid overly
deep SQs, just because they tend to inhibit scalability, e.g. allowing lots of
shallow QPs vs consuming all the device resources with very deep QPs...
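
Roughly what I have in mind, as a userspace model rather than a patch: struct
dev_caps, wrs_per_io(), sq_depth() and the sq_credits helpers are all
hypothetical names, the threshold is an arbitrary tunable, and the real
rdma_rw_init_qp() works on different structures.  All caps are assumed to be
non-zero.

```c
#include <stdbool.h>

/* Hypothetical subset of device limits relevant to SQ sizing. */
struct dev_caps {
	unsigned int max_sge;       /* pages per plain RDMA WRITE WR */
	unsigned int max_mr_pages;  /* pages per fast registration */
	unsigned int max_qp_wr;     /* deepest SQ the device supports */
};

static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	return (n + d - 1) / d;
}

/*
 * WRs consumed by one IO of io_pages pages: use REG_MR + WRITE + INV_MR
 * once the plain WRITE chain would exceed the threshold and the MR path
 * is actually shorter.
 */
static unsigned int wrs_per_io(const struct dev_caps *c,
			       unsigned int io_pages, unsigned int threshold)
{
	unsigned int plain = div_round_up(io_pages, c->max_sge);
	unsigned int mr = 3 * div_round_up(io_pages, c->max_mr_pages);

	return (plain > threshold && mr < plain) ? mr : plain;
}

/* Wanted SQ depth for max_ios concurrent IOs, clamped to the device max. */
static unsigned int sq_depth(const struct dev_caps *c, unsigned int max_ios,
			     unsigned int io_pages, unsigned int threshold)
{
	unsigned long long want =
		(unsigned long long)max_ios * wrs_per_io(c, io_pages, threshold);

	return want > c->max_qp_wr ? c->max_qp_wr : (unsigned int)want;
}

/* ULP-side flow control: never post more WRs than there are free slots. */
struct sq_credits {
	unsigned int avail;
};

/* Reserve n slots before posting a chain; the caller must queue on failure. */
static bool sq_reserve(struct sq_credits *sq, unsigned int n)
{
	if (n > sq->avail)
		return false;
	sq->avail -= n;
	return true;
}

/* Return n slots; call only from an SQ completion for the chain. */
static void sq_release(struct sq_credits *sq, unsigned int n)
{
	sq->avail += n;
}
```

The clamp in sq_depth() is the reason the credits are still needed: once the
wanted depth exceeds max_qp_wr, sizing alone cannot prevent overflow.  Since
unsignalled WRs only complete with the next signalled WR, the whole chain's
slots would be released together at that completion.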

Steve.
