From: Jason Gunthorpe
Subject: Re: Kernel fast memory registration API proposal [RFC]
Date: Wed, 15 Jul 2015 16:49:28 -0600
Message-ID: <20150715224928.GA941@obsidianresearch.com>
To: Chuck Lever
Cc: Sagi Grimberg, Christoph Hellwig, linux-rdma@vger.kernel.org,
 Steve Wise, Or Gerlitz, Oren Duer, Bart Van Assche, Liran Liss,
 Sean Hefty, Doug Ledford, Tom Talpey
List-Id: linux-rdma@vger.kernel.org

On Wed, Jul 15, 2015 at 05:25:11PM -0400, Chuck Lever wrote:

> NFS READ and WRITE data payloads are mapped with ib_map_phys_mr()
> just before the RPC is sent, and those payloads are unmapped
> with ib_unmap_fmr() as soon as the client sees the server's RPC
> reply.

Okay.. but.. ib_unmap_fmr is the thing that sleeps, so you must
already have a sleepable context when you call it?

I was poking around to see how NFS is working (to see how we might fit
a different API under here), and I didn't find the call to ro_unmap
I'd expect. xprt_rdma_free is presumably the place, but how it relates
to rpcrdma_reply_handler I could not obviously see. Does the upper
layer call back to xprt_rdma_free before any of the RDMA buffers are
touched? Can you clear up the call chain for me?

Second, the FRWR stuff looks deeply suspicious: it posts an
IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
triggers nothing.
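As a sketch of what I'd expect instead (hypothetical names, not the
actual xprtrdma code; the point is that the DMA unmap and buffer
handoff hang off the LOCAL_INV completion):

	/* Post the invalidate signaled so it generates a completion. */
	memset(&inv_wr, 0, sizeof(inv_wr));
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.send_flags = IB_SEND_SIGNALED;
	inv_wr.ex.invalidate_rkey = mr->rkey;
	inv_wr.wr_id = (uintptr_t)frmr;
	rc = ib_post_send(qp, &inv_wr, &bad_wr);

	/* ... and only in the send completion handler: */
	static void frwr_inv_done(struct ib_wc *wc)
	{
		struct my_frmr *frmr = (struct my_frmr *)(uintptr_t)wc->wr_id;

		if (wc->opcode == IB_WC_LOCAL_INV &&
		    wc->status == IB_WC_SUCCESS) {
			/* The HCA can no longer touch the memory, so it
			 * is now safe to DMA unmap and hand the buffer
			 * back to the upper layer. */
			dma_unmap_segments(frmr);
			complete(&frmr->inv_done);
		}
	}

Something waiting on that completion (or a callback chained off it) is
what should release the pages to the kernel, not the reply path
directly.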
Handoff to the kernel must be done only after seeing IB_WC_LOCAL_INV,
never before.

Third, all the unmaps do something like this:

 frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 {
	invalidate_wr.opcode = IB_WR_LOCAL_INV;
	[..]
	while (seg1->mr_nsegs--)
		rpcrdma_unmap_one(ia->ri_device, seg++);
	read_lock(&ia->ri_qplock);
	rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);

That is the wrong order: the DMA unmap in rpcrdma_unmap_one must only
be done once the invalidate is complete. For FMR that is when
ib_unmap_fmr returns; for FRWR it is when you see IB_WC_LOCAL_INV.

Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
the SQ? I'm guessing there is some kind of implicit flow control here:
the SEND buffer is recycled during RECV of the response, which limits
the SQ usage, and then there are guaranteed to be 3x as many SQEs as
SEND buffers to accommodate the REG_MR and INVALIDATE WRs??

> These memory regions require an rkey, which is sent in the RPC
> call to the server. The server performs RDMA READ or WRITE on
> these regions.
>
> I don't think the server ever uses FMR to register the target
> memory regions for RDMA READ and WRITE.

What happens if you hit the SGE limit when constructing the RDMA
READ/WRITE? Does the upper layer forbid that? What about iWARP: how do
you avoid the 1 SGE limit on RDMA READ?

Jason