From: Jason Gunthorpe
Subject: Re: Kernel fast memory registration API proposal [RFC]
Date: Thu, 16 Jul 2015 11:40:46 -0600
Message-ID: <20150716174046.GB3680@obsidianresearch.com>
References: <55A534D1.6030008@dev.mellanox.co.il>
 <20150714163506.GC7399@obsidianresearch.com>
 <55A53F0B.5050009@dev.mellanox.co.il>
 <20150714170859.GB19814@obsidianresearch.com>
 <55A6136A.8010204@dev.mellanox.co.il>
 <20150715171926.GB23588@obsidianresearch.com>
 <20150715224928.GA941@obsidianresearch.com>
To: Chuck Lever
Cc: Sagi Grimberg, Christoph Hellwig, linux-rdma@vger.kernel.org,
 Steve Wise, Or Gerlitz, Oren Duer, Bart Van Assche, Liran Liss,
 "Hefty, Sean", Doug Ledford, Tom Talpey
List-Id: linux-rdma@vger.kernel.org

On Thu, Jul 16, 2015 at 10:45:46AM -0400, Chuck Lever wrote:

> For some time, I've been considering deferring ib_unmap_fmr() to
> a work queue, but FMR is operational and is a bit of an antique
> so I haven't put much effort into bettering it.

Okay, I think I get it.. The fmr unmap could be made async, but there
would need to be some additional logic to defer reuse of the MR until
the unmap completes..

> The reply handler parses the incoming reply, looks up the XID and
> matches it to a waiting RPC request (xprt_lookup_rqst). It then
> wakes that request (xprt_complete_rqst). The tasklet goes to the
> next reply on the global list.
>
> The RPC scheduler sees the awoken RPC request and steps the
> finished request through to completion, at which point
> xprt_release() is invoked to retire the request slot.

At what point does the RPC layer touch the RDMA'd data? Based on your
description it must be after xprt_release is called?

> > Second, the FRWR stuff looks deeply suspicious, it is posting an
> > IB_WR_LOCAL_INV, but the completion of that (in frwr_sendcompletion)
> > triggers nothing. Handoff to the kernel must be done only after seeing
> > IB_WC_LOCAL_INV, never before.
>
> I don't understand. Our LOCAL_INV is typically unsignalled because
> there's nothing to do in the normal case. frwr_sendcompletion is
> there to handle only flushed sends.

The purpose of invalidate in the spec is to fence the RDMA. The
correct, secure way to use RDMA is to invalidate an rkey, then DMA
flush its backing memory, then access that data with the CPU.

The scheme NFS is using for FRWR trades security for performance by
running the invalidate async - so a black-hat peer can maliciously
abuse that. Recovering the lost performance is what SEND WITH
INVALIDATE is for. Usually such a trade-off should be a user option..
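To make the ordering concrete, here is roughly what the fenced version
looks like with today's kernel verbs. This is only a sketch, not the
NFS code - mr_sg(), mr_nents() and complete_request() are made-up
helpers standing in for whatever the ULP uses:

	#include <linux/string.h>
	#include <rdma/ib_verbs.h>

	/* Post the invalidate *signaled*, so there is an SCQE to key off */
	static int post_fenced_inv(struct ib_qp *qp, struct ib_mr *mr)
	{
		struct ib_send_wr inv_wr, *bad_wr;

		memset(&inv_wr, 0, sizeof(inv_wr));
		inv_wr.opcode = IB_WR_LOCAL_INV;
		inv_wr.wr_id = (u64)(unsigned long)mr;	/* matched in CQ handler */
		inv_wr.send_flags = IB_SEND_SIGNALED;	/* never unsignalled here */
		inv_wr.ex.invalidate_rkey = mr->rkey;

		return ib_post_send(qp, &inv_wr, &bad_wr);
	}

	/* From the send CQ handler - only here is the rkey actually dead */
	static void inv_done(struct ib_device *dev, struct ib_wc *wc)
	{
		struct ib_mr *mr = (struct ib_mr *)(unsigned long)wc->wr_id;

		if (wc->opcode != IB_WC_LOCAL_INV || wc->status != IB_WC_SUCCESS)
			return;		/* error handling elided */

		/* Remote access is fenced, give the pages back to the CPU.. */
		ib_dma_unmap_sg(dev, mr_sg(mr), mr_nents(mr), DMA_FROM_DEVICE);

		/* ..and only now let the upper layer parse the data */
		complete_request(mr);
	}

The key point is that the DMA unmap and the CPU access both sit on the
far side of the IB_WC_LOCAL_INV completion, never before it.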
> I'm not sure it matters here, because when the RPC reply shows
> up at the client, it already means the server isn't going to
> access that MR/rkey again. (If the server does access that MR
> again, it would be a protocol violation.)

A simple contained protocol violation would be fine, but this
escalates into threatening the whole machine if it happens:

 - Random memory corruption: freeing the RDMA buffers before the
   invalidate completes recycles them to other users while the remote
   can still write to them
 - CPU mis-operation: a validating parser can be subverted if the data
   under it is changed by the remote while it runs
 - Linux DMA API contract failure:
    * If bounce buffering, recycling the buffer will allow corruption
      of someone else's memory
    * Cache coherence is lost, resulting in unpredictable data
      corruption, happening unpredictably in time (ie perhaps after
      the buffer is recycled to some other use)
    * If using the IOMMU, a spurious DMA may machine check and hard
      stop the machine

[I wouldn't say this is serious enough to address immediately, but it
means NFS is not an example of best practices..]

> Can you provide an example in another kernel ULP?

I'm still looking at them, sadly there is a lot of 'interesting' stuff
in the ULPs :(

> > Finally, where is the flow control for posting the IB_WR_LOCAL_INV to
> > the SQ? I'm guessing there is some kind of implicit flow control here
> > where the SEND buffer is recycled during RECV of the response, and
> > this limits the SQ usage, then there are guaranteed 3x as many SQEs as
> > SEND buffers to accommodate the REG_MR and INVALIDATE WRs??
>
> RPC/RDMA provides flow control via credits. The peers agree on
> a maximum number of concurrent outstanding RPC requests.
> Typically that is 32, though implementations are increasing that
> default to 128.
>
> There's a comment in frwr_op_open that explains how we calculate
> the maximum number of send queue entries for each credit.

That is what it looked like.. So, this scheme is dangerous too.

The RDMA spec doesn't guarantee that the receive side and the send
side run in lock step - that is to say, even though you got a RECV
indicating a SEND was executed, that doesn't mean SQ or SCQ space has
been made available to post a new SEND. This is primarily because
SQ/SCQ space is not freed until the ACK is processed, and the behavior
of ACKs is not simple, particularly if a compound ACK is delivered
concurrently with the reply's SEND.

So, let's say you completely fill the SQ with SENDs, and all the sends
get dumped on the wire. The SQ is still full. The far side returns a
SEND+ACK combo packet for the first send (either together or close in
time), and the local HCA+system now runs two concurrent tracks: the
first updates the SQ/SCQ space to reflect the ACK, the second delivers
the SEND as a receive. This is a classic race: if the CPU sees the
SEND's recv and tries to post before it sees the ACK side, the QP
blows up as full.

I've actually hit this - under very heavy loading (ie a continuously
full SQ), a similar scheme that relied on the RCQ to track SQ/SCQ
usage would very rarely blow up with a full SQ/SCQ because of the
above lack of synchronicity.

The only absolutely correct way to run the RDMA stack is to keep track
of SQ/SCQ space directly, and to only update that tracking by
processing SCQEs.
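Sketched out, the direct accounting is not much code. Again this is
only an illustration, not lifted from any ULP, and it assumes every
SQE-consuming WR is signaled:

	#include <linux/types.h>
	#include <linux/atomic.h>

	/* One per QP; free_sqes starts at the real SQ depth
	 * (init_attr.cap.max_send_wr). */
	struct sq_space {
		atomic_t free_sqes;
	};

	/* Reserve before building any WR chain - SEND, REG_MR and
	 * LOCAL_INV each cost one SQE. Caller defers and retries if
	 * this fails. */
	static bool sq_reserve(struct sq_space *sq, int nwrs)
	{
		if (atomic_sub_return(nwrs, &sq->free_sqes) >= 0)
			return true;
		atomic_add(nwrs, &sq->free_sqes);
		return false;
	}

	/* Return space *only* from the send CQ handler, once per polled
	 * SCQE - never from the receive path, for the reasons above. */
	static void sq_release(struct sq_space *sq, int nwrs)
	{
		atomic_add(nwrs, &sq->free_sqes);
	}

Unsignalled WRs complicate this a little - the whole unsignalled run
gets released when the signaled SCQE that follows it completes - but
the principle is the same.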
For us, the sucky part is that any new API we build will have to be
based on 'the correct way to use RDMA' (eg invalidate used as
security, direct SQ/SCQ tracking, etc); retrofitting may be difficult
on ULPs that are taking shortcuts.

:( :(

Jason