linux-nfs.vger.kernel.org archive mirror
From: "Mora, Jorge" <Jorge.Mora@netapp.com>
To: Chuck Lever <chuck.lever@oracle.com>, Olga Kornievskaia <aglo@umich.edu>
Cc: linux-rdma <linux-rdma@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR)
Date: Mon, 19 Nov 2018 21:42:56 +0000
Message-ID: <5EA42399-05C2-40D3-A5CA-7B40971AEC33@netapp.com>
In-Reply-To: <4EE34B64-0BEB-439A-B2A2-D77673D4CF70@oracle.com>

Hello Chuck,

I am confused. Isn't the whole purpose of RDMA to place the data directly into the memory location given by the virtual address or offset? Are you saying that this offset is not the actual memory address, and so the driver must map this offset to the actual address?


--Jorge

On 11/19/18, 2:33 PM, "linux-nfs-owner@vger.kernel.org on behalf of Chuck Lever" <linux-nfs-owner@vger.kernel.org on behalf of chuck.lever@oracle.com> wrote:

    > On Nov 19, 2018, at 4:22 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
    >
    > On Mon, Nov 19, 2018 at 1:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
    >>
    >>
    >>
    >>> On Nov 19, 2018, at 1:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
    >>>
    >>> On Mon, Nov 19, 2018 at 1:19 PM Chuck Lever <chuck.lever@oracle.com> wrote:
    >>>>
    >>>>
    >>>>
    >>>>> On Nov 19, 2018, at 1:08 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
    >>>>>
    >>>>> On Mon, Nov 19, 2018 at 12:59 PM Chuck Lever <chuck.lever@oracle.com> wrote:
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>> On Nov 19, 2018, at 12:47 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
    >>>>>>>
    >>>>>>> On Mon, Nov 19, 2018 at 10:46 AM Chuck Lever <chuck.lever@oracle.com> wrote:
    >>>>>>>>
    >>>>>>>> Place the associated RPC transaction's XID in the upper 32 bits of
    >>>>>>>> each RDMA segment's rdma_offset field. These bits are currently
    >>>>>>>> always zero.
    >>>>>>>>
    >>>>>>>> There are two reasons to do this:
    >>>>>>>>
    >>>>>>>> - The R_key only has 8 bits that are different from registration to
    >>>>>>>> registration. The XID adds more uniqueness to each RDMA segment to
    >>>>>>>> reduce the likelihood of a software bug on the server reading from
    >>>>>>>> or writing into memory it's not supposed to.
    >>>>>>>>
    >>>>>>>> - On-the-wire RDMA Read and Write operations do not otherwise carry
    >>>>>>>> any identifier that matches them up to an RPC. The XID in the
    >>>>>>>> upper 32 bits will act as an eye-catcher in network captures.
    >>>>>>>
    >>>>>>> Is this just an "eye-catcher" or do you have plans to use it in
    >>>>>>> wireshark? If the latter, then can we really do that? While a Linux
    >>>>>>> implementation may do that, other (or possibly even future Linux)
    >>>>>>> implementations might not. Can we justify changing the
    >>>>>>> wireshark logic for it?
    >>>>>>
    >>>>>> No plans to change the wireshark RPC-over-RDMA dissector.
    >>>>>> That would only be a valid thing to do if adding the XID
    >>>>>> were made part of the RPC-over-RDMA protocol via an RFC.
    >>>>>
    >>>>> Agreed. Can you also help me understand the proposal (as I'm still
    >>>>> trying to figure why it is useful).
    >>>>>
    >>>>> You are proposing to modify the RDMA segment's RDMA offset field (I
    >>>>> see the top 6 bits are indeed always 0). I don't see how adding that
    >>>>> helps an RDMA read/write message, which does not have an "offset" field
    >>>>> in it, be matched to a particular RPC. I don't believe we have (or had)
    >>>>> any issues matching the initial RC Send Only that contains the RDMA_MSG
    >>>>> to the RPC.
    >>>>
    >>>> The ULP has access to only the low order 8 bits of the R_key. The
    >>>> upper 24 bits are fixed for each MR. So for any given MR, there are
    >>>> only 256 unique R_key values. That means the same R_key will appear
    >>>> again quickly on the wire.
    >>>>
    >>>> The 64-bit offset field is set by the ULP, and can be essentially
    >>>> any arbitrary value. Most kernel ULPs use the iova of the registered
    >>>> memory. We only need the lower 32 bits for that.
    >>>>
    >>>> The purpose of adding junk to the offset is to make the offset
    >>>> unique to that RPC transaction, just like the R_key is. This helps
    >>>> make the RDMA segment co-ordinates (handle, length, offset) more
    >>>> unique and thus harder to spoof.
    >>>
    >>> Thank you for the explanation; that makes sense.
    >>>
    >>>> We could use random numbers in that upper 32 bits, but we have
    >>>> something more handy: the RPC's XID.
    >>>>
    >>>> Now when you look at an RDMA Read or Write, the top 32 bits in each
    >>>> RDMA segment's offset match the XID of the RPC transaction that the
    >>>> RDMA operations go with. This is really a secondary benefit to the
    >>>> uniquifying effect above.
    >>>
    >>> I find the wording "on the wire RDMA read or write" misleading. Did
    >>> you really mean "RDMA read or write", or do you mean "RDMA_MSG",
    >>> or do you mean "NFS RDMA read or write"? Because the RDMA offset is not
    >>> part of the RDMA read/write (first/middle/last) packet. That's what
    >>> I'm hung up on.
    >>
    >> Here's an RDMA Read request in a network capture I had at hand:
    >>
    >> No.     Time               Source                Destination           Protocol Length Info
    >>    228 22:31:06.203637    LID: 5                LID: 11               InfiniBand 42     RC RDMA Read Request QP=0x000240
    >>
    >> Frame 228: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 1
    >> Extensible Record Format
    >> InfiniBand
    >>    Local Route Header
    >>    Base Transport Header
    >>    RETH - RDMA Extended Transport Header
    >>        Virtual Address: 11104011393315758080   <<<<<<
    >>        Remote Key: 1879114618
    >>        DMA Length: 4015
    >>    Invariant CRC: 0xd492a3e1
    >>    Variant CRC: 0x8736
    >>
    >> The value of the Virtual Address field is what the RPC-over-RDMA
    >> protocol calls the Offset. The Read responses are matched to this
    >> request by their message sequence numbers, and this Read request is
    >> matched to the RPC Call by the XID in the top 32 bits of the
    >> Virtual Address.
    >>
    >> Likewise for an RDMA Write Only request:
    >>
    >>    188 22:31:06.201350    LID: 5                LID: 11               InfiniBand 162    RC RDMA Write Only QP=0x000240
    >>
    >> Frame 188: 162 bytes on wire (1296 bits), 162 bytes captured (1296 bits) on interface 1
    >> Extensible Record Format
    >> InfiniBand
    >>    Local Route Header
    >>    Base Transport Header
    >>    RETH - RDMA Extended Transport Header
    >>        Virtual Address: 10455493047213809920   <<<<<<
    >>        Remote Key: 1879115386
    >>        DMA Length: 120
    >>    Invariant CRC: 0xe2e1b2cd
    >>    Variant CRC: 0x676f
    >> Data (120 bytes)
    >>
    >> 0000  91 19 5f 87 00 00 00 01 00 00 00 00 00 00 00 00   .._.............
    >> 0010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01   ................
    >> 0020  00 00 00 02 00 00 01 ed 00 00 00 02 00 00 04 16   ................
    >> 0030  00 00 00 64 00 00 00 00 00 00 00 28 00 00 00 00   ...d.......(....
    >> 0040  00 00 00 00 00 00 00 00 00 00 00 00 54 9e e5 9d   ............T...
    >> 0050  c6 d3 1a d2 00 00 00 00 00 00 63 1f 5b f2 2e 7a   ..........c.[..z
    >> 0060  0b ae f9 ec 5b f2 2e 7a 0b 53 6c ef 5b f2 2e 7a   ....[..z.Sl.[..z
    >> 0070  0b 53 6c ef 00 00 00 0c                           .Sl.....
    >>
    >>
    >> I believe RDMA Write First also has an RETH. The sender does not
    >> interleave RDMA Writes, so subsequent Middle and Last packets go
    >> with this RDMA Write First.
    >
    > Ok, I see now where I was confused: in the RDMA_MSG, wireshark labels
    > it "RDMA offset", and in the RDMA Write First message it's
    > labeled "Virtual address". Thank you for the explanation.
    >
    > Here's the next question (coming from Jorge): is it reasonable to
    > assume that the top 32 bits are always zero? I have a network trace (from
    > a 4.18-rc2 kernel) where they are not.
    >
    > RPC over RDMA
    >    XID: 0xa347cfa2
    >    Version: 1
    >    Flow Control: 128
    >    Message Type: RDMA_MSG (0)
    >    Read list (count: 0)
    >    Write list (count: 1)
    >        Write chunk (1 segment)
    >            Write chunk segment count: 1
    >            RDMA segment 0
    >                RDMA handle: 0x4000076f
    >                RDMA length: 65676
    >                RDMA offset: 0x0000001049973000
    >
    > I don't believe 0xa347cfa2 can fit?
    
    I've been told by three independent RDMA experts that using the top
    32 bits of the offset, even if they are not zero, should be OK to do.
    If it doesn't work, it's a device driver bug that needs to be fixed.
    
    
    --
    Chuck Lever
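
To make the "XID in the top 32 bits of the Virtual Address" claim concrete, here is a minimal sketch that splits the two RETH Virtual Address values from the capture quoted above into their upper and lower 32-bit halves. The helper name split_offset is hypothetical and only for illustration; it is not part of xprtrdma or wireshark.

```python
# Virtual Address values copied from the two RETH headers in the quoted capture.
READ_REQUEST_VA = 11104011393315758080
WRITE_ONLY_VA = 10455493047213809920


def split_offset(va):
    """Return (upper 32 bits, lower 32 bits) of a 64-bit RDMA offset.

    Hypothetical helper for illustration only.
    """
    return (va >> 32) & 0xFFFFFFFF, va & 0xFFFFFFFF


for name, va in (("Read Request", READ_REQUEST_VA), ("Write Only", WRITE_ONLY_VA)):
    upper, lower = split_offset(va)
    print(f"{name}: upper=0x{upper:08x} lower=0x{lower:08x}")
```

For the Write Only frame the upper half is 0x91195f87, which is the same as the first four bytes of the 120-byte payload in the hex dump (91 19 5f 87), so the segment can be tied back to its RPC by eye.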
    
    
    
    
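Olga's trace raises the question of what happens when the upper 32 bits of the offset are already in use. The sketch below is an assumption about the arithmetic only, not the code from the patch: it overwrites the upper half of the offset with the XID and keeps the lower half. The function name plant_xid is hypothetical.

```python
LOW_MASK = 0xFFFFFFFF


def plant_xid(offset, xid):
    """Overwrite the upper 32 bits of a 64-bit offset with a 32-bit XID.

    Hypothetical helper; the lower 32 bits of the offset are preserved,
    whatever was in the upper 32 bits is discarded.
    """
    return ((xid & LOW_MASK) << 32) | (offset & LOW_MASK)


# Values from the RPC-over-RDMA header in the quoted trace.
xid = 0xA347CFA2
offset = 0x0000001049973000

print(hex(offset >> 32))            # upper half already non-zero: 0x10
print(hex(plant_xid(offset, xid)))  # 0xa347cfa249973000
```

In this trace the original upper half (0x10) would be lost, so the offset on the wire is no longer the full 64-bit address. That is the case Chuck's reply covers: the offset is treated as an arbitrary value chosen by the ULP, and a device that cannot cope with it is considered to have a driver bug.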


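The point about the R_key having only 256 unique values per MR can be sketched the same way. This assumes the ULP simply increments the low-order byte on each registration; the helper bump_rkey is hypothetical and the increment-by-one policy is an assumption, not something stated in the thread.

```python
def bump_rkey(rkey):
    """Advance only the low-order 8 bits of a 32-bit R_key, wrapping at 256.

    Hypothetical helper; the upper 24 bits stay fixed for the MR.
    """
    key = (rkey + 1) & 0xFF
    return (rkey & 0xFFFFFF00) | key


rkey = 0x4000076F  # RDMA handle from the quoted trace
seen = set()
while rkey not in seen:
    seen.add(rkey)
    rkey = bump_rkey(rkey)

print(len(seen))  # 256
```

After 256 registrations of the same MR the R_key repeats, which is why extra per-RPC bits in the offset make each (handle, length, offset) triple more distinctive.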
Thread overview: 31+ messages
2018-11-19 15:45 [PATCH v1 0/4] NFS/RDMA client for v4.21 (part 1) Chuck Lever
2018-11-19 15:45 ` [PATCH v1 1/4] xprtrdma: Remove support for FMR memory registration Chuck Lever
2018-11-19 16:16   ` Bart Van Assche
2018-11-19 19:09     ` Leon Romanovsky
2018-11-19 20:52       ` Bart Van Assche
2018-11-20  5:37         ` Leon Romanovsky
2018-11-19 22:41     ` Jason Gunthorpe
2018-11-19 22:56       ` Chuck Lever
2018-11-19 23:10         ` Jason Gunthorpe
2018-11-20 15:22       ` Dennis Dalessandro
2018-11-19 15:45 ` [PATCH v1 2/4] xprtrdma: mrs_create off-by-one Chuck Lever
2018-11-19 15:46 ` [PATCH v1 3/4] xprtrdma: Reduce max_frwr_depth Chuck Lever
2018-11-19 15:46 ` [PATCH v1 4/4] xprtrdma: Plant XID in on-the-wire RDMA offset (FRWR) Chuck Lever
2018-11-19 17:47   ` Olga Kornievskaia
2018-11-19 17:58     ` Chuck Lever
2018-11-19 18:08       ` Olga Kornievskaia
2018-11-19 18:18         ` Chuck Lever
2018-11-19 18:47           ` Olga Kornievskaia
2018-11-19 18:58             ` Chuck Lever
2018-11-19 21:22               ` Olga Kornievskaia
2018-11-19 21:32                 ` Chuck Lever
2018-11-19 21:42                   ` Mora, Jorge [this message]
2018-11-19 22:46                     ` Jason Gunthorpe
2018-11-20  2:45                       ` Tom Talpey
2018-11-20  3:09                         ` Jason Gunthorpe
2018-11-20  3:25                           ` Tom Talpey
2018-11-20  3:32                             ` Jason Gunthorpe
2018-11-20  3:38                               ` Tom Talpey
2018-11-20 18:02   ` Anna Schumaker
2018-11-20 18:07     ` Chuck Lever
     [not found]       ` <94ff7ec712e086bfdd9c217a5f97c293a07151b9.camel@gmail.com>
2018-11-20 21:31         ` Chuck Lever
