From: Chuck Lever III <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Olga Kornievskaia <olga.kornievskaia@gmail.com>
Subject: Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
Date: Tue, 31 Aug 2021 14:33:03 +0000	[thread overview]
Message-ID: <AA0A4E98-482C-4276-B8C8-96AF08550320@oracle.com> (raw)
In-Reply-To: <B5C7A8A1-E810-4616-9E1A-265BADEC5432@oracle.com>


> On Aug 30, 2021, at 4:37 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>> 
>> On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
>>> 
>>>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust
>>>> <trondmy@hammerspace.com> wrote:
>>>> 
>>>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
>>>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
>>>>> <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> Hi Olga-
>>>>>> 
>>>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>> 
>>>>>>> From: Olga Kornievskaia <kolga@netapp.com>
>>>>>>> 
>>>>>>> Given the patch "Always provide aligned buffers to the RPC
>>>>>>> read layers", RPC over RDMA doesn't need to look at the tail
>>>>>>> page and add that space to the write chunk.
>>>>>>> 
>>>>>>> An RFC 8166-compliant server must not write XDR padding into
>>>>>>> the write chunk (even if space was provided). Historically
>>>>>>> (before RFC 8166) the Solaris RDMA server required the client
>>>>>>> to provide space for the XDR padding, and thus this client
>>>>>>> code has existed.
>>>>>> 
>>>>>> I don't understand this change.
>>>>>> 
>>>>>> So, the upper layer doesn't provide XDR padding any more. That
>>>>>> doesn't mean Solaris servers still aren't going to want to
>>>>>> write into it. The client still has to provide this padding
>>>>>> from somewhere.
>>>>>> 
>>>>>> This suggests that "Always provide aligned buffers to the RPC
>>>>>> read layers" breaks our interop with Solaris servers. Does it?
>>>>> 
>>>>> No, I don't believe "Always provide aligned buffers to the RPC
>>>>> read layers" breaks interoperability. THIS patch would break the
>>>>> interop.
>>>>> 
>>>>> If we are not willing to break interoperability and support only
>>>>> servers that comply with RFC 8166, this patch is not needed.
>>>> 
>>>> Why? The intention of the first patch is to ensure that we do
>>>> not have buffers that are not word aligned. If Solaris wants to
>>>> write padding after the end of the file, then there is space in
>>>> the page buffer for it to do so. There should be no need for an
>>>> extra tail in which to write the padding.
>>> 
>>> The RPC/RDMA protocol is designed for hardware-offloaded direct data
>>> placement. That means the padding, which isn't data, must be directed
>>> to another buffer.
>>> 
>>> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
>>> ambiguous, so there are implementations that write XDR padding into
>>> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
>>> NOT.
>>> 
>>> I believe rpcrdma-version-two makes it a requirement not to use XDR
>>> padding in either Read or Write data payload chunks.
>>> 
>>> 
>> Correct, but in order to satisfy the needs of the Solaris server,
>> you've hijacked the tail for use as a data buffer. AFAICS it is not
>> being used as a SEND buffer target, but is instead being turned into a
>> write chunk target. That is not acceptable!
> 
> The buffer is being used as both. Proper function depends on the
> order of RDMA Writes and Receives on the client.

I've refreshed my memory. The above statement oversimplifies
a bit.

- The memory pointed to by rqst->rq_buffer is persistently
  registered to allow RDMA Send from it. It can also be
  registered on the fly for use in Read chunks.

- The memory pointed to by rqst->rq_rbuffer is NOT
  persistently registered and is never used for RDMA Send
  or Receive. It can be registered on the fly for use in a
  Write or Reply chunk. This is when the tail portion of
  rq_rcv_buf might be used to receive a pad.

- A third set of persistently registered buffers is used
  ONLY to catch RDMA Receive completions.

When a Receive completes, the reply handler decides how to
construct the received RPC message. If there was a Reply
chunk, it leaves rq_rcv_buf pointing to rq_rbuffer, as that's
where the server wrote the Reply chunk. If there wasn't a
Reply chunk, the handler points rq_rcv_buf to the RDMA
Receive buffer.
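
To make that concrete, here's a rough sketch of the decision the
reply handler makes. The function and parameter names below are
made up for illustration; this is not the actual rpc_rdma.c code:

/* Sketch only: where does rq_rcv_buf point once a Receive completes? */
static void sketch_fixup_reply(struct rpc_rqst *rqst, bool used_reply_chunk,
			       void *recv_buf, size_t recv_len)
{
	struct xdr_buf *rcvbuf = &rqst->rq_rcv_buf;

	if (used_reply_chunk) {
		/* The server RDMA Wrote the whole reply into
		 * rq_rbuffer, so leave rq_rcv_buf pointing there. */
		rcvbuf->head[0].iov_base = rqst->rq_rbuffer;
	} else {
		/* The reply arrived inline in the RDMA Receive
		 * buffer, so point rq_rcv_buf at that instead. */
		rcvbuf->head[0].iov_base = recv_buf;
		rcvbuf->head[0].iov_len = recv_len;
	}
}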

The only case where there might be overlap is when the client
has provisioned both a Write chunk and a Reply chunk. In that
case, yes, I can see that there might be an opportunity for
the server to RDMA Write the Write chunk's pad and the Reply
chunk into the same client memory buffer. However:

- Most servers don't write a pad. Modern Solaris servers
  behave "correctly" in this regard, I'm now told. Linux
  never writes the pad, and I'm told OnTAP also does not.

- Server implementations I'm aware of Write the Write chunk
  first and then the Reply chunk. The Reply chunk would
  overwrite the pad.

- The client hardly ever uses Write + Reply. It's most
  often one or the other.

So again, there is little to zero chance there is an existing
operational issue here. But see below.
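
As a reminder of how little data is at stake: XDR rounds every data
item up to a 4-byte boundary, so the pad for a payload of len bytes
is at most 3 bytes. In plain C (illustration only, not kernel code):

	/* 0..3 bytes of XDR padding follow a payload of len bytes */
	static inline unsigned int xdr_pad_bytes(size_t len)
	{
		return (4 - (len & 3)) & 3;
	}

A 4096-byte READ needs no pad; a 4093-byte READ needs 3 pad bytes.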


> rpcrdma_encode_write_list() registers up to an extra 3 bytes in
> rq_rcv_buf.tail as part of the Write chunk. The R_keys for the
> segments in the Write chunk are then sent to the server as part
> of the RPC Call.
> 
> As part of Replying, the server RDMA Writes data into the chunk,
> and possibly also RDMA Writes padding. It then does an RDMA Send
> containing the RPC Reply.
> 
> The Send data always lands in the Receive buffer _after_ the Write
> data. The Receive completion guarantees that previous RDMA Writes
> are complete. Receive handling invalidates and unmaps the memory,
> and then it is made available to the RPC client.
> 
> 
>> It means that we are now limited to creating COMPOUNDs where no
>> operations follow the READ op, because if we do add any, we end up
>> with a situation where the RDMA behaviour breaks.
> 
> I haven't heard reports of a problem like this.
> 
> However, if there is a problem, it would be simple to create a
> persistently-registered memory region that is not part of any RPC
> buffer that can be used to catch unused Write chunk XDR padding.
> 
> 
>>>> This means that the RDMA and TCP cases should end up doing the
>>>> same thing for the case of the Solaris server: the padding is
>>>> written into the page buffer. There is nothing written to the
>>>> tail in either case.
>>> 
>>> "Always provide" can guarantee that the NFS client makes aligned
>>> requests for buffered I/O, but what about NFS direct I/O from user
>>> space? The NIC will place the data payload in the application
>>> buffer, but there's no guarantee that the NFS READ request will be
>>> aligned or that the buffer will be able to sink the extra padding
>>> bytes.
>>> 
>>> We would definitely consider it an error if an unaligned RDMA Read
>>> leaked the link-layer's 4-byte padding into a sink buffer.
>>> 
>>> So, "Always provide" is nice for the in-kernel NFS client, but I
>>> don't believe it allows the way xprtrdma behaves to be changed.
>>> 
>> 
>> If you're doing an unaligned READ from user space then you are already
>> in a situation where you're doing something that is incompatible with
>> block device requirements.
>> If there really are any applications that contain O_DIRECT code
>> specifically for use with NFS, then we can artificially force the
>> buffers to be aligned by reducing the size of the buffer to align to a
>> 4 byte boundary. NFS supports returning short reads.
> 
> Or xprtrdma can use the scheme I describe above. I think that
> would be simpler and would avoid layering violations.
> 
> That would also possibly address the Nvidia problem, since a
> pre-registered MR that handles Write padding would always be a
> separate RDMA segment.
> 
> Again, I doubt this is necessary to fix any operational problem
> with _supported_ use cases, but let me know if you'd like me to
> make this change.

Since RFC 8166 says "SHOULD NOT" rather than "MUST NOT", the client
has to expect that some server might RDMA Write the XDR pad, so the
Linux client really should always provide one. Having a persistently
registered spot to use for that gives us a potentially no-cost
mechanism for providing that extra segment. The whole "pad
optimization" thing exists because the pad segment is currently
registered on the fly, so it has a per-I/O cost.

So I'm leaning toward implementing this simple solution, at least
as proof-of-concept.
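
Roughly what I have in mind -- names below are hypothetical and this
is only a sketch, not a finished patch:

/* One scratch region per transport, registered once at connect
 * time, large enough to catch the biggest possible XDR pad. */
struct rpcrdma_pad_seg {
	u32	handle;		/* R_key from the one-time registration */
	u64	offset;
	u8	bytes[4];	/* sink for up to 3 pad bytes */
};

/* Sketch: while encoding a Write chunk, append the scratch region
 * as an extra segment only when the payload is not 4-byte aligned,
 * instead of registering rq_rcv_buf.tail on the fly for each I/O. */
static bool write_chunk_needs_pad_segment(size_t payload_len)
{
	return (payload_len & 3) != 0;
}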


--
Chuck Lever





Thread overview: 17+ messages
2021-08-30 16:53 [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
2021-08-30 16:53 ` [RFC 1/2] xprtrdma: xdr pad optimization revisted again Olga Kornievskaia
2021-08-30 17:04   ` Chuck Lever III
2021-08-30 17:24     ` Olga Kornievskaia
2021-08-30 17:34       ` Trond Myklebust
2021-08-30 18:02         ` Chuck Lever III
2021-08-30 18:18           ` Trond Myklebust
2021-08-30 20:37             ` Chuck Lever III
2021-08-31 14:33               ` Chuck Lever III [this message]
2021-08-31 15:58                 ` Olga Kornievskaia
2021-08-31 16:11                   ` Chuck Lever III
2021-08-31 15:54               ` Olga Kornievskaia
2021-08-31 16:02                 ` Chuck Lever III
2021-08-30 17:35       ` Chuck Lever III
2021-08-30 16:53 ` [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize Olga Kornievskaia
2021-08-30 16:57   ` Chuck Lever III
2021-08-30 16:55 ` [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
