From: Tom Talpey <tom@talpey.com>
To: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Chuck Lever <chuck.lever@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-rdma@vger.kernel.org, Steve French <smfrench@gmail.com>
Subject: Re: [PATCH v1 00/16] NFS/RDMA patches proposed for 4.1
Date: Tue, 05 May 2015 20:16:01 -0400
Message-ID: <55495D41.5090502@talpey.com>
In-Reply-To: <20150505223855.GA7696@obsidianresearch.com>

I'm adding Steve French because I'm about to talk about SMB.

On 5/5/2015 6:38 PM, Jason Gunthorpe wrote:
> On Tue, May 05, 2015 at 05:32:21PM -0400, Tom Talpey wrote:
>
>>> Do you have any information on these attempts and why they failed?  Note
>>> that the only interesting ones would be for in-kernel consumers.
>>> Userspace verbs are another order of magnitude more problems, so they're
>>> not too interesting.
>>
>> Hmm, most of these are userspace API experiences, and I would not be
>> so quick as to dismiss their applicability, or their lessons.
>
> The specific use-case of a RDMA to/from a logical linear region broken
> up into HW pages is incredibly kernel specific, and very friendly to
> hardware support.
>
> Heck, on modern systems 100% of these requirements can be solved just
> by using the IOMMU. No need for the HCA at all. (HCA may be more
> performant, of course)

I don't agree on "100%", because IOMMUs don't have the same protection
attributes as RDMA adapters (local R, local W, remote R, remote W). Also
they don't support handles for page lists quite like STags/RMRs, so they
require additional (R)DMA scatter/gather. But, I agree with your point
that they translate addresses just great.
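
To make the protection point concrete, here is a minimal sketch in C.
It is an illustration only, not code from any driver: the names
mr_access, mem_region, mr_allows, iommu_mapping and iommu_allows are
all hypothetical, and the IOMMU side is deliberately simplified to a
pair of read/write bits. The RDMA-style check depends on who initiated
the access as well as on its direction, while the simplified mapping
has no notion of an initiator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical access bits, modeled on the four attributes named above. */
enum mr_access {
    MR_LOCAL_READ   = 1u << 0,
    MR_LOCAL_WRITE  = 1u << 1,
    MR_REMOTE_READ  = 1u << 2,
    MR_REMOTE_WRITE = 1u << 3,
};

/* Who initiated the access: the local adapter or the remote peer. */
enum mr_origin { MR_FROM_LOCAL, MR_FROM_REMOTE };

/* A registered region: a handle plus the rights granted at registration. */
struct mem_region {
    uint32_t handle;
    unsigned int access;
};

/* RDMA-style check: the answer depends on origin and on direction. */
static bool mr_allows(const struct mem_region *mr, enum mr_origin who,
                      bool is_write)
{
    unsigned int need;

    assert(mr != NULL);
    if (who == MR_FROM_LOCAL)
        need = is_write ? MR_LOCAL_WRITE : MR_LOCAL_READ;
    else
        need = is_write ? MR_REMOTE_WRITE : MR_REMOTE_READ;
    return (mr->access & need) != 0;
}

/*
 * Simplified IOMMU-style mapping (an assumption made for contrast): it
 * records only whether the device may read or write the memory.
 */
struct iommu_mapping {
    bool dev_read;
    bool dev_write;
};

/* IOMMU-style check: direction only, no notion of local versus remote. */
static bool iommu_allows(const struct iommu_mapping *m, bool is_write)
{
    assert(m != NULL);
    return is_write ? m->dev_write : m->dev_read;
}
```

With a region registered as local R/W plus remote R, mr_allows()
rejects a remote write but accepts a local one. The simplified mapping
can only say "writable" or "not writable" for both cases.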

> This is a huge pain for everyone. E.g., the Lustre devs were talking
> about how Lustre is not performant on newer HCAs because their code
> doesn't support the new MR scheme.
>
> It makes sense to me to have a dedicated API for this work load:
>
> 'post outbound rdma send/write of page region'

A bunch of writes followed by a send is a common sequence, but not
very complex (I think).

> 'prepare inbound rdma write of page region'

This is memory registration, with remote writability. That's what
the rpcrdma_register_external() API in xprtrdma/verbs.c does. It
takes a private rpcrdma structure, but it supports multiple memreg
strategies and pretty much does what you expect. I'm sure someone
could abstract it upward.
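
As a rough sketch of what "abstracting it upward" could look like, the
C fragment below hides two registration strategies behind one small
table of operations. None of it comes from xprtrdma: memreg_ops,
region_handle, register_region, the two example strategies and the
fake handle values are all hypothetical, and the real code has to deal
with DMA mapping, completions and errors that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-strategy operations; not the xprtrdma definitions. */
struct memreg_ops {
    const char *name;
    size_t max_pages_per_handle;
    /* Returns a fake handle covering npages starting at first_page. */
    uint32_t (*map)(size_t first_page, size_t npages);
};

static uint32_t multi_page_map(size_t first_page, size_t npages)
{
    (void)npages;
    return 0x1000u + (uint32_t)first_page;   /* illustrative value only */
}

static uint32_t single_page_map(size_t first_page, size_t npages)
{
    (void)npages;
    return 0x2000u + (uint32_t)first_page;   /* illustrative value only */
}

static const struct memreg_ops multi_page_ops = {
    "multi-page", 16, multi_page_map
};
static const struct memreg_ops single_page_ops = {
    "single-page", 1, single_page_map
};

/* One handle the peer may use to write into part of the region. */
struct region_handle {
    uint32_t handle;
    size_t first_page;
    size_t npages;
};

/*
 * Prepare npages of a region for inbound RDMA Write using the given
 * strategy. Returns the number of handles produced, or -1 if out[]
 * is too small to describe the whole region.
 */
static int register_region(const struct memreg_ops *ops, size_t npages,
                           struct region_handle *out, size_t out_len)
{
    size_t done = 0;
    int n = 0;

    assert(ops != NULL && out != NULL);
    assert(ops->map != NULL && ops->max_pages_per_handle > 0);

    while (done < npages) {
        size_t chunk = npages - done;

        if (chunk > ops->max_pages_per_handle)
            chunk = ops->max_pages_per_handle;
        if ((size_t)n == out_len)
            return -1;
        out[n].handle = ops->map(done, chunk);
        out[n].first_page = done;
        out[n].npages = chunk;
        done += chunk;
        n++;
    }
    return n;
}
```

The caller passes either ops table and gets back a list of handles. It
never needs to know which strategy produced them, only how many there
are.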

> 'post rdma read, result into page region'

The svcrdma stuff in the NFS RDMA server has this; it's called from
the XDR decoding.

> 'complete X'

This is trickier - invalidation has many interesting error cases.
But, on a sunny day with the breeze at our backs, sure.

> I'd love to see someone propose some patches :)

I'd like to mention something else. Many upper layers basically want
a socket, but memory registration and explicit RDMA break that. There
have been some relatively awful solutions to make it all transparent;
let's not go there.

The RPC/RDMA protocol was designed to tuck underneath RPC and
XDR, so, while not socket-like, it allowed RPC to hide RDMA
from (for example) NFS. NFS therefore did not have to change.
I thought transparency was a good idea at the time.

SMB Direct, in Windows, presents a socket-like interface for messaging
(connection, send/receive, etc), but makes memory registration and
RDMA Read / Write explicit. It's the SMB3 protocol that drives RDMA,
which it does only for SMB_READ and SMB_WRITE. The SMB3 upper layer
knows it's on an RDMA-capable connection, and "sets up" the transfer
by explicitly deciding to do an RDMA, which it does by asking the
SMB Direct driver to register memory. It then gets back one or more
handles, which it sends to the server in the SMB3 layer message.
The server performs the RDMA, and the reply indicates the result.
After which, the SMB3 upper layer explicitly de-registers.
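
The sequence above (register, send the handle, the peer performs the
RDMA, reply, de-register) can be modeled in a few lines of C. This is
a toy simulation under stated assumptions, not the SMB Direct
interface: the transport struct, transport_register, peer_rdma_write
and transport_deregister are hypothetical names, and a memcpy stands
in for the adapter placing data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_REGS 4

/* Hypothetical registration table standing in for the adapter. */
struct registration {
    bool valid;
    uint32_t handle;
    void *buf;
    size_t len;
};

struct transport {
    struct registration regs[MAX_REGS];
    uint32_t next_handle;
};

/*
 * The upper layer asks the transport to register memory.
 * Returns a non-zero handle to place in the request, or 0 on failure.
 */
static uint32_t transport_register(struct transport *t, void *buf, size_t len)
{
    assert(t != NULL && buf != NULL);
    for (size_t i = 0; i < MAX_REGS; i++) {
        if (!t->regs[i].valid) {
            t->regs[i] = (struct registration){
                true, ++t->next_handle, buf, len
            };
            return t->regs[i].handle;
        }
    }
    return 0;
}

/*
 * The peer "performs the RDMA": data lands in the registered buffer
 * only if the handle is still valid and the length fits.
 * Returns 0 on success, -1 otherwise.
 */
static int peer_rdma_write(struct transport *t, uint32_t handle,
                           const void *src, size_t len)
{
    assert(t != NULL && src != NULL);
    for (size_t i = 0; i < MAX_REGS; i++) {
        struct registration *r = &t->regs[i];

        if (r->valid && r->handle == handle) {
            if (len > r->len)
                return -1;
            memcpy(r->buf, src, len);
            return 0;
        }
    }
    return -1;
}

/* After the reply arrives, the upper layer explicitly de-registers. */
static void transport_deregister(struct transport *t, uint32_t handle)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_REGS; i++) {
        if (t->regs[i].valid && t->regs[i].handle == handle)
            t->regs[i].valid = false;
    }
}
```

The point of the model is the lifetime: the handle is usable by the
peer only between the explicit register and de-register calls, and the
upper layer controls both.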

If Linux upper layers adopted a similar approach, carefully and
conditionally inserting RDMA operations, it could make the lower
layer's job much more efficient. And, efficiency is speed.
And in the end, the API throughout the stack would be simpler.

MHO.

Tom.

