Thank you for the quick and detailed responses! It sounds like our use case is similar to John's. I'll keep an eye out for John's commit and familiarize myself with the other use cases Ben mentioned.

Thanks again,
Mike

On Thu, May 3, 2018 at 12:43 PM Walker, Benjamin <benjamin.walker@intel.com> wrote:
On Thu, 2018-05-03 at 19:11 +0000, Mikhail Altman wrote:
> Hello Everyone,
>
> On SPDK v18.01, we noticed there's a TODO in nvme_rdma_build_sgl_request() in
> nvme_rdma.c.
>
> Some code for context:
>
>     /* TODO: for now, we only support a single SGL entry */
>     rc = req->payload.u.sgl.next_sge_fn(req->payload.u.sgl.cb_arg, &virt_addr, &length);
>     if (rc) {
>             return -1;
>     }
>
>     if (length < req->payload_size) {
>             SPDK_ERRLOG("multi-element SGL currently not supported for RDMA\n");
>             return -1;
>     }
>
> Is there any ongoing discussion or work to implement support for multiple SGL
> entries? (I looked at the Trello board and GerritHub, but couldn't find
> anything related.) If not, we can look into making a patch for this on our
> end. Any thoughts about what this would entail are welcome!

Hi Mike,

John has been working in this area, and it's great that he'll have patches to take a
look at shortly. I just wanted to clarify a few things.

This isn't much of a limitation for the use cases we support today. The
initiator's buffers can already be scattered; it's only the target memory for a
single I/O that must be described by a single element. Since the RDMA NIC is
pulling the data over the network and placing it into the local target system's
memory anyway, it's simple enough to have the NIC gather the data into a single
contiguous memory region as it goes.
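
To make that concrete, here is a rough sketch in plain ibverbs terms (buf,
io_length, and mr are just placeholders here, not actual SPDK variables): the
whole payload for an I/O is handed to the NIC as a single SGE, so the gather
into one contiguous region happens as part of the transfer itself.

    /* Illustrative only; assumes <infiniband/verbs.h> and a buffer
     * registered earlier with ibv_reg_mr().
     */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,       /* single contiguous target buffer */
        .length = (uint32_t)io_length,  /* full payload size for this I/O */
        .lkey   = mr->lkey,             /* local key from registration */
    };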

That said, I can see at least a few use cases for this. One would be to change
the way the memory pool is allocated in the NVMe-oF target. Today, it allocates
4 full queue depths' worth of max-I/O-size buffers in a shared pool for all
connections to use. If we had full support for scatter-gather lists, we could
change this pool to contain an equivalent amount of memory as 4k buffers. Then
each I/O could pull a list of buffers instead of a single big one, and we'd end
up with better memory utilization. We already have the required
scatter-gather-aware APIs through the rest of the stack to make this happen.
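
As a very rough sketch of what that could look like (a hypothetical helper
written only for illustration; DATA_BUF_SIZE, get_io_buffers, and the iovec
handling don't exist in the target today), each request would pull however many
4k buffers it needs from the shared pool and describe them with an iovec:

    /* Hypothetical sketch, not existing SPDK code. */
    #include <sys/uio.h>       /* struct iovec */
    #include "spdk/env.h"      /* spdk_mempool_get/put */
    #include "spdk/util.h"     /* spdk_min */

    #define DATA_BUF_SIZE 4096

    static int
    get_io_buffers(struct spdk_mempool *pool, struct iovec *iov, int max_iov,
                   uint32_t io_length)
    {
        int i, n = (io_length + DATA_BUF_SIZE - 1) / DATA_BUF_SIZE;

        if (n > max_iov) {
            return -1;
        }
        for (i = 0; i < n; i++) {
            iov[i].iov_base = spdk_mempool_get(pool);
            if (iov[i].iov_base == NULL) {
                /* Pool exhausted; return what we already took. */
                while (i--) {
                    spdk_mempool_put(pool, iov[i].iov_base);
                }
                return -1;
            }
            iov[i].iov_len = spdk_min(io_length, (uint32_t)DATA_BUF_SIZE);
            io_length -= iov[i].iov_len;
        }
        return n;
    }

Each entry in the resulting iovec would then map to one element in the RDMA
scatter-gather list for the transfer.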

The other use case is one where we switch our model to use memory provided by
the backing bdev for the RDMA transfer instead of using a separate dedicated
pool allocated by the NVMe-oF target. That backing bdev may need to provide the
memory as a scatter-gather list for various reasons (this is John's use case).
This is the long-term direction for the NVMe-oF target.

In addition to enabling custom bdevs to provide scatter-gather lists for
whatever reason, this would also enable things like zero-copy transfers directly
to persistent memory or to a local NVMe SSD's controller memory buffer. That
effectively eliminates the single bounce we do today from RDMA NIC to host
memory to persistent storage device, and would probably shave an additional ~3
microseconds off the round-trip latency for these cases.
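
Purely as an illustration of the zero-copy idea (pd, pmem_base, and pmem_len
are assumed to come from the transport and the backing device; this isn't SPDK
code): if the persistent memory or CMB mapping is registered with the RDMA NIC,
the NIC can place incoming data straight into it and skip the bounce through
regular host memory.

    /* Illustrative only; assumes <infiniband/verbs.h>. */
    struct ibv_mr *mr;

    mr = ibv_reg_mr(pd, pmem_base, pmem_len,
                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL) {
        /* Registration failed; fall back to the bounce-buffer path. */
    }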

These are all cool projects that are worthy of time and effort. If you all are
willing to work in this area, please jump in!

>
> Thanks in advance,
> Mike
_______________________________________________
SPDK mailing list
SPDK@lists.01.org
https://lists.01.org/mailman/listinfo/spdk