From: Kadayam, Hari <hkadayam at ebay.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] NBD with SPDK
Date: Wed, 14 Aug 2019 17:05:40 +0000	[thread overview]
Message-ID: <819E98D6-6355-41BC-9B6C-B92EB4B1475B@ebay.com> (raw)
In-Reply-To: 742b180c299ad6847e6efa33bad8c62e3dd2c7ac.camel@intel.com


Hi Ben,

I agree we need to profile this and improve wherever we are seeing bottlenecks. The possible improvements you suggested are certainly very useful to look into and a good place to start. Having said that, for large writes, won't the memcpy add latency and cost more CPU?

Regarding your comment:
> We can't safely share the page cache buffers with a user space process.

The thought process here is that the driver, in the SPDK thread context, does a remap using something like phys_to_virt() or mmaps them, which means the page cache buffer(s) can be accessed from the user space process. Of course, we have concerns too regarding the safety of a user space process accessing the page cache.
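
To make that concrete, here is a very rough sketch of what the kernel side could look like. This is purely illustrative, nothing we have actually written; ebay_blk_io and its fields are placeholder names. The idea is that the driver's mmap handler inserts the pinned bio/page cache pages for an IO into the virtual range SPDK reserved at startup:

#include <linux/fs.h>
#include <linux/mm.h>

/* Placeholder per-IO state -- hypothetical, for illustration only. */
struct ebay_blk_io {
    struct page **pages;      /* pinned pages backing this IO */
    unsigned int nr_pages;
};

static int ebay_blk_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct ebay_blk_io *io = filp->private_data;
    unsigned long addr = vma->vm_start;
    unsigned int i;
    int ret;

    for (i = 0; i < io->nr_pages; i++) {
        /* Insert each pinned page into the SPDK process's reserved range. */
        ret = vm_insert_page(vma, addr + i * PAGE_SIZE, io->pages[i]);
        if (ret)
            return ret;
    }
    return 0;
}

Whether exposing page cache pages to a user process this way can ever be made safe is exactly the concern above.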

Regards,
Hari

On 8/14/19, 9:19 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

    On Wed, 2019-08-14 at 14:28 +0000, Luse, Paul E wrote:
    > So I think there's still a feeling amongst most involved in the discussion
    > that eliminating the memcpy is likely not worth it, especially without
    > profiling data to prove it.  Ben and I were talking about some other much
    > simpler things that might be worth experimenting with. One example would be
    > in spdk_nbd_io_recv_internal(): look at how spdk_malloc() is called for every
    > IO.  Creating a pre-allocated pool and pulling from there would be a quick
    > change and may yield some positive results. Again though, profiling will
    > actually tell you where the most time is being spent and where the best bang
    > for your buck is in terms of making changes.
    > 
    > Thx
    > Paul
    > 
    > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
    > 
    > Back end device is malloc0, which is a memory device running in the “vhost”
    > application address space.  It is not over NVMe-oF.
    > 
    > I guess that the bio pages are already pinned because the same buffers are
    > sent to lower layers to do DMA.  Let's say we have written a lightweight ebay
    > block driver in the kernel. This would be the flow:
    > 
    > 1.  SPDK reserves the virtual space and passes it to the ebay block driver to
    > mmap. This step happens once during startup. 
    > 2.  For every IO, the ebay block driver maps the buffers into that virtual
    > memory and passes the IO information to SPDK through shared queues.
    > 3.  SPDK reads it from the shared queue and passes the same virtual address to
    > do RDMA.
    
    When an I/O is performed in the process initiating the I/O to a file, the data
    goes into the OS page cache buffers at a layer far above the bio stack
    (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
    your kernel driver, your kernel driver would still need to copy it to that
    location out of the page cache buffers. We can't safely share the page cache
    buffers with a user space process.
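
    Just to make that concrete, the copy your kernel driver would still end up
    doing looks roughly like this (a sketch only, not code that exists anywhere;
    "dst" stands for whatever SPDK-reserved buffer the driver mapped):
    
    #include <linux/bio.h>
    #include <linux/highmem.h>
    #include <linux/string.h>
    
    /* Walk the bio and copy each segment out of the page cache pages into dst. */
    static void copy_bio_to_spdk_buf(struct bio *bio, char *dst)
    {
        struct bio_vec bvec;
        struct bvec_iter iter;
    
        bio_for_each_segment(bvec, bio, iter) {
            void *src = kmap_atomic(bvec.bv_page);
    
            memcpy(dst, src + bvec.bv_offset, bvec.bv_len);
            kunmap_atomic(src);
            dst += bvec.bv_len;
        }
    }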
    
    As Paul said, I'm skeptical that the memcpy is significant in the overall
    performance you're measuring. I encourage you to go look at some profiling data
    and confirm that the memcpy is really showing up. I suspect the overhead is
    instead primarily in these spots:
    
    1) Dynamic buffer allocation in the SPDK NBD backend.
    
    As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
    The NBD backend wasn't designed to be fast - it was designed to be simple.
    Pooling would be a lot faster and is something fairly easy to implement.
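
    As a rough sketch (made-up names and sizes, not an actual patch), the per-IO
    allocation could be replaced with an spdk_mempool along these lines:
    
    #include <errno.h>
    #include "spdk/env.h"
    
    /* Illustrative sizes only. */
    #define NBD_IO_POOL_SIZE 512
    #define NBD_IO_BUF_SIZE  (128 * 1024)
    
    static struct spdk_mempool *g_nbd_buf_pool;
    
    /* Create the pool once at startup instead of allocating per IO. */
    static int nbd_buf_pool_init(void)
    {
        g_nbd_buf_pool = spdk_mempool_create("nbd_io_bufs", NBD_IO_POOL_SIZE,
                                             NBD_IO_BUF_SIZE,
                                             SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                             SPDK_ENV_SOCKET_ID_ANY);
        return g_nbd_buf_pool ? 0 : -ENOMEM;
    }
    
    /* In the IO path, these replace the spdk_malloc()/spdk_free() pair. */
    static void *nbd_buf_get(void)    { return spdk_mempool_get(g_nbd_buf_pool); }
    static void  nbd_buf_put(void *b) { spdk_mempool_put(g_nbd_buf_pool, b); }
    
    The mempool's per-core cache should make get/put very cheap compared to
    allocating on every request.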
    
    2) The way SPDK does the syscalls when it implements the NBD backend.
    
    Again, the code was designed to be simple, not high performance. It simply calls
    read() and write() on the socket for each command. There are much higher
    performance ways of doing this, they're just more complex to implement.
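
    For example (again just a sketch, not the current code), the reply header and
    the read payload could at least go out in a single writev() instead of
    separate write() calls:
    
    #include <sys/uio.h>
    #include <linux/nbd.h>
    
    /* Send the NBD reply header and any payload with one syscall. */
    static ssize_t nbd_send_reply(int fd, struct nbd_reply *reply,
                                  void *payload, size_t payload_len)
    {
        struct iovec iov[2] = {
            { .iov_base = reply,   .iov_len = sizeof(*reply) },
            { .iov_base = payload, .iov_len = payload_len },
        };
    
        return writev(fd, iov, payload_len ? 2 : 1);
    }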
    
    3) The lack of multi-queue support in NBD
    
    Every I/O is funneled through a single sockpair up to user space. That means
    there is locking going on. I believe this is just a limitation of NBD today - it
    doesn't plug into the block-mq stuff in the kernel and expose multiple
    sockpairs. But someone more knowledgeable on the kernel stack would need to take
    a look.
    
    Thanks,
    Ben
    
    > 
    > A couple of things that I am not really sure about in this flow:
    > 1. How memory registration is going to work with the RDMA driver.
    > 2. What changes are required in SPDK memory management.
    > 
    > Thanks
    > Rishabh Mittal
    

