Thanks Brian, great write-up, and definitely no need for anything formalized :) I'm going to defer to the various vhost experts in the community though...

Thx
Paul

-----Original Message-----
From: Szmyd, Brian [mailto:bszmyd(a)ebay.com]
Sent: Thursday, September 5, 2019 12:48 PM
To: Mittal, Rishabh ; Luse, Paul E ; Storage Performance Development Kit ; Walker, Benjamin ; Harris, James R
Cc: Chen, Xiaoxi ; Kadayam, Hari
Subject: Re: [SPDK] NBD with SPDK

Hi Paul,

Rather than put the effort into a formalized document, here is a brief description of the solution I have been investigating, just to get an opinion on feasibility or even workability.

Some background and a reiteration of the problem to set things up. I apologize for reiterating anything and including details that some may already know.

We are looking for a solution that allows us to write a custom bdev for the SPDK bdev layer that distributes I/O between different NVMe-oF targets that we have attached, and then present that to our application as either a raw block device or a filesystem mountpoint.

This is normally (as I understand it) done by exposing a device via QEMU to a VM using the vhost target. This SPDK target has implemented the virtio-scsi (among others) device according to this spec:

https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021

The VM kernel then uses a virtio-scsi module to attach said device into its SCSI mid-layer and have the device enumerated as a /dev/sd[a-z]+ device.

The problem is that QEMU virtualizes a PCIe bus for the guest kernel's virtio-pci driver to discover the virtio devices and bind them to the virtio-scsi driver. There really is no other way (other than platform MMIO type devices) to attach a device to the virtio-scsi driver.

SPDK exposes the virtio device to the VM via QEMU, which has written a "user space" version of the vhost bus. This driver then translates that API into the virtio-pci specification:

https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst

This uses an eventfd descriptor for interrupting the non-polling side of the queue and a UNIX domain socket to set up (and control) the shared memory which contains the I/O buffers and virtio queues. This is documented in SPDK's own documentation and diagrammed here:

https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md

If we could implement this vhost-user QEMU target as a virtio driver in the kernel, as an alternative to the virtio-pci driver, it could bind an SPDK vhost into the host kernel as a virtio device enumerated in the /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block diagram.

Since we will not have a real bus to signal for the driver to probe for new devices, we can use a sysfs interface for the application to notify the driver of a new socket and eventfd pair to set up a new virtio-scsi instance. Otherwise the design simply moves the vhost-user driver from the QEMU application into the host kernel itself.

It's my understanding that this would avoid a lot of the system calls and copies compared to exposing an iSCSI device or NBD device, as we're currently discussing.

Does this seem feasible?

Thanks,
Brian
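[Editorial note: to make the sysfs notification idea above concrete, here is a minimal, hypothetical sketch of the host-kernel side: a module exposing a write-only attribute where an application writes a "<socket_fd> <eventfd>" pair for each new virtio-scsi instance. None of these names come from an existing driver, and the real work (the vhost-user handshake over the socket and virtio device registration) is only indicated by a comment.]

/*
 * Hypothetical sketch only: sysfs hook for handing an in-kernel vhost-user
 * virtio driver a UNIX-socket fd and an eventfd for a new virtio-scsi
 * instance.  Module, kobject and attribute names are illustrative.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/file.h>
#include <linux/eventfd.h>
#include <linux/err.h>

static struct kobject *vvu_kobj;

/* Userspace writes "<socket_fd> <event_fd>" to /sys/kernel/vhost_user_virtio/attach. */
static ssize_t attach_store(struct kobject *kobj, struct kobj_attribute *attr,
			    const char *buf, size_t count)
{
	int sock_fd, event_fd;
	struct file *sock_file;
	struct eventfd_ctx *kick;

	if (sscanf(buf, "%d %d", &sock_fd, &event_fd) != 2)
		return -EINVAL;

	sock_file = fget(sock_fd);	/* fds belong to the writing process */
	if (!sock_file)
		return -EBADF;

	kick = eventfd_ctx_fdget(event_fd);
	if (IS_ERR(kick)) {
		fput(sock_file);
		return PTR_ERR(kick);
	}

	/*
	 * A real driver would now run the vhost-user handshake over the
	 * socket (features, memory table, vring addresses) and register a
	 * new virtio-scsi device so it shows up as /dev/sd[a-z]+.  This
	 * sketch just drops the references again.
	 */
	eventfd_ctx_put(kick);
	fput(sock_file);
	return count;
}

static struct kobj_attribute attach_attr = __ATTR_WO(attach);

static int __init vvu_init(void)
{
	vvu_kobj = kobject_create_and_add("vhost_user_virtio", kernel_kobj);
	if (!vvu_kobj)
		return -ENOMEM;
	return sysfs_create_file(vvu_kobj, &attach_attr.attr);
}

static void __exit vvu_exit(void)
{
	kobject_put(vvu_kobj);
}

module_init(vvu_init);
module_exit(vvu_exit);
MODULE_LICENSE("GPL");

[The application side would then be something like: echo "$SOCK_FD $EVENT_FD" > /sys/kernel/vhost_user_virtio/attach, with both descriptors open in the writing process.]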
On 9/5/19, 12:32 PM, "Mittal, Rishabh" wrote:

Hi Paul.

Thanks for investigating it. We have one more idea floating around. Brian is going to send you a proposal shortly. If the other proposal seems feasible to you, then we can evaluate the work required for both of the proposals.

Thanks
Rishabh Mittal

On 9/5/19, 11:09 AM, "Luse, Paul E" wrote:

Hi,

So I was able to perform the same steps here, and I think one of the keys to really seeing what's going on is to start perf top like this:

"perf top --sort comm,dso,symbol -C 0"

to get a more focused view by sorting on command, shared object and symbol.

Attached are 2 snapshots, one with a NULL back end for nbd and one with libaio/nvme. Some notes after chatting with Ben a bit; please read through and let us know what you think:

* In both cases the vast majority of the highest-overhead activities are in the kernel.
* The "copy_user_enhanced" symbol on the NULL case (it shows up on the other as well, but you have to scroll way down to see it) is the user/kernel space copy; nothing SPDK can do about that.
* The syscalls that dominate in both cases are likely something that can be improved on by changing how SPDK interacts with nbd. Ben had a couple of ideas, including (a) using libaio to interact with the nbd fd as opposed to interacting with the nbd socket, and (b) "batching" wherever possible, for example on writes to nbd, investigate not ack'ing them until some number have completed (a rough sketch of this batching idea follows right after this message).
* The kernel slab* commands are likely nbd kernel driver allocations/frees in the I/O path; one possibility would be to look at optimizing the nbd kernel driver for this one.
* The libc item on the NULL chart also shows up on the libaio profile, however it is again way down the scroll so it didn't make the screenshot :) This could be a zeroing of something somewhere in the SPDK nbd driver.

It looks like this data supports what Ben had suspected a while back: much of the overhead we're looking at is kernel nbd. Anyway, let us know what you think, whether you want to explore any of the ideas above any further, or if you see something else in the data that looks worthy of note.

Thx
Paul
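[Editorial note: purely as an illustration of idea (b) above, a batched completion path for the SPDK nbd backend might look roughly like the sketch below. All names here (reply_batch, batch_add, batch_flush, MAX_BATCH) are hypothetical, not the actual SPDK nbd code; the sketch only covers completions that carry no data payload (writes), since read completions would also need their data buffers appended to the iovec, and a real implementation would have to handle partial writev() results.]

/*
 * Hypothetical sketch, not SPDK code: accumulate nbd completions and push
 * them to the kernel with one writev() instead of one write() per I/O.
 */
#include <linux/nbd.h>
#include <arpa/inet.h>
#include <sys/uio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define MAX_BATCH 32

struct reply_batch {
	struct nbd_reply replies[MAX_BATCH];
	struct iovec iov[MAX_BATCH];
	int count;
};

/* Queue one completed command, identified by its 8-byte nbd handle. */
static void
batch_add(struct reply_batch *b, const char *handle, uint32_t error)
{
	struct nbd_reply *r;

	assert(b->count < MAX_BATCH);	/* caller flushes before the batch fills */
	r = &b->replies[b->count];
	r->magic = htonl(NBD_REPLY_MAGIC);
	r->error = htonl(error);
	memcpy(r->handle, handle, sizeof(r->handle));

	b->iov[b->count].iov_base = r;
	b->iov[b->count].iov_len = sizeof(*r);
	b->count++;
}

/* Flush once per poller iteration, or whenever the batch fills up. */
static ssize_t
batch_flush(struct reply_batch *b, int nbd_sock)
{
	ssize_t rc = 0;

	if (b->count > 0) {
		rc = writev(nbd_sock, b->iov, b->count);
		b->count = 0;
	}
	return rc;
}

[The point is simply to amortize the per-command syscall cost that shows up in the profiles: many completions, one kernel crossing.]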
-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
Sent: Wednesday, September 4, 2019 4:27 PM
To: Mittal, Rishabh ; Walker, Benjamin ; Harris, James R ; spdk(a)lists.01.org
Cc: Chen, Xiaoxi ; Szmyd, Brian ; Kadayam, Hari
Subject: Re: [SPDK] NBD with SPDK

Cool, thanks for sending this. I will try and repro tomorrow here and see what kind of results I get.

Thx
Paul

-----Original Message-----
From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
Sent: Wednesday, September 4, 2019 4:23 PM
To: Luse, Paul E ; Walker, Benjamin ; Harris, James R ; spdk(a)lists.01.org
Cc: Chen, Xiaoxi ; Kadayam, Hari ; Szmyd, Brian
Subject: Re: [SPDK] NBD with SPDK

Avg CPU utilization is very low when I am running this.

09/04/2019 04:21:40 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           2.59   0.00     2.57     0.00    0.00  94.84

Device  r/s   w/s       rkB/s  wkB/s      rrqm/s  wrqm/s    %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
sda     0.00  0.20      0.00   0.80       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      4.00      0.00   0.00
sdb     0.00  0.00      0.00   0.00       0.00    0.00      0.00   0.00   0.00     0.00     0.00    0.00      0.00      0.00   0.00
sdc     0.00  28846.80  0.00   191555.20  0.00    18211.00  0.00   38.70  0.00     1.03     29.64   0.00      6.64      0.03   100.00
nb0     0.00  47297.00  0.00   191562.40  0.00    593.60    0.00   1.24   0.00     1.32     61.83   0.00      4.05      0

On 9/4/19, 4:19 PM, "Mittal, Rishabh" wrote:

I am using this command:

fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0 --bsrange=4k-4k --direct=1 --filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting

I have created the device by using these commands:
1. ./root/spdk/app/vhost
2. ./rpc.py bdev_aio_create /dev/sdc aio0
3. ./rpc.py start_nbd_disk aio0 /dev/nbd0

I am using "perf top" to get the performance data.

On 9/4/19, 4:03 PM, "Luse, Paul E" wrote:

Hi Rishabh,

Maybe it would help (me at least) if you described the complete & exact steps for your test - both setup of the env & test, and the command to profile. Can you send that out?

Thx
Paul

-----Original Message-----
From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
Sent: Wednesday, September 4, 2019 2:45 PM
To: Walker, Benjamin ; Harris, James R ; spdk(a)lists.01.org; Luse, Paul E
Cc: Chen, Xiaoxi ; Kadayam, Hari ; Szmyd, Brian
Subject: Re: [SPDK] NBD with SPDK

Yes, I am using 64 q depth with one thread in fio. I am using AIO. This profiling is for the entire system. I don't know why spdk threads are idle.

On 9/4/19, 11:08 AM, "Walker, Benjamin" wrote:

On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> I got the run again. It is with 4k write.
>
> 13.16%  vhost  [.] spdk_ring_dequeue
>  6.08%  vhost  [.] rte_rdtsc
>  4.77%  vhost  [.] spdk_thread_poll
>  2.85%  vhost  [.] _spdk_reactor_run
>

You're doing high queue depth for at least 30 seconds while the trace runs, right? Using fio with the libaio engine on the NBD device is probably the way to go. Are you limiting the profiling to just the core where the main SPDK process is pinned? I'm asking because SPDK still appears to be mostly idle, and I suspect the time is being spent in some other thread (in the kernel).

Consider capturing a profile for the entire system. It will have fio stuff in it, but the expensive stuff still should generally bubble up to the top.

Thanks,
Ben

> On 8/29/19, 6:05 PM, "Mittal, Rishabh" wrote:
>
> I got the profile with first run.
>
> 27.91%  vhost               [.] spdk_ring_dequeue
> 12.94%  vhost               [.] rte_rdtsc
> 11.00%  vhost               [.] spdk_thread_poll
>  6.15%  vhost               [.] _spdk_reactor_run
>  4.35%  [kernel]            [k] syscall_return_via_sysret
>  3.91%  vhost               [.] _spdk_msg_queue_run_batch
>  3.38%  vhost               [.] _spdk_event_queue_run_batch
>  2.83%  [unknown]           [k] 0xfffffe000000601b
>  1.45%  vhost               [.] spdk_thread_get_from_ctx
>  1.20%  [kernel]            [k] __fget
>  1.14%  libpthread-2.27.so  [.] __libc_read
>  1.00%  libc-2.27.so        [.] 0x000000000018ef76
>  0.99%  libc-2.27.so        [.] 0x000000000018ef79
>
> Thanks
> Rishabh Mittal
>
> On 8/19/19, 7:42 AM, "Luse, Paul E" wrote:
>
> That's great. Keep an eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.
>
> Don't forget about the community meetings either, great place to chat about these kinds of things.
> https://spdk.io/community/
> Next one is tomorrow morning US time.
>
> Thx
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
> Sent: Thursday, August 15, 2019 6:50 PM
> To: Harris, James R ; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
> Cc: Mittal, Rishabh ; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian ; Kadayam, Hari <hkadayam(a)ebay.com>
> Subject: Re: [SPDK] NBD with SPDK
>
> Thanks. I will get the profiling by next week.
>
> On 8/15/19, 6:26 PM, "Harris, James R" wrote:
>
> On 8/15/19, 4:34 PM, "Mittal, Rishabh" wrote:
>
> Hi Jim
>
> What tool do you use for profiling?
>
> Hi Rishabh,
>
> Mostly I just use "perf top".
>
> -Jim
>
> Thanks
> Rishabh Mittal
>
> On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
>
> On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
>
> When an I/O is performed in the process initiating the I/O to a file, the data goes into the OS page cache buffers at a layer far above the bio stack (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to your kernel driver, your kernel driver would still need to copy it to that location out of the page cache buffers. We can't safely share the page cache buffers with a user space process.
>
> I think Rishabh was suggesting the SPDK reserve the virtual address space only. Then the kernel could map the page cache buffers into that virtual address space. That would not require a data copy, but would require the mapping operations.
>
> I think the profiling data would be really helpful - to quantify how much of the 50us is due to copying the 4KB of data. That can help drive next steps on how to optimize the SPDK NBD module.
>
> Thanks,
>
> -Jim
>
> As Paul said, I'm skeptical that the memcpy is significant in the overall performance you're measuring. I encourage you to go look at some profiling data and confirm that the memcpy is really showing up. I suspect the overhead is instead primarily in these spots:
>
> 1) Dynamic buffer allocation in the SPDK NBD backend.
>
> As Paul indicated, the NBD target is dynamically allocating memory for each I/O. The NBD backend wasn't designed to be fast - it was designed to be simple. Pooling would be a lot faster and is something fairly easy to implement. [A hedged sketch of such a pool appears after this quoted thread.]
>
> 2) The way SPDK does the syscalls when it implements the NBD backend.
>
> Again, the code was designed to be simple, not high performance. It simply calls read() and write() on the socket for each command. There are much higher performance ways of doing this, they're just more complex to implement.
>
> 3) The lack of multi-queue support in NBD
>
> Every I/O is funneled through a single sockpair up to user space. That means there is locking going on. I believe this is just a limitation of NBD today - it doesn't plug into the block-mq stuff in the kernel and expose multiple sockpairs. But someone more knowledgeable on the kernel stack would need to take a look.
>
> Thanks,
> Ben
>
> > A couple of things that I am not really sure about in this flow are:
> > 1. How memory registration is going to work with the RDMA driver.
> > 2. What changes are required in SPDK memory management.
> >
> > Thanks
> > Rishabh Mittal
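[Editorial note: purely as an illustration of Ben's point (1) above, a pooled-buffer approach in the SPDK NBD backend could look something like the sketch below. It uses SPDK's spdk_mempool API, but the surrounding names (nbd_io_buf_pool, NBD_IO_BUF_COUNT, NBD_IO_BUF_SIZE and the helpers) are hypothetical and the sizes are placeholders, not tuned values or actual SPDK code.]

/*
 * Hypothetical sketch: preallocate a fixed pool of I/O buffers at startup
 * so the nbd I/O path never calls malloc()/free().
 */
#include <errno.h>
#include "spdk/env.h"

#define NBD_IO_BUF_COUNT 512
#define NBD_IO_BUF_SIZE  (128 * 1024)

static struct spdk_mempool *nbd_io_buf_pool;

static int
nbd_io_buf_pool_init(void)
{
	nbd_io_buf_pool = spdk_mempool_create("nbd_io_buf",
					      NBD_IO_BUF_COUNT,
					      NBD_IO_BUF_SIZE,
					      SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
					      SPDK_ENV_SOCKET_ID_ANY);
	return nbd_io_buf_pool ? 0 : -ENOMEM;
}

/* On the I/O path: grab a buffer from the pool instead of allocating one. */
static void *
nbd_io_buf_get(void)
{
	return spdk_mempool_get(nbd_io_buf_pool);
}

/* On completion: return the buffer to the pool. */
static void
nbd_io_buf_put(void *buf)
{
	spdk_mempool_put(nbd_io_buf_pool, buf);
}

[The point of the pool is that buffers are allocated once up front and handed out/returned on the hot path with no allocator involvement, which is exactly the per-I/O allocation cost called out above.]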
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk