* Re: [SPDK] NBD with SPDK
@ 2019-08-30 22:28 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-30 22:28 UTC (permalink / raw)
  To: spdk


I got the run again. It is with 4k writes.

13.16%  vhost                       [.] spdk_ring_dequeue                                                                          
   6.08%  vhost                       [.] rte_rdtsc                                                                                  
   4.77%  vhost                       [.] spdk_thread_poll                                                                           
   2.85%  vhost                       [.] _spdk_reactor_run                                                                          
   2.43%  [kernel]                    [k] syscall_return_via_sysret                                                                  
   2.17%  [kernel]                    [k] copy_user_enhanced_fast_string                                                             
   2.05%  [kernel]                    [k] _raw_spin_lock                                                                             
   1.83%  vhost                       [.] _spdk_msg_queue_run_batch                                                                  
   1.56%  vhost                       [.] _spdk_event_queue_run_batch                                                                
   1.56%  [kernel]                    [k] memcpy_erms                                                                                
   1.39%  [kernel]                    [k] switch_mm_irqs_off                                                                         
   1.33%  [kernel]                    [k] radix_tree_next_chunk                                                                      
   1.17%  [kernel]                    [k] native_queued_spin_lock_slowpath                                                           
   1.13%  [unknown]                   [k] 0xfffffe000000601b                                                                         
   1.02%  [kernel]                    [k] _raw_spin_lock_irqsave                                                                     
   0.94%  [kernel]                    [k] unix_stream_read_generic                                                                   
   0.92%  [kernel]                    [k] load_new_mm_cr3                                                                            
   0.87%  [kernel]                    [k] _raw_spin_lock_irq                                                                         
   0.83%  [kernel]                    [k] cmpxchg_double_slab.isra.61                                                                
   0.78%  [kernel]                    [k] mutex_lock                                                                                 
   0.78%  [kernel]                    [k] unix_stream_sendmsg                                                                        
   0.77%  [kernel]                    [k] sock_wfree                                                                                 
   0.74%  [kernel]                    [k] __schedule                                                                                 

On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

    I got the profile from the first run.
    
      27.91%  vhost                       [.] spdk_ring_dequeue                                                                          
      12.94%  vhost                       [.] rte_rdtsc                                                                                  
      11.00%  vhost                       [.] spdk_thread_poll                                                                           
       6.15%  vhost                       [.] _spdk_reactor_run                                                                          
       4.35%  [kernel]                    [k] syscall_return_via_sysret                                                                  
       3.91%  vhost                       [.] _spdk_msg_queue_run_batch                                                                  
       3.38%  vhost                       [.] _spdk_event_queue_run_batch                                                                
       2.83%  [unknown]                   [k] 0xfffffe000000601b                                                                         
       1.45%  vhost                       [.] spdk_thread_get_from_ctx                                                                   
       1.20%  [kernel]                    [k] __fget                                                                                     
       1.14%  libpthread-2.27.so          [.] __libc_read                                                                                
       1.00%  libc-2.27.so                [.] 0x000000000018ef76                                                                         
       0.99%  libc-2.27.so                [.] 0x000000000018ef79          
    
    Thanks
    Rishabh Mittal                         
    
    On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    
        That's great.  Keep an eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.
        
        Don't forget about the community meetings either, great place to chat about these kinds of things.  https://spdk.io/community/  Next one is tomorrow morning US time.
        
        Thx
        Paul
        
        -----Original Message-----
        From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
        Sent: Thursday, August 15, 2019 6:50 PM
        To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
        Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
        Subject: Re: [SPDK] NBD with SPDK
        
        Thanks. I will get the profiling by next week. 
        
        On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
        
            
            
            On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
            
                Hi Jim
                
                What tool do you use for profiling?
            
            Hi Rishabh,
            
            Mostly I just use "perf top".
            
            -Jim
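
            (For reference, the profiling variants that come up later in this thread are plain perf usage,
            nothing SPDK-specific - e.g. restricting the view to the core the SPDK reactor is pinned to, or
            recording a system-wide profile to inspect afterwards; the core number below is only an example.)

                perf top --sort comm,dso,symbol -C 0
                perf record -g -a -- sleep 30
                perf report --sort comm,dso,symbol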
            
                
                Thanks
                Rishabh Mittal
                
                On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
                
                    
                    
                    On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
                    
                    <trim>
                        
                        When an I/O is performed in the process initiating the I/O to a file, the data
                        goes into the OS page cache buffers at a layer far above the bio stack
                        (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
                        your kernel driver, your kernel driver would still need to copy it to that
                        location out of the page cache buffers. We can't safely share the page cache
                        buffers with a user space process.
                       
                    I think Rishabh was suggesting that SPDK reserve the virtual address space only.
                    Then the kernel could map the page cache buffers into that virtual address space.
                    That would not require a data copy, but would require the mapping operations.
                    
                    I think the profiling data would be really helpful - to quantify how much of the 50us
                    is due to copying the 4KB of data.  That can help drive next steps on how to optimize
                    the SPDK NBD module.
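
                    (For a rough sense of scale, making no assumptions about this particular setup: a single
                    4KB memcpy at an effective copy bandwidth of 5-10 GB/s takes roughly 0.4-0.8us, i.e. only
                    a percent or two of the ~50us end-to-end latency in question - consistent with the
                    suspicion below that the copy itself is unlikely to be the dominant cost.)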
                    
                    Thanks,
                    
                    -Jim
                    
                    
                        As Paul said, I'm skeptical that the memcpy is significant in the overall
                        performance you're measuring. I encourage you to go look at some profiling data
                        and confirm that the memcpy is really showing up. I suspect the overhead is
                        instead primarily in these spots:
                        
                        1) Dynamic buffer allocation in the SPDK NBD backend.
                        
                        As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
                        The NBD backend wasn't designed to be fast - it was designed to be simple.
                        Pooling would be a lot faster and is something fairly easy to implement.
                        
                        2) The way SPDK does the syscalls when it implements the NBD backend.
                        
                        Again, the code was designed to be simple, not high performance. It simply calls
                        read() and write() on the socket for each command. There are much higher
                        performance ways of doing this, they're just more complex to implement.
                        
                        3) The lack of multi-queue support in NBD
                        
                        Every I/O is funneled through a single sockpair up to user space. That means
                        there is locking going on. I believe this is just a limitation of NBD today - it
                        doesn't plug into the block-mq stuff in the kernel and expose multiple
                        sockpairs. But someone more knowledgeable on the kernel stack would need to take
                        a look.
                        
                        Thanks,
                        Ben
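
                        (A minimal sketch of the pooling idea in point 1 above, assuming the spdk_mempool_*
                        helpers from spdk/env.h; the pool depth and buffer size are illustrative placeholders,
                        not tuned values, and the surrounding NBD plumbing is omitted.)

                            #include <errno.h>
                            #include "spdk/env.h"

                            #define NBD_IO_BUF_SIZE  (64 * 1024)  /* illustrative per-I/O upper bound */
                            #define NBD_IO_BUF_COUNT 256          /* illustrative pool depth */

                            static struct spdk_mempool *g_nbd_buf_pool;

                            /* Create the pool once at start-up instead of allocating per I/O. */
                            static int
                            nbd_buf_pool_init(void)
                            {
                                    g_nbd_buf_pool = spdk_mempool_create("nbd_io_bufs",
                                                    NBD_IO_BUF_COUNT, NBD_IO_BUF_SIZE,
                                                    SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                                    SPDK_ENV_SOCKET_ID_ANY);
                                    return g_nbd_buf_pool ? 0 : -ENOMEM;
                            }

                            /* Per I/O: take a pre-allocated buffer from the pool ... */
                            static void *
                            nbd_buf_get(void)
                            {
                                    return spdk_mempool_get(g_nbd_buf_pool);
                            }

                            /* ... and return it on completion instead of freeing it. */
                            static void
                            nbd_buf_put(void *buf)
                            {
                                    spdk_mempool_put(g_nbd_buf_pool, buf);
                            }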
                        
                        > 
                        > Couple of things that I am not really sure about in this flow :- 1. How memory
                        > registration is going to work with the RDMA driver.
                        > 2. What changes are required in SPDK memory management
                        > 
                        > Thanks
                        > Rishabh Mittal
                        
                    
                    
                
                
            
            
        
        _______________________________________________
        SPDK mailing list
        SPDK(a)lists.01.org
        https://lists.01.org/mailman/listinfo/spdk
        
    
    



* Re: [SPDK] NBD with SPDK
@ 2019-09-23  1:03 Huang Zhiteng
  0 siblings, 0 replies; 32+ messages in thread
From: Huang Zhiteng @ 2019-09-23  1:03 UTC (permalink / raw)
  To: spdk


BTW, our attempt to run SPDK with Kata is currently blocked by this
issue: https://github.com/spdk/spdk/issues/946. It would be nice if the
community could take a look at it and see how we can fix this. Thank
you.

On Sat, Sep 7, 2019 at 4:31 AM Kadayam, Hari <hkadayam(a)ebay.com> wrote:
>
> Kata containers have an additional layer of indirection compared to Docker, which potentially affects performance, right?
>
> Also, the memory protection concern with virtio is valid, but we could possibly look at containing the accessible memory. In any case, I think an SPDK application wouldn't be accessing any buffer other than the IO buffers in that space.
>
> On 9/6/19, 10:13 AM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>
>     I am summarizing all the options. We have three options:
>
>     1. SPDK with NBD :- It needs a few optimizations in the SPDK-NBD path to reduce system call overhead. We can also explore using multiple sockets, since NBD supports that. One disadvantage is that there will be a bcopy for every IO from the kernel to SPDK (or vice versa for reads). The overhead of the bcopy compared to end-to-end latency is very low for a 4k workload, but we need to see its impact for larger read/write sizes.
>
>     2. SPDK with virtio :- It doesn't require any changes in SPDK (assuming that the SPDK virtio target is written for performance), but we need a customized kernel module which can work with the SPDK virtio target. Its obvious advantage is that the kernel buffer cache will be shared with SPDK, so there will be no copy from kernel to SPDK. Another advantage is that there will be minimal system calls to ring the doorbell, since it will use a shared ring queue. My only concern here is that memory protection will be lost, as the entire kernel buffer space will be shared with SPDK.
>
>     3. SPDK with Kata containers :- It doesn't require many changes (Xiaoxi can comment more on this). But our concern is that apps will not be moved to Kata containers, which will slow down its adoption rate.
>
>     Please feel free to add pros/cons of any approach if I missed anything. It will help us to decide.
>
>
>     Thanks
>     Rishabh Mittal
>
>     On 9/5/19, 7:14 PM, "Szmyd, Brian" <bszmyd(a)ebay.com> wrote:
>
>         I believe this option has the same number of copies, since you're still sharing the memory
>         with the KATA VM kernel, not the application itself. This is an option that the development
>         of a virtio-vhost-user driver does not prevent; it's merely an option to allow non-KATA
>         containers to also use the same device.
>
>         I will note that doing a virtio-vhost-user driver also allows one to project device types
>         other than just block devices into the kernel device stack. One could also write a user application
>         that exposed an input, network, console, gpu or socket device as well.
>
>         Not that I have any interest in these...
>
>         On 9/5/19, 8:08 PM, "Huang Zhiteng" <winston.d(a)gmail.com> wrote:
>
>             Since this SPDK bdev is intended to be consumed by a user application
>             running inside a container, we do have the possibility of running the user
>             application inside a Kata container instead.  Kata containers do
>             introduce a layer of IO virtualization; therefore we convert a user-space
>             block device on the host to a kernel block device inside the VM, but
>             with fewer memory copies than NBD thanks to SPDK vhost.  A Kata container
>             might impose higher overhead than a plain container, but hopefully it's
>             lightweight enough that the overhead is negligible.
>
>             On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin
>             <benjamin.walker(a)intel.com> wrote:
>             >
>             > On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
>             > > Hi Paul,
>             > >
>             > > Rather than put the effort into a formalized document here is a brief
>             > > description of the solution I have been investigating just to get an opinion
>             > > of feasibility or even workability.
>             > >
>             > > Some background and a reiteration of the problem to set things up. I apologize
>             > > if I reiterate anything or include details that some may already know.
>             > >
>             > > We are looking for a solution that allows us to write a custom bdev for the
>             > > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
>             > > have attached and then present that to our application as either a raw block
>             > > device or filesystem mountpoint.
>             > >
>             > > This is normally (as I understand it) done by exposing a device via QEMU to
>             > > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
>             > > (among others) device according to this spec:
>             > >
>             > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
>             > >
>             > > The VM kernel then uses a virtio-scsi module to attach said device into its
>             > > SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
>             > >
>             > > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
>             > > pci driver to discover the virtio devices and bind them to the virtio-scsi
>             > > driver. There really is no other way (other than platform MMIO type devices)
>             > > to attach a device to the virtio-scsi device.
>             > >
>             > > SPDK exposes the virtio device to the VM via QEMU which has written a "user
>             > > space" version of the vhost bus. This driver then translates the API into the
>             > > virtio-pci specification:
>             > >
>             > > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
>             > >
>             > > This uses an eventfd descriptor for interrupting the non-polling side of the
>             > > queue and a UNIX domain socket to setup (and control) the shared memory which
>             > > contains the I/O buffers and virtio queues. This is documented in SPDK's own
>             > > documentation and diagramed here:
>             > >
>             > > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
>             > >
>             > > If we could implement this vhost-user QEMU target as a virtio driver in the
>             > > kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
>             > > into the host kernel as a virtio device and enumerated in the /dev/sd[a-z]+
>             > > tree for our containers to bind. Attached is draft block diagram.
>             >
>             > If you think of QEMU as just another user-space process, and the SPDK vhost
>             > target as a user-space process, then it's clear that vhost-user is simply a
>             > cross-process IPC mechanism based on shared memory. The "shared memory" part is
>             > the critical part of that description - QEMU pre-registers all of the memory
>             > that will be used for I/O buffers (in fact, all of the memory that is mapped
>             > into the guest) with the SPDK process by sending fds across a Unix domain
>             > socket.
>             >
>             > If you move this code into the kernel, you have to solve two issues:
>             >
>             > 1) What memory is it registering with the SPDK process? The kernel driver has no
>             > idea which application process may route I/O to it - in fact the application
>             > process may not even exist yet - so it isn't memory allocated to the application
>             > process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
>             > process, and when the application process performs I/O the kernel copies into
>             > those buffers prior to telling SPDK about them? That would work, but now you're
>             > back to doing a data copy. I do think you can get it down to 1 data copy instead
>             > of 2 with a scheme like this.
>             >
>             > 2) One of the big performance problems you're seeing is syscall overhead in NBD.
>             > If you still have a kernel block device that routes messages up to the SPDK
>             > process, the application process is making the same syscalls because it's still
>             > interacting with a block device in the kernel, but you're right that the backend
>             > SPDK implementation could be polling on shared memory rings and potentially run
>             > more efficiently.
>             >
>             > >
>             > > Since we will not have a real bus to signal for the driver to probe for new
>             > > devices we can use a sysfs interface for the application to notify the driver
>             > > of a new socket and eventfd pair to setup a new virtio-scsi instance.
>             > > Otherwise the design simply moves the vhost-user driver from the QEMU
>             > > application into the Host kernel itself.
>             > >
>             > > It's my understanding that this will avoid a lot more system calls and copies
>             > > compared to exposing an iSCSI device or NBD device as we're currently
>             > > discussing. Does this seem feasible?
>             >
>             > What you really want is a "block device in user space" solution that's higher
>             > performance than NBD, and while that's been tried many, many times in the past I
>             > do think there is a great opportunity here for someone. I'm not sure that the
>             > interface between the block device process and the kernel is best done as a
>             > modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
>             > to throw in a third option to consider - use NVMe queues in shared memory as the
>             > interface instead. The NVMe queues are going to be much more efficient than
>             > virtqueues for storage commands.
>             >
>             > >
>             > > Thanks,
>             > > Brian
>             > >
>             > > On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>             > >
>             > >     Hi Paul.
>             > >
>             > >     Thanks for investigating it.
>             > >
>             > >     We have one more idea floating around. Brian is going to send you a
>             > > proposal shortly. If other proposal seems feasible to you that we can evaluate
>             > > the work required in both the proposals.
>             > >
>             > >     Thanks
>             > >     Rishabh Mittal
>             > >
>             > >     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>             > >
>             > >         Hi,
>             > >
>             > >         So I was able to perform the same steps here and I think one of the
>             > > keys to really seeing what's going on is to start perftop like this:
>             > >
>             > >          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
>             > > sorting on command, shared object and symbol
>             > >
>             > >         Attached are 2 snapshots, one with a NULL back end for nbd and one
>             > > with libaio/nvme.  Some notes after chatting with Ben a bit, please read
>             > > through and let us know what you think:
>             > >
>             > >         * in both cases the vast majority of the highest overhead activities
>             > > are kernel
>             > >         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
>             > > other as well but you have to scroll way down to see it) and is the
>             > > user/kernel space copy, nothing SPDK can do about that
>             > >         * the syscalls that dominate in both cases are likely something that
>             > > can be improved on by changing how SPDK interacts with nbd. Ben had a couple
>             > > of ideas including (a) using libaio to interact with the nbd fd as opposed to
>             > > interacting with the nbd socket, (b) "batching" wherever possible, for example
>             > > on writes to nbd investigate not ack'ing them until some number have completed
>             > >         * the kernel slab* commands are likely nbd kernel driver
>             > > allocations/frees in the IO path, one possibility would be to look at
>             > > optimizing the nbd kernel driver for this one
>             > >         * the libc item on the NULL chart also shows up on the libaio profile
>             > > however is again way down the scroll so it didn't make the screenshot :)  This
>             > > could be a zeroing of something somewhere in the SPDK nbd driver
>             > >
>             > >         It looks like this data supports what Ben had suspected a while back,
>             > > much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
>             > > you think and if you want to explore any of the ideas above any further or see
>             > > something else in the data that looks worthy to note.
>             > >
>             > >         Thx
>             > >         Paul
>             > >
>             > >
>             > >
>             > >         -----Original Message-----
>             > >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
>             > > E
>             > >         Sent: Wednesday, September 4, 2019 4:27 PM
>             > >         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
>             > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
>             > > spdk(a)lists.01.org
>             > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
>             > > Kadayam, Hari <hkadayam(a)ebay.com>
>             > >         Subject: Re: [SPDK] NBD with SPDK
>             > >
>             > >         Cool, thanks for sending this.  I will try and repro tomorrow here and
>             > > see what kind of results I get
>             > >
>             > >         Thx
>             > >         Paul
>             > >
>             > >         -----Original Message-----
>             > >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
>             > >         Sent: Wednesday, September 4, 2019 4:23 PM
>             > >         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
>             > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
>             > > spdk(a)lists.01.org
>             > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
>             > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
>             > >         Subject: Re: [SPDK] NBD with SPDK
>             > >
>             > >         Avg CPU utilization is very low when I am running this.
>             > >
>             > >         09/04/2019 04:21:40 PM
>             > >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             > >                    2.59    0.00    2.57    0.00    0.00   94.84
>             > >
>             > >         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
>             > > rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
>             > >         sda              0.00    0.20      0.00      0.80     0.00     0.00
>             > > 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
>             > >         sdb              0.00    0.00      0.00      0.00     0.00     0.00
>             > > 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
>             > >         sdc              0.00 28846.80      0.00 191555.20     0.00
>             > > 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
>             > >         nb0              0.00 47297.00      0.00
>             > > 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00
>             > > 4.05   0
>             > >
>             > >
>             > >
>             > >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>             > >
>             > >             I am using this command
>             > >
>             > >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
>             > > rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
>             > > runtime 120 --time_based --group_reporting
>             > >
>             > >             I have created the device by using these commands
>             > >               1.  ./root/spdk/app/vhost
>             > >               2.  ./rpc.py bdev_aio_create /dev/sdc aio0
>             > >               3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
>             > >
>             > >             I am using  "perf top"  to get the performance
>             > >
>             > >             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>             > >
>             > >                 Hi Rishabh,
>             > >
>             > >                 Maybe it would help (me at least) if you described the
>             > > complete & exact steps for your test - both setup of the env & test and
>             > > command to profile.  Can you send that out?
>             > >
>             > >                 Thx
>             > >                 Paul
>             > >
>             > >                 -----Original Message-----
>             > >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
>             > >                 Sent: Wednesday, September 4, 2019 2:45 PM
>             > >                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
>             > > James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
>             > > paul.e.luse(a)intel.com>
>             > >                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
>             > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
>             > >                 Subject: Re: [SPDK] NBD with SPDK
>             > >
>             > >                 Yes, I am using 64 q depth with one thread in fio. I am using
>             > > AIO. This profiling is for the entire system. I don't know why spdk threads
>             > > are idle.
>             > >
>             > >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
>             > > benjamin.walker(a)intel.com> wrote:
>             > >
>             > >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
>             > >                     > I got the run again. It is with 4k write.
>             > >                     >
>             > >                     > 13.16%  vhost                       [.]
>             > >                     >
>             > > spdk_ring_dequeue
>             > >                     >
>             > >                     >    6.08%  vhost                       [.]
>             > >                     >
>             > > rte_rdtsc
>             > >                     >
>             > >                     >    4.77%  vhost                       [.]
>             > >                     >
>             > > spdk_thread_poll
>             > >                     >
>             > >                     >    2.85%  vhost                       [.]
>             > >                     >
>             > > _spdk_reactor_run
>             > >                     >
>             > >
>             > >                     You're doing high queue depth for at least 30 seconds
>             > > while the trace runs,
>             > >                     right? Using fio with the libaio engine on the NBD device
>             > > is probably the way to
>             > >                     go. Are you limiting the profiling to just the core where
>             > > the main SPDK process
>             > >                     is pinned? I'm asking because SPDK still appears to be
>             > > mostly idle, and I
>             > >                     suspect the time is being spent in some other thread (in
>             > > the kernel). Consider
>             > >                     capturing a profile for the entire system. It will have
>             > > fio stuff in it, but the
>             > >                     expensive stuff still should generally bubble up to the
>             > > top.
>             > >
>             > >                     Thanks,
>             > >                     Ben
>             > >
>             > >
>             > >                     >
>             > >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
>             > > rimittal(a)ebay.com> wrote:
>             > >                     >
>             > >                     >     I got the profile with first run.
>             > >                     >
>             > >                     >       27.91%  vhost                       [.]
>             > >                     >
>             > > spdk_ring_dequeue
>             > >                     >
>             > >                     >       12.94%  vhost                       [.]
>             > >                     >
>             > > rte_rdtsc
>             > >                     >
>             > >                     >       11.00%  vhost                       [.]
>             > >                     >
>             > > spdk_thread_poll
>             > >                     >
>             > >                     >        6.15%  vhost                       [.]
>             > >                     >
>             > > _spdk_reactor_run
>             > >                     >
>             > >                     >        4.35%  [kernel]                    [k]
>             > >                     >
>             > > syscall_return_via_sysret
>             > >                     >
>             > >                     >        3.91%  vhost                       [.]
>             > >                     >
>             > > _spdk_msg_queue_run_batch
>             > >                     >
>             > >                     >        3.38%  vhost                       [.]
>             > >                     >
>             > > _spdk_event_queue_run_batch
>             > >                     >
>             > >                     >        2.83%  [unknown]                   [k]
>             > >                     >
>             > > 0xfffffe000000601b
>             > >                     >
>             > >                     >        1.45%  vhost                       [.]
>             > >                     >
>             > > spdk_thread_get_from_ctx
>             > >                     >
>             > >                     >        1.20%  [kernel]                    [k]
>             > >                     >
>             > > __fget
>             > >                     >
>             > >                     >        1.14%  libpthread-2.27.so          [.]
>             > >                     >
>             > > __libc_read
>             > >                     >
>             > >                     >        1.00%  libc-2.27.so                [.]
>             > >                     >
>             > > 0x000000000018ef76
>             > >                     >
>             > >                     >        0.99%  libc-2.27.so                [.]
>             > > 0x000000000018ef79
>             > >                     >
>             > >                     >     Thanks
>             > >                     >     Rishabh Mittal
>             > >                     >
>             > >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
>             > > paul.e.luse(a)intel.com> wrote:
>             > >                     >
>             > >                     >         That's great.  Keep any eye out for the items
>             > > Ben mentions below - at
>             > >                     > least the first one should be quick to implement and
>             > > compare both profile data
>             > >                     > and measured performance.
>             > >                     >
>             > >                     >         Don’t' forget about the community meetings
>             > > either, great place to chat
>             > >                     > about these kinds of things.
>             > >                     >
>             > > https://spdk.io/community/
>             > >                     >   Next one is tomorrow morn US time.
>             > >                     >
>             > >                     >         Thx
>             > >                     >         Paul
>             > >                     >
>             > >                     >         -----Original Message-----
>             > >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
>             > > Behalf Of Mittal,
>             > >                     > Rishabh via SPDK
>             > >                     >         Sent: Thursday, August 15, 2019 6:50 PM
>             > >                     >         To: Harris, James R <james.r.harris(a)intel.com>;
>             > > Walker, Benjamin <
>             > >                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
>             > >                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
>             > > Xiaoxi <
>             > >                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
>             > > Kadayam, Hari <
>             > >                     > hkadayam(a)ebay.com>
>             > >                     >         Subject: Re: [SPDK] NBD with SPDK
>             > >                     >
>             > >                     >         Thanks. I will get the profiling by next week.
>             > >                     >
>             > >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
>             > > james.r.harris(a)intel.com>
>             > >                     > wrote:
>             > >                     >
>             > >                     >
>             > >                     >
>             > >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
>             > > rimittal(a)ebay.com> wrote:
>             > >                     >
>             > >                     >                 Hi Jim
>             > >                     >
>             > >                     >                 What tool you use to take profiling.
>             > >                     >
>             > >                     >             Hi Rishabh,
>             > >                     >
>             > >                     >             Mostly I just use "perf top".
>             > >                     >
>             > >                     >             -Jim
>             > >                     >
>             > >                     >
>             > >                     >                 Thanks
>             > >                     >                 Rishabh Mittal
>             > >                     >
>             > >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
>             > >                     > james.r.harris(a)intel.com> wrote:
>             > >                     >
>             > >                     >
>             > >                     >
>             > >                     >                     On 8/14/19, 9:18 AM, "Walker,
>             > > Benjamin" <
>             > >                     > benjamin.walker(a)intel.com> wrote:
>             > >                     >
>             > >                     >                     <trim>
>             > >                     >
>             > >                     >                         When an I/O is performed in the
>             > > process initiating the
>             > >                     > I/O to a file, the data
>             > >                     >                         goes into the OS page cache
>             > > buffers at a layer far
>             > >                     > above the bio stack
>             > >                     >                         (somewhere up in VFS). If SPDK
>             > > were to reserve some
>             > >                     > memory and hand it off to
>             > >                     >                         your kernel driver, your kernel
>             > > driver would still
>             > >                     > need to copy it to that
>             > >                     >                         location out of the page cache
>             > > buffers. We can't
>             > >                     > safely share the page cache
>             > >                     >                         buffers with a user space
>             > > process.
>             > >                     >
>             > >                     >                     I think Rishabh was suggesting the
>             > > SPDK reserve the
>             > >                     > virtual address space only.
>             > >                     >                     Then the kernel could map the page
>             > > cache buffers into that
>             > >                     > virtual address space.
>             > >                     >                     That would not require a data copy,
>             > > but would require the
>             > >                     > mapping operations.
>             > >                     >
>             > >                     >                     I think the profiling data would be
>             > > really helpful - to
>             > >                     > quantify how much of the 50us
>             > >                     >                     Is due to copying the 4KB of
>             > > data.  That can help drive
>             > >                     > next steps on how to optimize
>             > >                     >                     the SPDK NBD module.
>             > >                     >
>             > >                     >                     Thanks,
>             > >                     >
>             > >                     >                     -Jim
>             > >                     >
>             > >                     >
>             > >                     >                         As Paul said, I'm skeptical that
>             > > the memcpy is
>             > >                     > significant in the overall
>             > >                     >                         performance you're measuring. I
>             > > encourage you to go
>             > >                     > look at some profiling data
>             > >                     >                         and confirm that the memcpy is
>             > > really showing up. I
>             > >                     > suspect the overhead is
>             > >                     >                         instead primarily in these
>             > > spots:
>             > >                     >
>             > >                     >                         1) Dynamic buffer allocation in
>             > > the SPDK NBD backend.
>             > >                     >
>             > >                     >                         As Paul indicated, the NBD
>             > > target is dynamically
>             > >                     > allocating memory for each I/O.
>             > >                     >                         The NBD backend wasn't designed
>             > > to be fast - it was
>             > >                     > designed to be simple.
>             > >                     >                         Pooling would be a lot faster
>             > > and is something fairly
>             > >                     > easy to implement.
>             > >                     >
>             > >                     >                         2) The way SPDK does the
>             > > syscalls when it implements
>             > >                     > the NBD backend.
>             > >                     >
>             > >                     >                         Again, the code was designed to
>             > > be simple, not high
>             > >                     > performance. It simply calls
>             > >                     >                         read() and write() on the socket
>             > > for each command.
>             > >                     > There are much higher
>             > >                     >                         performance ways of doing this,
>             > > they're just more
>             > >                     > complex to implement.
>             > >                     >
>             > >                     >                         3) The lack of multi-queue
>             > > support in NBD
>             > >                     >
>             > >                     >                         Every I/O is funneled through a
>             > > single sockpair up to
>             > >                     > user space. That means
>             > >                     >                         there is locking going on. I
>             > > believe this is just a
>             > >                     > limitation of NBD today - it
>             > >                     >                         doesn't plug into the block-mq
>             > > stuff in the kernel and
>             > >                     > expose multiple
>             > >                     >                         sockpairs. But someone more
>             > > knowledgeable on the
>             > >                     > kernel stack would need to take
>             > >                     >                         a look.
>             > >                     >
>             > >                     >                         Thanks,
>             > >                     >                         Ben
>             > >                     >
>             > >                     >                         >
>             > >                     >                         > Couple of things that I am not
>             > > really sure in this
>             > >                     > flow is :- 1. How memory
>             > >                     >                         > registration is going to work
>             > > with RDMA driver.
>             > >                     >                         > 2. What changes are required
>             > > in spdk memory
>             > >                     > management
>             > >                     >                         >
>             > >                     >                         > Thanks
>             > >                     >                         > Rishabh Mittal
>             > >                     >
>             >
>             > >
>             > >
>             > >
>             >
>             > _______________________________________________
>             > SPDK mailing list
>             > SPDK(a)lists.01.org
>             > https://lists.01.org/mailman/listinfo/spdk
>
>
>
>             --
>             Regards
>             Huang Zhiteng
>
>
>
>
>
>


-- 
Regards
Huang Zhiteng


* Re: [SPDK] NBD with SPDK
@ 2019-09-06 20:31 Kadayam, Hari
  0 siblings, 0 replies; 32+ messages in thread
From: Kadayam, Hari @ 2019-09-06 20:31 UTC (permalink / raw)
  To: spdk


Kata containers have an additional layer of indirection compared to Docker, which potentially affects performance, right?

Also, the memory protection concern with virtio is valid, but we could possibly look at containing the accessible memory. In any case, I think an SPDK application wouldn't be accessing any buffer other than the IO buffers in that space.

On 9/6/19, 10:13 AM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

    I am summarizing all the options. We have three options:

    1. SPDK with NBD :- It needs a few optimizations in the SPDK-NBD path to reduce system call overhead. We can also explore using multiple sockets, since NBD supports that. One disadvantage is that there will be a bcopy for every IO from the kernel to SPDK (or vice versa for reads). The overhead of the bcopy compared to end-to-end latency is very low for a 4k workload, but we need to see its impact for larger read/write sizes.

    2. SPDK with virtio :- It doesn't require any changes in SPDK (assuming that the SPDK virtio target is written for performance), but we need a customized kernel module which can work with the SPDK virtio target. Its obvious advantage is that the kernel buffer cache will be shared with SPDK, so there will be no copy from kernel to SPDK. Another advantage is that there will be minimal system calls to ring the doorbell, since it will use a shared ring queue. My only concern here is that memory protection will be lost, as the entire kernel buffer space will be shared with SPDK.

    3. SPDK with Kata containers :- It doesn't require many changes (Xiaoxi can comment more on this). But our concern is that apps will not be moved to Kata containers, which will slow down its adoption rate.

    Please feel free to add pros/cons of any approach if I missed anything. It will help us to decide.
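
    (A rough sketch of one way to cut the per-IO syscall count mentioned in option 1: batch the NBD
    reply headers for completed writes and acknowledge them with a single writev(2). Only struct
    nbd_reply and NBD_REPLY_MAGIC come from linux/nbd.h; the helper itself and how it gets wired into
    the SPDK NBD code are hypothetical.)

        #include <string.h>
        #include <sys/uio.h>
        #include <arpa/inet.h>   /* htonl */
        #include <linux/nbd.h>

        #define MAX_BATCH 32

        /* Acknowledge up to MAX_BATCH completed write commands with one syscall
         * instead of one write(2) per command. 'handles' holds the 8-byte request
         * handles received from the kernel NBD driver. */
        static ssize_t
        nbd_flush_write_replies(int nbd_sock, const char handles[][8], int n)
        {
                struct nbd_reply replies[MAX_BATCH];
                struct iovec iov[MAX_BATCH];
                int i;

                if (n > MAX_BATCH) {
                        n = MAX_BATCH;
                }
                for (i = 0; i < n; i++) {
                        replies[i].magic = htonl(NBD_REPLY_MAGIC);
                        replies[i].error = 0;
                        memcpy(replies[i].handle, handles[i], sizeof(replies[i].handle));
                        iov[i].iov_base = &replies[i];
                        iov[i].iov_len = sizeof(replies[i]);
                }
                return writev(nbd_sock, iov, n);
        }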
    
    
    Thanks
    Rishabh Mittal
    
    On 9/5/19, 7:14 PM, "Szmyd, Brian" <bszmyd(a)ebay.com> wrote:
    
        I believe this option has the same number of copies, since you're still sharing the memory
        with the KATA VM kernel, not the application itself. This is an option that the development
        of a virtio-vhost-user driver does not prevent; it's merely an option to allow non-KATA
        containers to also use the same device.

        I will note that doing a virtio-vhost-user driver also allows one to project device types
        other than just block devices into the kernel device stack. One could also write a user application
        that exposed an input, network, console, gpu or socket device as well.

        Not that I have any interest in these...
        
        On 9/5/19, 8:08 PM, "Huang Zhiteng" <winston.d(a)gmail.com> wrote:
        
            Since this SPDK bdev is intended to be consumed by a user application
            running inside a container, we do have the possibility of running the user
            application inside a Kata container instead.  Kata containers do
            introduce a layer of IO virtualization; therefore we convert a user-space
            block device on the host to a kernel block device inside the VM, but
            with fewer memory copies than NBD thanks to SPDK vhost.  A Kata container
            might impose higher overhead than a plain container, but hopefully it's
            lightweight enough that the overhead is negligible.
            
            On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin
            <benjamin.walker(a)intel.com> wrote:
            >
            > On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
            > > Hi Paul,
            > >
            > > Rather than put the effort into a formalized document here is a brief
            > > description of the solution I have been investigating just to get an opinion
            > > of feasibility or even workability.
            > >
            > > Some background and a reiteration of the problem to set things up. I apologize
            > > if I reiterate anything or include details that some may already know.
            > >
            > > We are looking for a solution that allows us to write a custom bdev for the
            > > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
            > > have attached and then present that to our application as either a raw block
            > > device or filesystem mountpoint.
            > >
            > > This is normally (as I understand it) done by exposing a device via QEMU to
            > > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
            > > (among others) device according to this spec:
            > >
            > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
            > >
            > > The VM kernel then uses a virtio-scsi module to attach said device into its
            > > SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
            > >
            > > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
            > > pci driver to discover the virtio devices and bind them to the virtio-scsi
            > > driver. There really is no other way (other than platform MMIO type devices)
            > > to attach a device to the virtio-scsi device.
            > >
            > > SPDK exposes the virtio device to the VM via QEMU which has written a "user
            > > space" version of the vhost bus. This driver then translates the API into the
            > > virtio-pci specification:
            > >
            > > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
            > >
            > > This uses an eventfd descriptor for interrupting the non-polling side of the
            > > queue and a UNIX domain socket to setup (and control) the shared memory which
            > > contains the I/O buffers and virtio queues. This is documented in SPDK's own
            > > documentation and diagramed here:
            > >
            > > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
            > >
            > > If we could implement this vhost-user QEMU target as a virtio driver in the
            > > kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
            > > into the host kernel as a virtio device and enumerated in the /dev/sd[a-z]+
            > > tree for our containers to bind. Attached is draft block diagram.
            >
            > If you think of QEMU as just another user-space process, and the SPDK vhost
            > target as a user-space process, then it's clear that vhost-user is simply a
            > cross-process IPC mechanism based on shared memory. The "shared memory" part is
            > the critical part of that description - QEMU pre-registers all of the memory
            > that will be used for I/O buffers (in fact, all of the memory that is mapped
            > into the guest) with the SPDK process by sending fds across a Unix domain
            > socket.
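
For reference, here is a minimal sketch of the fd-passing mechanism described above, assuming memfd_create(2) plus SCM_RIGHTS over a Unix domain socket. It is illustrative only, not the actual QEMU or SPDK code, and all names are made up; the receiving side would recvmsg() the fd and mmap() the same region.

    /* Sketch: share an I/O buffer region with another process by passing a
     * memfd over a Unix domain socket, the same basic mechanism vhost-user
     * uses for its memory table. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static int send_shared_region(int unix_sock, size_t len)
    {
        int memfd = syscall(SYS_memfd_create, "io-buffers", 0);
        if (memfd < 0 || ftruncate(memfd, len) < 0)
            return -1;

        /* Map it locally; the peer mmap()s the same fd after receiving it. */
        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
        if (base == MAP_FAILED)
            return -1;

        /* Send the fd as ancillary data (SCM_RIGHTS). */
        struct msghdr msg = {0};
        struct iovec iov = { .iov_base = &len, .iov_len = sizeof(len) };
        char cbuf[CMSG_SPACE(sizeof(int))];
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &memfd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }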
            >
            > If you move this code into the kernel, you have to solve two issues:
            >
            > 1) What memory is it registering with the SPDK process? The kernel driver has no
            > idea which application process may route I/O to it - in fact the application
            > process may not even exist yet - so it isn't memory allocated to the application
            > process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
            > process, and when the application process performs I/O the kernel copies into
            > those buffers prior to telling SPDK about them? That would work, but now you're
            > back to doing a data copy. I do think you can get it down to 1 data copy instead
            > of 2 with a scheme like this.
            >
            > 2) One of the big performance problems you're seeing is syscall overhead in NBD.
            > If you still have a kernel block device that routes messages up to the SPDK
            > process, the application process is making the same syscalls because it's still
            > interacting with a block device in the kernel, but you're right that the backend
            > SPDK implementation could be polling on shared memory rings and potentially run
            > more efficiently.
            >
            > >
            > > Since we will not have a real bus to signal for the driver to probe for new
            > > devices we can use a sysfs interface for the application to notify the driver
            > > of a new socket and eventfd pair to setup a new virtio-scsi instance.
            > > Otherwise the design simply moves the vhost-user driver from the QEMU
            > > application into the Host kernel itself.
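
A hedged sketch of what that sysfs hand-off could look like on the kernel side, assuming the standard kobject/sysfs interfaces; the module and attribute names are hypothetical, and a real driver would fget() the two fds and build a virtio-scsi instance around them rather than just logging them:

    /* Sketch only (no such driver exists today): userspace writes
     * "<socket_fd> <event_fd>" to /sys/kernel/vhost_virtio/attach to ask the
     * driver to set up a new virtio-scsi instance. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    static struct kobject *vhost_kobj;

    static ssize_t attach_store(struct kobject *kobj, struct kobj_attribute *attr,
                                const char *buf, size_t count)
    {
        int sock_fd, evt_fd;

        if (sscanf(buf, "%d %d", &sock_fd, &evt_fd) != 2)
            return -EINVAL;

        /* A real driver would fget() both fds here (they are valid in the
         * writing process's context) and spin up a new virtio-scsi instance. */
        pr_info("vhost_virtio: attach sock=%d eventfd=%d\n", sock_fd, evt_fd);
        return count;
    }

    static struct kobj_attribute attach_attr = __ATTR_WO(attach);

    static int __init vhost_virtio_init(void)
    {
        vhost_kobj = kobject_create_and_add("vhost_virtio", kernel_kobj);
        if (!vhost_kobj)
            return -ENOMEM;
        return sysfs_create_file(vhost_kobj, &attach_attr.attr);
    }

    static void __exit vhost_virtio_exit(void)
    {
        sysfs_remove_file(vhost_kobj, &attach_attr.attr);
        kobject_put(vhost_kobj);
    }

    module_init(vhost_virtio_init);
    module_exit(vhost_virtio_exit);
    MODULE_LICENSE("GPL");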
            > >
            > > It's my understanding that this will avoid a lot more system calls and copies
            > > compared to exposing an iSCSI device or NBD device as we're currently
            > > discussing. Does this seem feasible?
            >
            > What you really want is a "block device in user space" solution that's higher
            > performance than NBD, and while that's been tried many, many times in the past I
            > do think there is a great opportunity here for someone. I'm not sure that the
            > interface between the block device process and the kernel is best done as a
            > modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
            > to throw in a third option to consider - use NVMe queues in shared memory as the
            > interface instead. The NVMe queues are going to be much more efficient than
            > virtqueues for storage commands.
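
To make that third option a bit more concrete, here is an illustrative sketch of NVMe-style queues in shared memory. The 64-byte submission entry layout follows the NVMe specification; the queue wrapper struct is purely an assumption about how the shared region might be organized, not existing code.

    #include <stdint.h>

    struct nvme_sqe {                /* 64-byte NVMe submission queue entry */
        uint8_t  opc;                /* opcode, e.g. 0x01 write, 0x02 read  */
        uint8_t  fuse_psdt;          /* fused op + PRP/SGL selector         */
        uint16_t cid;                /* command identifier                  */
        uint32_t nsid;               /* namespace id                        */
        uint64_t rsvd;               /* CDW2-3 (reserved for I/O commands)  */
        uint64_t mptr;               /* metadata pointer                    */
        uint64_t prp1;               /* data pointer 1                      */
        uint64_t prp2;               /* data pointer 2                      */
        uint32_t cdw10;              /* starting LBA (low) for read/write   */
        uint32_t cdw11;              /* starting LBA (high)                 */
        uint32_t cdw12;              /* number of logical blocks - 1        */
        uint32_t cdw13;
        uint32_t cdw14;
        uint32_t cdw15;
    };

    struct shm_nvme_sq {             /* lives in the shared region          */
        volatile uint32_t tail;      /* producer (kernel block driver) side */
        volatile uint32_t head;      /* consumer (SPDK poller) side         */
        uint32_t          depth;     /* power of two                        */
        struct nvme_sqe   sqe[];     /* 'depth' entries follow              */
    };

The SPDK side would simply poll tail != head instead of waiting on a doorbell or eventfd.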
            >
            > >
            > > Thanks,
            > > Brian
            > >
            > > On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
            > >
            > >     Hi Paul.
            > >
            > >     Thanks for investigating it.
            > >
            > >     We have one more idea floating around. Brian is going to send you a
            > > proposal shortly. If the other proposal seems feasible to you, then we can evaluate
            > > the work required for both proposals.
            > >
            > >     Thanks
            > >     Rishabh Mittal
            > >
            > >     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
            > >
            > >         Hi,
            > >
            > >         So I was able to perform the same steps here and I think one of the
            > > keys to really seeing what's going on is to start perftop like this:
            > >
            > >          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
            > > sorting on command, shared object and symbol
            > >
            > >         Attached are 2 snapshots, one with a NULL back end for nbd and one
            > > with libaio/nvme.  Some notes after chatting with Ben a bit, please read
            > > through and let us know what you think:
            > >
            > >         * in both cases the vast majority of the highest overhead activities
            > > are kernel
            > >         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
            > > other as well but you have to scroll way down to see it) is the
            > > user/kernel space copy; nothing SPDK can do about that
            > >         * the syscalls that dominate in both cases are likely something that
            > > can be improved on by changing how SPDK interacts with nbd. Ben had a couple
            > > of ideas including (a) using libaio to interact with the nbd fd as opposed to
            > > interacting with the nbd socket, (b) "batching" wherever possible, for example
            > > on writes to nbd investigate not ack'ing them until some number have completed
            > > (a sketch of the batching idea follows this list)
            > >         * the kernel slab* commands are likely nbd kernel driver
            > > allocations/frees in the IO path, one possibility would be to look at
            > > optimizing the nbd kernel driver for this one
            > >         * the libc item on the NULL chart also shows up on the libaio profile
            > > however is again way down the scroll so it didn't make the screenshot :)  This
            > > could be a zeroing of something somewhere in the SPDK nbd driver
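
As referenced above, a minimal sketch of the batching idea in (b), assuming the standard 16-byte NBD simple reply header; this is not current SPDK code, just an illustration of replacing one write() per completed command with a single writev() per batch:

    #include <stdint.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <arpa/inet.h>

    #define NBD_REPLY_MAGIC 0x67446698
    #define BATCH_MAX       32

    struct nbd_simple_reply {            /* mirrors the on-wire NBD reply   */
        uint32_t magic;
        uint32_t error;
        char     handle[8];
    };

    struct reply_batch {
        struct nbd_simple_reply replies[BATCH_MAX];
        struct iovec            iov[BATCH_MAX];
        int                     count;
    };

    /* Caller flushes when the batch is full or the poller goes idle. */
    static void batch_add(struct reply_batch *b, const char handle[8], uint32_t err)
    {
        struct nbd_simple_reply *r = &b->replies[b->count];

        r->magic = htonl(NBD_REPLY_MAGIC);
        r->error = htonl(err);
        memcpy(r->handle, handle, 8);
        b->iov[b->count].iov_base = r;
        b->iov[b->count].iov_len  = sizeof(*r);
        b->count++;
    }

    static int batch_flush(struct reply_batch *b, int nbd_sock)
    {
        ssize_t rc;

        if (b->count == 0)
            return 0;
        /* One syscall for the whole batch; a real implementation also needs
         * to handle short writes and EAGAIN on a nonblocking socket. */
        rc = writev(nbd_sock, b->iov, b->count);
        b->count = 0;
        return rc < 0 ? -1 : 0;
    }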
            > >
            > >         It looks like this data supports what Ben had suspected a while back,
            > > much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
            > > you think and if you want to explore any of the ideas above any further or see
            > > something else in the data that looks worthy to note.
            > >
            > >         Thx
            > >         Paul
            > >
            > >
            > >
            > >         -----Original Message-----
            > >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
            > > E
            > >         Sent: Wednesday, September 4, 2019 4:27 PM
            > >         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
            > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
            > > spdk(a)lists.01.org
            > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
            > > Kadayam, Hari <hkadayam(a)ebay.com>
            > >         Subject: Re: [SPDK] NBD with SPDK
            > >
            > >         Cool, thanks for sending this.  I will try and repro tomorrow here and
            > > see what kind of results I get
            > >
            > >         Thx
            > >         Paul
            > >
            > >         -----Original Message-----
            > >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
            > >         Sent: Wednesday, September 4, 2019 4:23 PM
            > >         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
            > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
            > > spdk(a)lists.01.org
            > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
            > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
            > >         Subject: Re: [SPDK] NBD with SPDK
            > >
            > >         Avg CPU utilization is very low when I am running this.
            > >
            > >         09/04/2019 04:21:40 PM
            > >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            > >                    2.59    0.00    2.57    0.00    0.00   94.84
            > >
            > >         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
            > > rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
            > >         sda              0.00    0.20      0.00      0.80     0.00     0.00
            > > 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
            > >         sdb              0.00    0.00      0.00      0.00     0.00     0.00
            > > 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
            > >         sdc              0.00 28846.80      0.00 191555.20     0.00
            > > 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
            > >         nb0              0.00 47297.00      0.00
            > > 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00
            > > 4.05   0
            > >
            > >
            > >
            > >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
            > >
            > >             I am using this command
            > >
            > >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
            > > rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
            > > runtime 120 --time_based --group_reporting
            > >
            > >             I have created the device by using these commands
            > >               1.  ./root/spdk/app/vhost
            > >               2.  ./rpc.py bdev_aio_create /dev/sdc aio0
            > >               3.  ./rpc.py start_nbd_disk aio0 /dev/nbd0
            > >
            > >             I am using  "perf top"  to get the performance
            > >
            > >             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
            > >
            > >                 Hi Rishabh,
            > >
            > >                 Maybe it would help (me at least) if you described the
            > > complete & exact steps for your test - both setup of the env & test and
            > > command to profile.  Can you send that out?
            > >
            > >                 Thx
            > >                 Paul
            > >
            > >                 -----Original Message-----
            > >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
            > >                 Sent: Wednesday, September 4, 2019 2:45 PM
            > >                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
            > > James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
            > > paul.e.luse(a)intel.com>
            > >                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
            > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
            > >                 Subject: Re: [SPDK] NBD with SPDK
            > >
            > >                 Yes, I am using 64 q depth with one thread in fio. I am using
            > > AIO. This profiling is for the entire system. I don't know why spdk threads
            > > are idle.
            > >
            > >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
            > > benjamin.walker(a)intel.com> wrote:
            > >
            > >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
            > >                     > I got the run again. It is with 4k write.
            > >                     >
            > >                     > 13.16%  vhost                       [.]
            > >                     >
            > > spdk_ring_dequeue
            > >                     >
            > >                     >    6.08%  vhost                       [.]
            > >                     >
            > > rte_rdtsc
            > >                     >
            > >                     >    4.77%  vhost                       [.]
            > >                     >
            > > spdk_thread_poll
            > >                     >
            > >                     >    2.85%  vhost                       [.]
            > >                     >
            > > _spdk_reactor_run
            > >                     >
            > >
            > >                     You're doing high queue depth for at least 30 seconds
            > > while the trace runs,
            > >                     right? Using fio with the libaio engine on the NBD device
            > > is probably the way to
            > >                     go. Are you limiting the profiling to just the core where
            > > the main SPDK process
            > >                     is pinned? I'm asking because SPDK still appears to be
            > > mostly idle, and I
            > >                     suspect the time is being spent in some other thread (in
            > > the kernel). Consider
            > >                     capturing a profile for the entire system. It will have
            > > fio stuff in it, but the
            > >                     expensive stuff still should generally bubble up to the
            > > top.
            > >
            > >                     Thanks,
            > >                     Ben
            > >
            > >
            > >                     >
            > >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
            > > rimittal(a)ebay.com> wrote:
            > >                     >
            > >                     >     I got the profile with first run.
            > >                     >
            > >                     >       27.91%  vhost                       [.]
            > >                     >
            > > spdk_ring_dequeue
            > >                     >
            > >                     >       12.94%  vhost                       [.]
            > >                     >
            > > rte_rdtsc
            > >                     >
            > >                     >       11.00%  vhost                       [.]
            > >                     >
            > > spdk_thread_poll
            > >                     >
            > >                     >        6.15%  vhost                       [.]
            > >                     >
            > > _spdk_reactor_run
            > >                     >
            > >                     >        4.35%  [kernel]                    [k]
            > >                     >
            > > syscall_return_via_sysret
            > >                     >
            > >                     >        3.91%  vhost                       [.]
            > >                     >
            > > _spdk_msg_queue_run_batch
            > >                     >
            > >                     >        3.38%  vhost                       [.]
            > >                     >
            > > _spdk_event_queue_run_batch
            > >                     >
            > >                     >        2.83%  [unknown]                   [k]
            > >                     >
            > > 0xfffffe000000601b
            > >                     >
            > >                     >        1.45%  vhost                       [.]
            > >                     >
            > > spdk_thread_get_from_ctx
            > >                     >
            > >                     >        1.20%  [kernel]                    [k]
            > >                     >
            > > __fget
            > >                     >
            > >                     >        1.14%  libpthread-2.27.so          [.]
            > >                     >
            > > __libc_read
            > >                     >
            > >                     >        1.00%  libc-2.27.so                [.]
            > >                     >
            > > 0x000000000018ef76
            > >                     >
            > >                     >        0.99%  libc-2.27.so                [.]
            > > 0x000000000018ef79
            > >                     >
            > >                     >     Thanks
            > >                     >     Rishabh Mittal
            > >                     >
            > >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
            > > paul.e.luse(a)intel.com> wrote:
            > >                     >
            > >                     >         That's great.  Keep an eye out for the items
            > > Ben mentions below - at
            > >                     > least the first one should be quick to implement and
            > > compare both profile data
            > >                     > and measured performance.
            > >                     >
            > >                     >         Don't forget about the community meetings
            > > either, great place to chat
            > >                     > about these kinds of things.
            > >                     >
            > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Chkadayam%40ebay.com%7Cb65b1ff6f6cc470400d608d732ed8087%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033868045644168&amp;sdata=EMsizfalT%2FNT885h48%2FRgiefp0AN%2BKyYKBsQnhzn5IA%3D&amp;reserved=0
            > >                     >   Next one is tomorrow morn US time.
            > >                     >
            > >                     >         Thx
            > >                     >         Paul
            > >                     >
            > >                     >         -----Original Message-----
            > >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
            > > Behalf Of Mittal,
            > >                     > Rishabh via SPDK
            > >                     >         Sent: Thursday, August 15, 2019 6:50 PM
            > >                     >         To: Harris, James R <james.r.harris(a)intel.com>;
            > > Walker, Benjamin <
            > >                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
            > >                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
            > > Xiaoxi <
            > >                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
            > > Kadayam, Hari <
            > >                     > hkadayam(a)ebay.com>
            > >                     >         Subject: Re: [SPDK] NBD with SPDK
            > >                     >
            > >                     >         Thanks. I will get the profiling by next week.
            > >                     >
            > >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
            > > james.r.harris(a)intel.com>
            > >                     > wrote:
            > >                     >
            > >                     >
            > >                     >
            > >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
            > > rimittal(a)ebay.com> wrote:
            > >                     >
            > >                     >                 Hi Jim
            > >                     >
            > >                     >                 What tool you use to take profiling.
            > >                     >
            > >                     >             Hi Rishabh,
            > >                     >
            > >                     >             Mostly I just use "perf top".
            > >                     >
            > >                     >             -Jim
            > >                     >
            > >                     >
            > >                     >                 Thanks
            > >                     >                 Rishabh Mittal
            > >                     >
            > >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
            > >                     > james.r.harris(a)intel.com> wrote:
            > >                     >
            > >                     >
            > >                     >
            > >                     >                     On 8/14/19, 9:18 AM, "Walker,
            > > Benjamin" <
            > >                     > benjamin.walker(a)intel.com> wrote:
            > >                     >
            > >                     >                     <trim>
            > >                     >
            > >                     >                         When an I/O is performed in the
            > > process initiating the
            > >                     > I/O to a file, the data
            > >                     >                         goes into the OS page cache
            > > buffers at a layer far
            > >                     > above the bio stack
            > >                     >                         (somewhere up in VFS). If SPDK
            > > were to reserve some
            > >                     > memory and hand it off to
            > >                     >                         your kernel driver, your kernel
            > > driver would still
            > >                     > need to copy it to that
            > >                     >                         location out of the page cache
            > > buffers. We can't
            > >                     > safely share the page cache
            > >                     >                         buffers with a user space
            > > process.
            > >                     >
            > >                     >                     I think Rishabh was suggesting the
            > > SPDK reserve the
            > >                     > virtual address space only.
            > >                     >                     Then the kernel could map the page
            > > cache buffers into that
            > >                     > virtual address space.
            > >                     >                     That would not require a data copy,
            > > but would require the
            > >                     > mapping operations.
            > >                     >
            > >                     >                     I think the profiling data would be
            > > really helpful - to
            > >                     > quantify how much of the 50us
            > >                     >                     Is due to copying the 4KB of
            > > data.  That can help drive
            > >                     > next steps on how to optimize
            > >                     >                     the SPDK NBD module.
            > >                     >
            > >                     >                     Thanks,
            > >                     >
            > >                     >                     -Jim
            > >                     >
            > >                     >
            > >                     >                         As Paul said, I'm skeptical that
            > > the memcpy is
            > >                     > significant in the overall
            > >                     >                         performance you're measuring. I
            > > encourage you to go
            > >                     > look at some profiling data
            > >                     >                         and confirm that the memcpy is
            > > really showing up. I
            > >                     > suspect the overhead is
            > >                     >                         instead primarily in these
            > > spots:
            > >                     >
            > >                     >                         1) Dynamic buffer allocation in
            > > the SPDK NBD backend.
            > >                     >
            > >                     >                         As Paul indicated, the NBD
            > > target is dynamically
            > >                     > allocating memory for each I/O.
            > >                     >                         The NBD backend wasn't designed
            > > to be fast - it was
            > >                     > designed to be simple.
            > >                     >                         Pooling would be a lot faster
            > > and is something fairly
            > >                     > easy to implement (see the pooling sketch after point 3 below).
            > >                     >
            > >                     >                         2) The way SPDK does the
            > > syscalls when it implements
            > >                     > the NBD backend.
            > >                     >
            > >                     >                         Again, the code was designed to
            > > be simple, not high
            > >                     > performance. It simply calls
            > >                     >                         read() and write() on the socket
            > > for each command.
            > >                     > There are much higher
            > >                     >                         performance ways of doing this,
            > > they're just more
            > >                     > complex to implement.
            > >                     >
            > >                     >                         3) The lack of multi-queue
            > > support in NBD
            > >                     >
            > >                     >                         Every I/O is funneled through a
            > > single sockpair up to
            > >                     > user space. That means
            > >                     >                         there is locking going on. I
            > > believe this is just a
            > >                     > limitation of NBD today - it
            > >                     >                         doesn't plug into the block-mq
            > > stuff in the kernel and
            > >                     > expose multiple
            > >                     >                         sockpairs. But someone more
            > > knowledgeable on the
            > >                     > kernel stack would need to take
            > >                     >                         a look.
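
As referenced in point 1, a rough sketch of the pooling idea, assuming the spdk_mempool API from spdk/env.h; the pool name, sizes and helpers here are illustrative, not the actual nbd module code:

    #include "spdk/env.h"

    #define NBD_IO_POOL_SIZE 512
    #define NBD_MAX_IO_SIZE  (128 * 1024)

    static struct spdk_mempool *g_nbd_io_pool;

    static int nbd_io_pool_init(void)
    {
        g_nbd_io_pool = spdk_mempool_create("nbd_io_buf", NBD_IO_POOL_SIZE,
                                            NBD_MAX_IO_SIZE,
                                            SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                            SPDK_ENV_SOCKET_ID_ANY);
        return g_nbd_io_pool ? 0 : -1;
    }

    static void *nbd_io_buf_get(void)
    {
        /* O(1) from a per-core cache; replaces a malloc() in the I/O path. */
        return spdk_mempool_get(g_nbd_io_pool);
    }

    static void nbd_io_buf_put(void *buf)
    {
        spdk_mempool_put(g_nbd_io_pool, buf);
    }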
            > >                     >
            > >                     >                         Thanks,
            > >                     >                         Ben
            > >                     >
            > >                     >                         >
            > >                     >                         > Couple of things that I am not
            > > really sure in this
            > >                     > flow is :- 1. How memory
            > >                     >                         > registration is going to work
            > > with RDMA driver.
            > >                     >                         > 2. What changes are required
            > > in spdk memory
            > >                     > management
            > >                     >                         >
            > >                     >                         > Thanks
            > >                     >                         > Rishabh Mittal
            > >                     >
            >
            > >
            > >
            > >
            >
            > _______________________________________________
            > SPDK mailing list
            > SPDK(a)lists.01.org
            > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Chkadayam%40ebay.com%7Cb65b1ff6f6cc470400d608d732ed8087%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033868045644168&amp;sdata=or%2FOWQA3mTPiixZcHJqPizjOMNreQoIcDK8ZZ5A4Goo%3D&amp;reserved=0
            
            
            
            -- 
            Regards
            Huang Zhiteng
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-06 17:13 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-09-06 17:13 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 33677 bytes --]

I am summarizing all the options. We have three options:

1.  SPDK with NBD :- It needs a few optimizations in the SPDK NBD path to reduce system call overhead. We can also explore using multiple sockets, since NBD supports that. One disadvantage is that there will be a bcopy for every I/O from the kernel to SPDK (or vice versa for reads). The bcopy overhead relative to end-to-end latency is very low for a 4k workload, but we need to see its impact at larger read/write sizes (see the measurement sketch after this list).

2. SPDK with virtio :- It doesn't require any changes in SPDK (assuming the SPDK virtio target is written for performance), but we need a customized kernel module that can work with the SPDK virtio target. Its obvious advantage is that the kernel buffer cache will be shared with SPDK, so there will be no copy from the kernel to SPDK. Another advantage is that there will be minimal system calls to ring the doorbell, since it will use a shared ring queue (see the doorbell sketch after this list). My only concern here is that memory protection will be lost, as the entire set of kernel buffers will be shared with SPDK.

3. SPDK with Kata containers :- It doesn't require many changes (Xiaoxi can comment more on this), but our concern is that apps will not be moved to Kata containers, which will slow down its adoption rate.
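
Two small illustrative sketches for the options above; neither is existing SPDK code and all names are made up.

For option 1, a stand-alone micro-benchmark to gauge the bcopy cost at 4k versus a larger I/O size (the src byte-poke just keeps the compiler from eliding the loop):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static double copy_ns_per_io(size_t io_size, int iters)
    {
        char *src = malloc(io_size), *dst = malloc(io_size);
        struct timespec t0, t1;

        memset(src, 0xa5, io_size);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            memcpy(dst, src, io_size);
            src[i % io_size] ^= 1;   /* defeat dead-store elimination */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        free(src);
        free(dst);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / iters;
    }

    int main(void)
    {
        printf("4k copy:   %.0f ns\n", copy_ns_per_io(4096, 100000));
        printf("128k copy: %.0f ns\n", copy_ns_per_io(128 * 1024, 100000));
        return 0;
    }

For option 2, a minimal sketch of the eventfd doorbell, assuming the ring itself lives in memory mapped by both sides:

    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int make_doorbell(void)
    {
        /* Non-semaphore eventfd: reads collapse any number of kicks into one. */
        return eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    }

    void ring_doorbell(int efd)
    {
        uint64_t one = 1;
        /* One syscall per kick; with batching, one kick can cover many I/Os. */
        (void)write(efd, &one, sizeof(one));
    }

    uint64_t drain_doorbell(int efd)
    {
        uint64_t kicks = 0;
        /* Returns 0 when nothing is pending (nonblocking read gives EAGAIN). */
        if (read(efd, &kicks, sizeof(kicks)) != (ssize_t)sizeof(kicks))
            return 0;
        return kicks;
    }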

Please feel free to add pros/cons of any approach if I missed anything. It will help us decide.


Thanks
Rishabh Mittal

On 9/5/19, 7:14 PM, "Szmyd, Brian" <bszmyd(a)ebay.com> wrote:

    I believe this option has the same number of copies, since you're still sharing the memory
    with the Kata VM kernel, not the application itself. This is an option that the development
    of a virtio-vhost-user driver does not prevent; it's merely an option to allow non-Kata
    containers to also use the same device.
    
    I will note that doing a virtio-vhost-user driver also allows one to project device types other
    than just block devices into the kernel device stack. One could also write a user application
    that exposes an input, network, console, GPU or socket device as well.
    
    Not that I have any interest in these... __
    
    On 9/5/19, 8:08 PM, "Huang Zhiteng" <winston.d(a)gmail.com> wrote:
    
        Since this SPDK bdev is intended to be consumed by a user application
        running inside a container, we do have the possibility of running the user
        application inside a Kata container instead.  A Kata container does
        introduce a layer of I/O virtualization, so we convert a user-space
        block device on the host to a kernel block device inside the VM, but
        with fewer memory copies than NBD thanks to SPDK vhost.  A Kata container
        might impose higher overhead than a plain container, but hopefully it's
        lightweight enough that the overhead is negligible.
        
        On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin
        <benjamin.walker(a)intel.com> wrote:
        >
        > On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
        > > Hi Paul,
        > >
        > > Rather than put the effort into a formalized document here is a brief
        > > description of the solution I have been investigating just to get an opinion
        > > of feasibility or even workability.
        > >
        > > Some background and a reiteration of the problem to set things up. I apologize
        > > to reiterate anything and to include details that some may already know.
        > >
        > > We are looking for a solution that allows us to write a custom bdev for the
        > > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
        > > have attached and then present that to our application as either a raw block
        > > device or filesystem mountpoint.
        > >
        > > This is normally (as I understand it) done by exposing a device via QEMU to
        > > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
        > > (among others) device according to this spec:
        > >
        > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdocs.oasis-open.org%2Fvirtio%2Fvirtio%2Fv1.1%2Fcsprd01%2Fvirtio-v1.1-csprd01.html%23x1-8300021&amp;data=02%7C01%7Crimittal%40ebay.com%7Ceebabb3a2eff4e1dc47108d7326feaca%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033328572563931&amp;sdata=Rb6jc8GEqDasm%2FNpWPpPozlFSfwHumutQQ0P9r28ysw%3D&amp;reserved=0
        > >
        > > The VM kernel then uses a virtio-scsi module to attach said device into its
        > > SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
        > >
        > > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
        > > pci driver to discover the virtio devices and bind them to the virtio-scsi
        > > driver. There really is no other way (other than platform MMIO type devices)
        > > to attach a device to the virtio-scsi device.
        > >
        > > SPDK exposes the virtio device to the VM via QEMU which has written a "user
        > > space" version of the vhost bus. This driver then translates the API into the
        > > virtio-pci specification:
        > >
        > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fqemu%2Fqemu%2Fblob%2F5d0e5694470d2952b4f257bc985cac8c89b4fd92%2Fdocs%2Finterop%2Fvhost-user.rst&amp;data=02%7C01%7Crimittal%40ebay.com%7Ceebabb3a2eff4e1dc47108d7326feaca%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033328572563931&amp;sdata=cfFYWklYCtQog7oi6cpw93490%2F1UwTM1qwZghWnuu%2FU%3D&amp;reserved=0
        > >
        > > This uses an eventfd descriptor for interrupting the non-polling side of the
        > > queue and a UNIX domain socket to setup (and control) the shared memory which
        > > contains the I/O buffers and virtio queues. This is documented in SPDKs own
        > > documentation and diagramed here:
        > >
        > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fspdk%2Fspdk%2Fblob%2F01103b2e4dfdcf23cc2125164aa116394c8185e8%2Fdoc%2Fvhost_processing.md&amp;data=02%7C01%7Crimittal%40ebay.com%7Ceebabb3a2eff4e1dc47108d7326feaca%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033328572563931&amp;sdata=hVNgYqwWUl6y61MibZ4K0tJr%2FEIMgVldx8FIb0WgyXE%3D&amp;reserved=0
        > >
        > > If we could implement this vhost-user QEMU target as a virtio driver in the
        > > kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
        > > into the host kernel as a virtio device and enumerated in the /dev/sd[a-z]+
        > > tree for our containers to bind. Attached is draft block diagram.
        >
        > If you think of QEMU as just another user-space process, and the SPDK vhost
        > target as a user-space process, then it's clear that vhost-user is simply a
        > cross-process IPC mechanism based on shared memory. The "shared memory" part is
        > the critical part of that description - QEMU pre-registers all of the memory
        > that will be used for I/O buffers (in fact, all of the memory that is mapped
        > into the guest) with the SPDK process by sending fds across a Unix domain
        > socket.
        >
        > If you move this code into the kernel, you have to solve two issues:
        >
        > 1) What memory is it registering with the SPDK process? The kernel driver has no
        > idea which application process may route I/O to it - in fact the application
        > process may not even exist yet - so it isn't memory allocated to the application
        > process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
        > process, and when the application process performs I/O the kernel copies into
        > those buffers prior to telling SPDK about them? That would work, but now you're
        > back to doing a data copy. I do think you can get it down to 1 data copy instead
        > of 2 with a scheme like this.
        >
        > 2) One of the big performance problems you're seeing is syscall overhead in NBD.
        > If you still have a kernel block device that routes messages up to the SPDK
        > process, the application process is making the same syscalls because it's still
        > interacting with a block device in the kernel, but you're right that the backend
        > SPDK implementation could be polling on shared memory rings and potentially run
        > more efficiently.
        >
        > >
        > > Since we will not have a real bus to signal for the driver to probe for new
        > > devices we can use a sysfs interface for the application to notify the driver
        > > of a new socket and eventfd pair to setup a new virtio-scsi instance.
        > > Otherwise the design simply moves the vhost-user driver from the QEMU
        > > application into the Host kernel itself.
        > >
        > > It's my understanding that this will avoid a lot more system calls and copies
        > > compared to exposing an iSCSI device or NBD device as we're currently
        > > discussing. Does this seem feasible?
        >
        > What you really want is a "block device in user space" solution that's higher
        > performance than NBD, and while that's been tried many, many times in the past I
        > do think there is a great opportunity here for someone. I'm not sure that the
        > interface between the block device process and the kernel is best done as a
        > modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
        > to throw in a third option to consider - use NVMe queues in shared memory as the
        > interface instead. The NVMe queues are going to be much more efficient than
        > virtqueues for storage commands.
        >
        > >
        > > Thanks,
        > > Brian
        > >
        > > On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
        > >
        > >     Hi Paul.
        > >
        > >     Thanks for investigating it.
        > >
        > >     We have one more idea floating around. Brian is going to send you a
        > > proposal shortly. If the other proposal seems feasible to you, then we can evaluate
        > > the work required for both proposals.
        > >
        > >     Thanks
        > >     Rishabh Mittal
        > >
        > >     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
        > >
        > >         Hi,
        > >
        > >         So I was able to perform the same steps here and I think one of the
        > > keys to really seeing what's going on is to start perftop like this:
        > >
        > >          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
        > > sorting on command, shared object and symbol
        > >
        > >         Attached are 2 snapshots, one with a NULL back end for nbd and one
        > > with libaio/nvme.  Some notes after chatting with Ben a bit, please read
        > > through and let us know what you think:
        > >
        > >         * in both cases the vast majority of the highest overhead activities
        > > are kernel
        > >         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
        > > other as well but you have to scroll way down to see it) is the
        > > user/kernel space copy; nothing SPDK can do about that
        > >         * the syscalls that dominate in both cases are likely something that
        > > can be improved on by changing how SPDK interacts with nbd. Ben had a couple
        > > of ideas including (a) using libaio to interact with the nbd fd as opposed to
        > > interacting with the nbd socket, (b) "batching" wherever possible, for example
        > > on writes to nbd investigate not ack'ing them until some number have completed
        > >         * the kernel slab* commands are likely nbd kernel driver
        > > allocations/frees in the IO path, one possibility would be to look at
        > > optimizing the nbd kernel driver for this one
        > >         * the libc item on the NULL chart also shows up on the libaio profile
        > > however is again way down the scroll so it didn't make the screenshot :)  This
        > > could be a zeroing of something somewhere in the SPDK nbd driver
        > >
        > >         It looks like this data supports what Ben had suspected a while back,
        > > much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
        > > you think and if you want to explore any of the ideas above any further or see
        > > something else in the data that looks worthy to note.
        > >
        > >         Thx
        > >         Paul
        > >
        > >
        > >
        > >         -----Original Message-----
        > >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
        > > E
        > >         Sent: Wednesday, September 4, 2019 4:27 PM
        > >         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
        > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
        > > spdk(a)lists.01.org
        > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
        > > Kadayam, Hari <hkadayam(a)ebay.com>
        > >         Subject: Re: [SPDK] NBD with SPDK
        > >
        > >         Cool, thanks for sending this.  I will try and repro tomorrow here and
        > > see what kind of results I get
        > >
        > >         Thx
        > >         Paul
        > >
        > >         -----Original Message-----
        > >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
        > >         Sent: Wednesday, September 4, 2019 4:23 PM
        > >         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
        > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
        > > spdk(a)lists.01.org
        > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
        > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
        > >         Subject: Re: [SPDK] NBD with SPDK
        > >
        > >         Avg CPU utilization is very low when I am running this.
        > >
        > >         09/04/2019 04:21:40 PM
        > >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
        > >                    2.59    0.00    2.57    0.00    0.00   94.84
        > >
        > >         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
        > > rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
        > >         sda              0.00    0.20      0.00      0.80     0.00     0.00
        > > 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
        > >         sdb              0.00    0.00      0.00      0.00     0.00     0.00
        > > 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
        > >         sdc              0.00 28846.80      0.00 191555.20     0.00
        > > 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
        > >         nb0              0.00 47297.00      0.00
        > > 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00
        > > 4.05   0
        > >
        > >
        > >
        > >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
        > >
        > >             I am using this command
        > >
        > >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
        > > rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
        > > runtime 120 --time_based --group_reporting
        > >
        > >             I have created the device by using these commands
        > >               1.  ./root/spdk/app/vhost
        > >               2.  ./rpc.py bdev_aio_create /dev/sdc aio0
        > >               3.  ./rpc.py start_nbd_disk aio0 /dev/nbd0
        > >
        > >             I am using  "perf top"  to get the performance
        > >
        > >             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
        > >
        > >                 Hi Rishabh,
        > >
        > >                 Maybe it would help (me at least) if you described the
        > > complete & exact steps for your test - both setup of the env & test and
        > > command to profile.  Can you send that out?
        > >
        > >                 Thx
        > >                 Paul
        > >
        > >                 -----Original Message-----
        > >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
        > >                 Sent: Wednesday, September 4, 2019 2:45 PM
        > >                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
        > > James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
        > > paul.e.luse(a)intel.com>
        > >                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
        > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
        > >                 Subject: Re: [SPDK] NBD with SPDK
        > >
        > >                 Yes, I am using 64 q depth with one thread in fio. I am using
        > > AIO. This profiling is for the entire system. I don't know why spdk threads
        > > are idle.
        > >
        > >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
        > > benjamin.walker(a)intel.com> wrote:
        > >
        > >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
        > >                     > I got the run again. It is with 4k write.
        > >                     >
        > >                     > 13.16%  vhost                       [.]
        > >                     >
        > > spdk_ring_dequeue
        > >                     >
        > >                     >    6.08%  vhost                       [.]
        > >                     >
        > > rte_rdtsc
        > >                     >
        > >                     >    4.77%  vhost                       [.]
        > >                     >
        > > spdk_thread_poll
        > >                     >
        > >                     >    2.85%  vhost                       [.]
        > >                     >
        > > _spdk_reactor_run
        > >                     >
        > >
        > >                     You're doing high queue depth for at least 30 seconds
        > > while the trace runs,
        > >                     right? Using fio with the libaio engine on the NBD device
        > > is probably the way to
        > >                     go. Are you limiting the profiling to just the core where
        > > the main SPDK process
        > >                     is pinned? I'm asking because SPDK still appears to be
        > > mostly idle, and I
        > >                     suspect the time is being spent in some other thread (in
        > > the kernel). Consider
        > >                     capturing a profile for the entire system. It will have
        > > fio stuff in it, but the
        > >                     expensive stuff still should generally bubble up to the
        > > top.
        > >
        > >                     Thanks,
        > >                     Ben
        > >
        > >
        > >                     >
        > >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
        > > rimittal(a)ebay.com> wrote:
        > >                     >
        > >                     >     I got the profile with first run.
        > >                     >
        > >                     >       27.91%  vhost                       [.]
        > >                     >
        > > spdk_ring_dequeue
        > >                     >
        > >                     >       12.94%  vhost                       [.]
        > >                     >
        > > rte_rdtsc
        > >                     >
        > >                     >       11.00%  vhost                       [.]
        > >                     >
        > > spdk_thread_poll
        > >                     >
        > >                     >        6.15%  vhost                       [.]
        > >                     >
        > > _spdk_reactor_run
        > >                     >
        > >                     >        4.35%  [kernel]                    [k]
        > >                     >
        > > syscall_return_via_sysret
        > >                     >
        > >                     >        3.91%  vhost                       [.]
        > >                     >
        > > _spdk_msg_queue_run_batch
        > >                     >
        > >                     >        3.38%  vhost                       [.]
        > >                     >
        > > _spdk_event_queue_run_batch
        > >                     >
        > >                     >        2.83%  [unknown]                   [k]
        > >                     >
        > > 0xfffffe000000601b
        > >                     >
        > >                     >        1.45%  vhost                       [.]
        > >                     >
        > > spdk_thread_get_from_ctx
        > >                     >
        > >                     >        1.20%  [kernel]                    [k]
        > >                     >
        > > __fget
        > >                     >
        > >                     >        1.14%  libpthread-2.27.so          [.]
        > >                     >
        > > __libc_read
        > >                     >
        > >                     >        1.00%  libc-2.27.so                [.]
        > >                     >
        > > 0x000000000018ef76
        > >                     >
        > >                     >        0.99%  libc-2.27.so                [.]
        > > 0x000000000018ef79
        > >                     >
        > >                     >     Thanks
        > >                     >     Rishabh Mittal
        > >                     >
        > >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
        > > paul.e.luse(a)intel.com> wrote:
        > >                     >
        > >                     >         That's great.  Keep an eye out for the items
        > > Ben mentions below - at
        > >                     > least the first one should be quick to implement and
        > > compare both profile data
        > >                     > and measured performance.
        > >                     >
        > >                     >         Don't forget about the community meetings
        > > either, great place to chat
        > >                     > about these kinds of things.
        > >                     >
        > > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Crimittal%40ebay.com%7Ceebabb3a2eff4e1dc47108d7326feaca%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033328572563931&amp;sdata=7tan5pPttSBLypikDgsH1lQZGY0HBQQr3rQQGJwIy3s%3D&amp;reserved=0
        > >                     >   Next one is tomorrow morn US time.
        > >                     >
        > >                     >         Thx
        > >                     >         Paul
        > >                     >
        > >                     >         -----Original Message-----
        > >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
        > > Behalf Of Mittal,
        > >                     > Rishabh via SPDK
        > >                     >         Sent: Thursday, August 15, 2019 6:50 PM
        > >                     >         To: Harris, James R <james.r.harris(a)intel.com>;
        > > Walker, Benjamin <
        > >                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
        > >                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
        > > Xiaoxi <
        > >                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
        > > Kadayam, Hari <
        > >                     > hkadayam(a)ebay.com>
        > >                     >         Subject: Re: [SPDK] NBD with SPDK
        > >                     >
        > >                     >         Thanks. I will get the profiling by next week.
        > >                     >
        > >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
        > > james.r.harris(a)intel.com>
        > >                     > wrote:
        > >                     >
        > >                     >
        > >                     >
        > >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
        > > rimittal(a)ebay.com> wrote:
        > >                     >
        > >                     >                 Hi Jim
        > >                     >
        > >                     >                 What tool you use to take profiling.
        > >                     >
        > >                     >             Hi Rishabh,
        > >                     >
        > >                     >             Mostly I just use "perf top".
        > >                     >
        > >                     >             -Jim
        > >                     >
        > >                     >
        > >                     >                 Thanks
        > >                     >                 Rishabh Mittal
        > >                     >
        > >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
        > >                     > james.r.harris(a)intel.com> wrote:
        > >                     >
        > >                     >
        > >                     >
        > >                     >                     On 8/14/19, 9:18 AM, "Walker,
        > > Benjamin" <
        > >                     > benjamin.walker(a)intel.com> wrote:
        > >                     >
        > >                     >                     <trim>
        > >                     >
        > >                     >                         When an I/O is performed in the
        > > process initiating the
        > >                     > I/O to a file, the data
        > >                     >                         goes into the OS page cache
        > > buffers at a layer far
        > >                     > above the bio stack
        > >                     >                         (somewhere up in VFS). If SPDK
        > > were to reserve some
        > >                     > memory and hand it off to
        > >                     >                         your kernel driver, your kernel
        > > driver would still
        > >                     > need to copy it to that
        > >                     >                         location out of the page cache
        > > buffers. We can't
        > >                     > safely share the page cache
        > >                     >                         buffers with a user space
        > > process.
        > >                     >
        > >                     >                     I think Rishabh was suggesting the
        > > SPDK reserve the
        > >                     > virtual address space only.
        > >                     >                     Then the kernel could map the page
        > > cache buffers into that
        > >                     > virtual address space.
        > >                     >                     That would not require a data copy,
        > > but would require the
        > >                     > mapping operations.
        > >                     >
        > >                     >                     I think the profiling data would be
        > > really helpful - to
        > >                     > quantify how much of the 50us
        > >                     >                     Is due to copying the 4KB of
        > > data.  That can help drive
        > >                     > next steps on how to optimize
        > >                     >                     the SPDK NBD module.
        > >                     >
        > >                     >                     Thanks,
        > >                     >
        > >                     >                     -Jim
        > >                     >
        > >                     >
        > >                     >                         As Paul said, I'm skeptical that
        > > the memcpy is
        > >                     > significant in the overall
        > >                     >                         performance you're measuring. I
        > > encourage you to go
        > >                     > look at some profiling data
        > >                     >                         and confirm that the memcpy is
        > > really showing up. I
        > >                     > suspect the overhead is
        > >                     >                         instead primarily in these
        > > spots:
        > >                     >
        > >                     >                         1) Dynamic buffer allocation in
        > > the SPDK NBD backend.
        > >                     >
        > >                     >                         As Paul indicated, the NBD
        > > target is dynamically
        > >                     > allocating memory for each I/O.
        > >                     >                         The NBD backend wasn't designed
        > > to be fast - it was
        > >                     > designed to be simple.
        > >                     >                         Pooling would be a lot faster
        > > and is something fairly
        > >                     > easy to implement.
        > >                     >
        > >                     >                         2) The way SPDK does the
        > > syscalls when it implements
        > >                     > the NBD backend.
        > >                     >
        > >                     >                         Again, the code was designed to
        > > be simple, not high
        > >                     > performance. It simply calls
        > >                     >                         read() and write() on the socket
        > > for each command.
        > >                     > There are much higher
        > >                     >                         performance ways of doing this,
        > > they're just more
        > >                     > complex to implement.
        > >                     >
        > >                     >                         3) The lack of multi-queue
        > > support in NBD
        > >                     >
        > >                     >                         Every I/O is funneled through a
        > > single sockpair up to
        > >                     > user space. That means
        > >                     >                         there is locking going on. I
        > > believe this is just a
        > >                     > limitation of NBD today - it
        > >                     >                         doesn't plug into the block-mq
        > > stuff in the kernel and
        > >                     > expose multiple
        > >                     >                         sockpairs. But someone more
        > > knowledgeable on the
        > >                     > kernel stack would need to take
        > >                     >                         a look.
        > >                     >
        > >                     >                         Thanks,
        > >                     >                         Ben
        > >                     >
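To make item 1) above concrete, here is a minimal sketch of what pooling the NBD I/O buffers could look like. It assumes the public spdk_mempool API from spdk/env.h; the pool name, element size and count are hypothetical, and this is not the actual SPDK NBD code.

    #include "spdk/env.h"

    #define NBD_BUF_COUNT 512              /* assumed: sized to cover the queue depth */
    #define NBD_BUF_SIZE  (128 * 1024)     /* assumed maximum I/O size */

    static struct spdk_mempool *g_nbd_buf_pool;

    /* Called once at start-up instead of allocating per I/O. */
    static int nbd_buf_pool_init(void)
    {
        g_nbd_buf_pool = spdk_mempool_create("nbd_io_bufs", NBD_BUF_COUNT,
                                             NBD_BUF_SIZE,
                                             SPDK_MEMPOOL_DEFAULT_CACHE_SIZE,
                                             SPDK_ENV_SOCKET_ID_ANY);
        return g_nbd_buf_pool != NULL ? 0 : -1;
    }

    /* Hot path: grab a pre-allocated buffer; returns NULL if the pool is empty. */
    static void *nbd_get_io_buf(void)
    {
        return spdk_mempool_get(g_nbd_buf_pool);
    }

    /* Completion path: return the buffer instead of freeing it. */
    static void nbd_put_io_buf(void *buf)
    {
        spdk_mempool_put(g_nbd_buf_pool, buf);
    }

The I/O path then trades a per-command allocation and free for a get/put on a pre-allocated pool.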
        > >                     >                         >
        > >                     >                         > Couple of things that I am not
        > > really sure in this
        > >                     > flow is :- 1. How memory
        > >                     >                         > registration is going to work
        > > with RDMA driver.
        > >                     >                         > 2. What changes are required
        > > in spdk memory
        > >                     > management
        > >                     >                         >
        > >                     >                         > Thanks
        > >                     >                         > Rishabh Mittal
        > >                     >
        >
        > >
        > >
        > >
        >
        > _______________________________________________
        > SPDK mailing list
        > SPDK(a)lists.01.org
        > https://lists.01.org/mailman/listinfo/spdk
        
        
        
        -- 
        Regards
        Huang Zhiteng
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-06  2:14 Szmyd, Brian
  0 siblings, 0 replies; 32+ messages in thread
From: Szmyd, Brian @ 2019-09-06  2:14 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 29970 bytes --]

I believe this option has the same number of copies, since you're still sharing the memory
with the Kata VM kernel rather than with the application itself. This is an option that the development
of a virtio-vhost-user driver does not prevent; it's merely a way to allow non-Kata
containers to also use the same device.

I will note that a virtio-vhost-user driver also allows one to project device types other
than just block devices into the kernel device stack. One could also write a user application
that exposed an input, network, console, GPU or socket device as well.

Not that I have any interest in these...

On 9/5/19, 8:08 PM, "Huang Zhiteng" <winston.d(a)gmail.com> wrote:

    Since this SPDK bdev is intended to be consumed by a user application
    running inside a container, we do have the possibility to run the user
    application inside a Kata container instead.  A Kata container does
    introduce a layer of IO virtualization, but through it we convert a user
    space block device on the host into a kernel block device inside the VM
    with fewer memory copies than NBD, thanks to SPDK vhost.  A Kata container
    might impose higher overhead than a plain container, but hopefully it's
    lightweight enough that the overhead is negligible.
    
    On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin
    <benjamin.walker(a)intel.com> wrote:
    >
    > On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
    > > Hi Paul,
    > >
    > > Rather than put the effort into a formalized document here is a brief
    > > description of the solution I have been investigating just to get an opinion
    > > of feasibility or even workability.
    > >
    > > Some background and a reiteration of the problem to set things up. I apologize
    > > for reiterating and for including details that some may already know.
    > >
    > > We are looking for a solution that allows us to write a custom bdev for the
    > > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
    > > have attached and then present that to our application as either a raw block
    > > device or filesystem mountpoint.
    > >
    > > This is normally (as I understand it) done by exposing a device via QEMU to
    > > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
    > > (among others) device according to this spec:
    > >
    > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
    > >
    > > The VM kernel then uses a virtio-scsi module to attach said device into its
    > > SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
    > >
    > > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
    > > pci driver to discover the virtio devices and bind them to the virtio-scsi
    > > driver. There really is no other way (other than platform MMIO type devices)
    > > to attach a device to the virtio-scsi device.
    > >
    > > SPDK exposes the virtio device to the VM via QEMU which has written a "user
    > > space" version of the vhost bus. This driver then translates the API into the
    > > virtio-pci specification:
    > >
    > > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
    > >
    > > This uses an eventfd descriptor for interrupting the non-polling side of the
    > > queue and a UNIX domain socket to setup (and control) the shared memory which
    > > contains the I/O buffers and virtio queues. This is documented in SPDKs own
    > > documentation and diagramed here:
    > >
    > > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
    > >
    > > If we could implement this vhost-user QEMU target as a virtio driver in the
    > > kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
    > > into the host kernel as a virtio device and have it enumerated in the /dev/sd[a-z]+
    > > tree for our containers to bind. Attached is a draft block diagram.
    >
    > If you think of QEMU as just another user-space process, and the SPDK vhost
    > target as a user-space process, then it's clear that vhost-user is simply a
    > cross-process IPC mechanism based on shared memory. The "shared memory" part is
    > the critical part of that description - QEMU pre-registers all of the memory
    > that will be used for I/O buffers (in fact, all of the memory that is mapped
    > into the guest) with the SPDK process by sending fds across a Unix domain
    > socket.
    >
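For readers unfamiliar with that mechanism, "sending fds across a Unix domain socket" is done with SCM_RIGHTS ancillary data. A minimal, self-contained sketch (a hypothetical helper, not the actual QEMU or SPDK vhost-user code) looks like this:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass one memory-region fd to the peer process over a connected
     * AF_UNIX socket.  The peer can then mmap() the same region. */
    static int send_memory_fd(int unix_sock, int mem_fd)
    {
        struct msghdr msg = {0};
        struct iovec iov;
        char payload = 'M';                     /* placeholder message byte */
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg;

        memset(cbuf, 0, sizeof(cbuf));
        iov.iov_base = &payload;
        iov.iov_len = sizeof(payload);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cbuf;
        msg.msg_controllen = sizeof(cbuf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;           /* kernel duplicates the fd for the receiver */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &mem_fd, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
    }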
    > If you move this code into the kernel, you have to solve two issues:
    >
    > 1) What memory is it registering with the SPDK process? The kernel driver has no
    > idea which application process may route I/O to it - in fact the application
    > process may not even exist yet - so it isn't memory allocated to the application
    > process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
    > process, and when the application process performs I/O the kernel copies into
    > those buffers prior to telling SPDK about them? That would work, but now you're
    > back to doing a data copy. I do think you can get it down to 1 data copy instead
    > of 2 with a scheme like this.
    >
    > 2) One of the big performance problems you're seeing is syscall overhead in NBD.
    > If you still have a kernel block device that routes messages up to the SPDK
    > process, the application process is making the same syscalls because it's still
    > interacting with a block device in the kernel, but you're right that the backend
    > SPDK implementation could be polling on shared memory rings and potentially run
    > more efficiently.
    >
    > >
    > > Since we will not have a real bus to signal for the driver to probe for new
    > > devices we can use a sysfs interface for the application to notify the driver
    > > of a new socket and eventfd pair to setup a new virtio-scsi instance.
    > > Otherwise the design simply moves the vhost-user driver from the QEMU
    > > application into the Host kernel itself.
    > >
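Purely as an illustration of the sysfs hook described above (the module, the attribute name, and the idea of writing a socket path are all assumptions, and the actual virtio/vhost-user negotiation is omitted), a skeleton could look like:

    #include <linux/module.h>
    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    static struct kobject *vvu_kobj;

    /* User space writes e.g. a vhost-user socket path into
     * /sys/kernel/vhost_user_virtio/add_device; the driver would then
     * connect to it and bring up a new virtio-scsi instance. */
    static ssize_t add_device_store(struct kobject *kobj,
                                    struct kobj_attribute *attr,
                                    const char *buf, size_t count)
    {
        pr_info("vhost-user-virtio: attach request: %.*s\n", (int)count, buf);
        /* connect to the socket, negotiate, register the virtio device ... */
        return count;
    }

    static struct kobj_attribute add_device_attr =
        __ATTR(add_device, 0200, NULL, add_device_store);

    static int __init vvu_init(void)
    {
        vvu_kobj = kobject_create_and_add("vhost_user_virtio", kernel_kobj);
        if (!vvu_kobj)
            return -ENOMEM;
        return sysfs_create_file(vvu_kobj, &add_device_attr.attr);
    }

    static void __exit vvu_exit(void)
    {
        kobject_put(vvu_kobj);
    }

    module_init(vvu_init);
    module_exit(vvu_exit);
    MODULE_LICENSE("GPL");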
    > > It's my understanding that this will avoid a lot more system calls and copies
    > > compared to exposing an iSCSI device or NBD device as we're currently
    > > discussing. Does this seem feasible?
    >
    > What you really want is a "block device in user space" solution that's higher
    > performance than NBD, and while that's been tried many, many times in the past I
    > do think there is a great opportunity here for someone. I'm not sure that the
    > interface between the block device process and the kernel is best done as a
    > modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
    > to throw in a third option to consider - use NVMe queues in shared memory as the
    > interface instead. The NVMe queues are going to be much more efficient than
    > virtqueues for storage commands.
    >
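To illustrate what "queues in shared memory" buys over per-command syscalls, here is a toy single-producer/single-consumer ring that both sides could poll. It is only an illustration; a real implementation would use the actual NVMe submission/completion entry layout and doorbell semantics, and all names here are hypothetical.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 256                 /* must be a power of two */

    struct cmd {
        uint64_t lba;
        uint32_t num_blocks;
        uint32_t opcode;
    };

    struct shm_ring {                     /* lives in memory mapped by both sides */
        _Atomic uint32_t head;            /* advanced by the consumer (SPDK side) */
        _Atomic uint32_t tail;            /* advanced by the producer (kernel side) */
        struct cmd slots[RING_SIZE];
    };

    /* Producer: enqueue one command without any syscall. */
    static int ring_submit(struct shm_ring *r, const struct cmd *c)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail - head == RING_SIZE)
            return -1;                    /* ring full */
        r->slots[tail & (RING_SIZE - 1)] = *c;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 0;
    }

    /* Consumer: poll for the next command; returns 1 if one was dequeued. */
    static int ring_poll(struct shm_ring *r, struct cmd *out)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head == tail)
            return 0;                     /* nothing to do */
        *out = r->slots[head & (RING_SIZE - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 1;
    }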
    > >
    > > Thanks,
    > > Brian
    > >
    > > On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    > >
    > >     Hi Paul.
    > >
    > >     Thanks for investigating it.
    > >
    > >     We have one more idea floating around. Brian is going to send you a
    > > proposal shortly. If other proposal seems feasible to you that we can evaluate
    > > the work required in both the proposals.
    > >
    > >     Thanks
    > >     Rishabh Mittal
    > >
    > >     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    > >
    > >         Hi,
    > >
    > >         So I was able to perform the same steps here and I think one of the
    > > keys to really seeing what's going on is to start perf top like this:
    > >
    > >          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
    > > sorting on command, shared object and symbol
    > >
    > >         Attached are 2 snapshots, one with a NULL back end for nbd and one
    > > with libaio/nvme.  Some notes after chatting with Ben a bit, please read
    > > through and let us know what you think:
    > >
    > >         * in both cases the vast majority of the highest overhead activities
    > > are kernel
    > >         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
    > > other as well but you have to scroll way down to see it) is the
    > > user/kernel space copy; nothing SPDK can do about that
    > >         * the syscalls that dominate in both cases are likely something that
    > > can be improved on by changing how SPDK interacts with nbd. Ben had a couple
    > > of ideas including (a) using libaio to interact with the nbd fd as opposed to
    > > interacting with the nbd socket, (b) "batching" wherever possible, for example
    > > on writes to nbd investigate not ack'ing them until some number have completed (see the writev sketch below)
    > >         * the kernel slab* commands are likely nbd kernel driver
    > > allocations/frees in the IO path, one possibility would be to look at
    > > optimizing the nbd kernel driver for this one
    > >         * the libc item on the NULL chart also shows up on the libaio profile
    > > however is again way down the scroll so it didn't make the screenshot :)  This
    > > could be a zeroing of something somewhere in the SPDK nbd driver
    > >
    > >         It looks like this data supports what Ben had suspected a while back,
    > > much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
    > > you think and if you want to explore any of the ideas above any further or see
    > > something else in the data that looks worthy to note.
    > >
    > >         Thx
    > >         Paul
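A rough sketch of idea (b): batch completions so several NBD replies are acked with a single writev() instead of one write() per command. struct nbd_reply comes from <linux/nbd.h>; the batching structure is hypothetical and short writes are not handled here.

    #include <linux/nbd.h>
    #include <sys/uio.h>

    #define MAX_BATCH 32

    struct reply_batch {
        struct nbd_reply replies[MAX_BATCH];
        struct iovec     iov[MAX_BATCH];
        int              count;
    };

    /* Queue a completed command's reply header instead of writing it immediately. */
    static void batch_add(struct reply_batch *b, const struct nbd_reply *r)
    {
        b->replies[b->count] = *r;
        b->iov[b->count].iov_base = &b->replies[b->count];
        b->iov[b->count].iov_len  = sizeof(struct nbd_reply);
        b->count++;
    }

    /* Flush all pending replies with one syscall instead of b->count syscalls. */
    static ssize_t batch_flush(int nbd_sock, struct reply_batch *b)
    {
        ssize_t rc = 0;

        if (b->count > 0) {
            rc = writev(nbd_sock, b->iov, b->count);
            b->count = 0;
        }
        return rc;
    }

Read commands would also have to interleave their payload iovecs, and a production version must handle partial writes, but the syscall count per completed command drops roughly by the batch size.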
    > >
    > >
    > >
    > >         -----Original Message-----
    > >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
    > > E
    > >         Sent: Wednesday, September 4, 2019 4:27 PM
    > >         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
    > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
    > > spdk(a)lists.01.org
    > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
    > > Kadayam, Hari <hkadayam(a)ebay.com>
    > >         Subject: Re: [SPDK] NBD with SPDK
    > >
    > >         Cool, thanks for sending this.  I will try and repro tomorrow here and
    > > see what kind of results I get
    > >
    > >         Thx
    > >         Paul
    > >
    > >         -----Original Message-----
    > >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
    > >         Sent: Wednesday, September 4, 2019 4:23 PM
    > >         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
    > > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
    > > spdk(a)lists.01.org
    > >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
    > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
    > >         Subject: Re: [SPDK] NBD with SPDK
    > >
    > >         Avg CPU utilization is very low when I am running this.
    > >
    > >         09/04/2019 04:21:40 PM
    > >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    > >                    2.59    0.00    2.57    0.00    0.00   94.84
    > >
    > >         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
    > > rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
    > >         sda              0.00    0.20      0.00      0.80     0.00     0.00
    > > 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
    > >         sdb              0.00    0.00      0.00      0.00     0.00     0.00
    > > 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
    > >         sdc              0.00 28846.80      0.00 191555.20     0.00
    > > 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
    > >         nb0              0.00 47297.00      0.00
    > > 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00
    > > 4.05   0
    > >
    > >
    > >
    > >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    > >
    > >             I am using this command
    > >
    > >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
    > > rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
    > > runtime 120 --time_based --group_reporting
    > >
    > >             I have created the device by using these commands
    > >               1.  ./root/spdk/app/vhost
    > >               2.  ./rpc.py bdev_aio_create /dev/sdc aio0
    > >               3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
    > >
    > >             I am using  "perf top"  to get the performance
    > >
    > >             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    > >
    > >                 Hi Rishabh,
    > >
    > >                 Maybe it would help (me at least) if you described the
    > > complete & exact steps for your test - both setup of the env & test and
    > > command to profile.  Can you send that out?
    > >
    > >                 Thx
    > >                 Paul
    > >
    > >                 -----Original Message-----
    > >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
    > >                 Sent: Wednesday, September 4, 2019 2:45 PM
    > >                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
    > > James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
    > > paul.e.luse(a)intel.com>
    > >                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
    > > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
    > >                 Subject: Re: [SPDK] NBD with SPDK
    > >
    > >                 Yes, I am using 64 q depth with one thread in fio. I am using
    > > AIO. This profiling is for the entire system. I don't know why spdk threads
    > > are idle.
    > >
    > >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
    > > benjamin.walker(a)intel.com> wrote:
    > >
    > >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
    > >                     > I got the run again. It is with 4k write.
    > >                     >
    > >                     > 13.16%  vhost                       [.]
    > >                     >
    > > spdk_ring_dequeue
    > >                     >
    > >                     >    6.08%  vhost                       [.]
    > >                     >
    > > rte_rdtsc
    > >                     >
    > >                     >    4.77%  vhost                       [.]
    > >                     >
    > > spdk_thread_poll
    > >                     >
    > >                     >    2.85%  vhost                       [.]
    > >                     >
    > > _spdk_reactor_run
    > >                     >
    > >
    > >                     You're doing high queue depth for at least 30 seconds
    > > while the trace runs,
    > >                     right? Using fio with the libaio engine on the NBD device
    > > is probably the way to
    > >                     go. Are you limiting the profiling to just the core where
    > > the main SPDK process
    > >                     is pinned? I'm asking because SPDK still appears to be
    > > mostly idle, and I
    > >                     suspect the time is being spent in some other thread (in
    > > the kernel). Consider
    > >                     capturing a profile for the entire system. It will have
    > > fio stuff in it, but the
    > >                     expensive stuff still should generally bubble up to the
    > > top.
    > >
    > >                     Thanks,
    > >                     Ben
    > >
    > >
    > >                     >
    > >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
    > > rimittal(a)ebay.com> wrote:
    > >                     >
    > >                     >     I got the profile with first run.
    > >                     >
    > >                     >       27.91%  vhost                       [.]
    > >                     >
    > > spdk_ring_dequeue
    > >                     >
    > >                     >       12.94%  vhost                       [.]
    > >                     >
    > > rte_rdtsc
    > >                     >
    > >                     >       11.00%  vhost                       [.]
    > >                     >
    > > spdk_thread_poll
    > >                     >
    > >                     >        6.15%  vhost                       [.]
    > >                     >
    > > _spdk_reactor_run
    > >                     >
    > >                     >        4.35%  [kernel]                    [k]
    > >                     >
    > > syscall_return_via_sysret
    > >                     >
    > >                     >        3.91%  vhost                       [.]
    > >                     >
    > > _spdk_msg_queue_run_batch
    > >                     >
    > >                     >        3.38%  vhost                       [.]
    > >                     >
    > > _spdk_event_queue_run_batch
    > >                     >
    > >                     >        2.83%  [unknown]                   [k]
    > >                     >
    > > 0xfffffe000000601b
    > >                     >
    > >                     >        1.45%  vhost                       [.]
    > >                     >
    > > spdk_thread_get_from_ctx
    > >                     >
    > >                     >        1.20%  [kernel]                    [k]
    > >                     >
    > > __fget
    > >                     >
    > >                     >        1.14%  libpthread-2.27.so          [.]
    > >                     >
    > > __libc_read
    > >                     >
    > >                     >        1.00%  libc-2.27.so                [.]
    > >                     >
    > > 0x000000000018ef76
    > >                     >
    > >                     >        0.99%  libc-2.27.so                [.]
    > > 0x000000000018ef79
    > >                     >
    > >                     >     Thanks
    > >                     >     Rishabh Mittal
    > >                     >
    > >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
    > > paul.e.luse(a)intel.com> wrote:
    > >                     >
    > >                     >         That's great.  Keep an eye out for the items
    > > Ben mentions below - at
    > >                     > least the first one should be quick to implement and
    > > compare both profile data
    > >                     > and measured performance.
    > >                     >
    > >                     >         Don't forget about the community meetings
    > > either, great place to chat
    > >                     > about these kinds of things.
    > >                     >
    > > https://spdk.io/community/
    > >                     >   Next one is tomorrow morn US time.
    > >                     >
    > >                     >         Thx
    > >                     >         Paul
    > >                     >
    > >                     >         -----Original Message-----
    > >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
    > > Behalf Of Mittal,
    > >                     > Rishabh via SPDK
    > >                     >         Sent: Thursday, August 15, 2019 6:50 PM
    > >                     >         To: Harris, James R <james.r.harris(a)intel.com>;
    > > Walker, Benjamin <
    > >                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
    > >                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
    > > Xiaoxi <
    > >                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
    > > Kadayam, Hari <
    > >                     > hkadayam(a)ebay.com>
    > >                     >         Subject: Re: [SPDK] NBD with SPDK
    > >                     >
    > >                     >         Thanks. I will get the profiling by next week.
    > >                     >
    > >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
    > > james.r.harris(a)intel.com>
    > >                     > wrote:
    > >                     >
    > >                     >
    > >                     >
    > >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
    > > rimittal(a)ebay.com> wrote:
    > >                     >
    > >                     >                 Hi Jim
    > >                     >
    > >                     >                 What tool do you use for profiling?
    > >                     >
    > >                     >             Hi Rishabh,
    > >                     >
    > >                     >             Mostly I just use "perf top".
    > >                     >
    > >                     >             -Jim
    > >                     >
    > >                     >
    > >                     >                 Thanks
    > >                     >                 Rishabh Mittal
    > >                     >
    > >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
    > >                     > james.r.harris(a)intel.com> wrote:
    > >                     >
    > >                     >
    > >                     >
    > >                     >                     On 8/14/19, 9:18 AM, "Walker,
    > > Benjamin" <
    > >                     > benjamin.walker(a)intel.com> wrote:
    > >                     >
    > >                     >                     <trim>
    > >                     >
    > >                     >                         When an I/O is performed in the
    > > process initiating the
    > >                     > I/O to a file, the data
    > >                     >                         goes into the OS page cache
    > > buffers at a layer far
    > >                     > above the bio stack
    > >                     >                         (somewhere up in VFS). If SPDK
    > > were to reserve some
    > >                     > memory and hand it off to
    > >                     >                         your kernel driver, your kernel
    > > driver would still
    > >                     > need to copy it to that
    > >                     >                         location out of the page cache
    > > buffers. We can't
    > >                     > safely share the page cache
    > >                     >                         buffers with a user space
    > > process.
    > >                     >
    > >                     >                     I think Rishabh was suggesting the
    > > SPDK reserve the
    > >                     > virtual address space only.
    > >                     >                     Then the kernel could map the page
    > > cache buffers into that
    > >                     > virtual address space.
    > >                     >                     That would not require a data copy,
    > > but would require the
    > >                     > mapping operations.
    > >                     >
    > >                     >                     I think the profiling data would be
    > > really helpful - to
    > >                     > quantify how much of the 50us
    > >                     >                     Is due to copying the 4KB of
    > > data.  That can help drive
    > >                     > next steps on how to optimize
    > >                     >                     the SPDK NBD module.
    > >                     >
    > >                     >                     Thanks,
    > >                     >
    > >                     >                     -Jim
    > >                     >
    > >                     >
    > >                     >                         As Paul said, I'm skeptical that
    > > the memcpy is
    > >                     > significant in the overall
    > >                     >                         performance you're measuring. I
    > > encourage you to go
    > >                     > look at some profiling data
    > >                     >                         and confirm that the memcpy is
    > > really showing up. I
    > >                     > suspect the overhead is
    > >                     >                         instead primarily in these
    > > spots:
    > >                     >
    > >                     >                         1) Dynamic buffer allocation in
    > > the SPDK NBD backend.
    > >                     >
    > >                     >                         As Paul indicated, the NBD
    > > target is dynamically
    > >                     > allocating memory for each I/O.
    > >                     >                         The NBD backend wasn't designed
    > > to be fast - it was
    > >                     > designed to be simple.
    > >                     >                         Pooling would be a lot faster
    > > and is something fairly
    > >                     > easy to implement.
    > >                     >
    > >                     >                         2) The way SPDK does the
    > > syscalls when it implements
    > >                     > the NBD backend.
    > >                     >
    > >                     >                         Again, the code was designed to
    > > be simple, not high
    > >                     > performance. It simply calls
    > >                     >                         read() and write() on the socket
    > > for each command.
    > >                     > There are much higher
    > >                     >                         performance ways of doing this,
    > > they're just more
    > >                     > complex to implement.
    > >                     >
    > >                     >                         3) The lack of multi-queue
    > > support in NBD
    > >                     >
    > >                     >                         Every I/O is funneled through a
    > > single sockpair up to
    > >                     > user space. That means
    > >                     >                         there is locking going on. I
    > > believe this is just a
    > >                     > limitation of NBD today - it
    > >                     >                         doesn't plug into the block-mq
    > > stuff in the kernel and
    > >                     > expose multiple
    > >                     >                         sockpairs. But someone more
    > > knowledgeable on the
    > >                     > kernel stack would need to take
    > >                     >                         a look.
    > >                     >
    > >                     >                         Thanks,
    > >                     >                         Ben
    > >                     >
    > >                     >                         >
    > >                     >                         > Couple of things that I am not
    > > really sure in this
    > >                     > flow is :- 1. How memory
    > >                     >                         > registration is going to work
    > > with RDMA driver.
    > >                     >                         > 2. What changes are required
    > > in spdk memory
    > >                     > management
    > >                     >                         >
    > >                     >                         > Thanks
    > >                     >                         > Rishabh Mittal
    > >                     >
    >
    > >
    > >
    > >
    >
    > _______________________________________________
    > SPDK mailing list
    > SPDK(a)lists.01.org
    > https://lists.01.org/mailman/listinfo/spdk
    
    
    
    -- 
    Regards
    Huang Zhiteng
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-06  2:08 Huang Zhiteng
  0 siblings, 0 replies; 32+ messages in thread
From: Huang Zhiteng @ 2019-09-06  2:08 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 25958 bytes --]

Since this SPDK bdev is intended to be consumed by a user application
running inside a container, we do have the possibility to run the user
application inside a Kata container instead.  A Kata container does
introduce a layer of IO virtualization, but through it we convert a user
space block device on the host into a kernel block device inside the VM
with fewer memory copies than NBD, thanks to SPDK vhost.  A Kata container
might impose higher overhead than a plain container, but hopefully it's
lightweight enough that the overhead is negligible.

On Fri, Sep 6, 2019 at 5:22 AM Walker, Benjamin
<benjamin.walker(a)intel.com> wrote:
>
> On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> > Hi Paul,
> >
> > Rather than put the effort into a formalized document here is a brief
> > description of the solution I have been investigating just to get an opinion
> > of feasibility or even workability.
> >
> > Some background and a reiteration of the problem to set things up. I apologize
> > for reiterating and for including details that some may already know.
> >
> > We are looking for a solution that allows us to write a custom bdev for the
> > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
> > have attached and then present that to our application as either a raw block
> > device or filesystem mountpoint.
> >
> > This is normally (as I understand it) done by exposing a device via QEMU to
> > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
> > (among others) device according to this spec:
> >
> > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
> >
> > The VM kernel then uses a virtio-scsi module to attach said device into its
> > SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
> >
> > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
> > pci driver to discover the virtio devices and bind them to the virtio-scsi
> > driver. There really is no other way (other than platform MMIO type devices)
> > to attach a device to the virtio-scsi device.
> >
> > SPDK exposes the virtio device to the VM via QEMU which has written a "user
> > space" version of the vhost bus. This driver then translates the API into the
> > virtio-pci specification:
> >
> > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
> >
> > This uses an eventfd descriptor for interrupting the non-polling side of the
> > queue and a UNIX domain socket to setup (and control) the shared memory which
> > contains the I/O buffers and virtio queues. This is documented in SPDKs own
> > documentation and diagramed here:
> >
> > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
> >
> > If we could implement this vhost-user QEMU target as a virtio driver in the
> > kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
> > into the host kernel as a virtio device and have it enumerated in the /dev/sd[a-z]+
> > tree for our containers to bind. Attached is a draft block diagram.
>
> If you think of QEMU as just another user-space process, and the SPDK vhost
> target as a user-space process, then it's clear that vhost-user is simply a
> cross-process IPC mechanism based on shared memory. The "shared memory" part is
> the critical part of that description - QEMU pre-registers all of the memory
> that will be used for I/O buffers (in fact, all of the memory that is mapped
> into the guest) with the SPDK process by sending fds across a Unix domain
> socket.
>
> If you move this code into the kernel, you have to solve two issues:
>
> 1) What memory is it registering with the SPDK process? The kernel driver has no
> idea which application process may route I/O to it - in fact the application
> process may not even exist yet - so it isn't memory allocated to the application
> process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
> process, and when the application process performs I/O the kernel copies into
> those buffers prior to telling SPDK about them? That would work, but now you're
> back to doing a data copy. I do think you can get it down to 1 data copy instead
> of 2 with a scheme like this.
>
> 2) One of the big performance problems you're seeing is syscall overhead in NBD.
> If you still have a kernel block device that routes messages up to the SPDK
> process, the application process is making the same syscalls because it's still
> interacting with a block device in the kernel, but you're right that the backend
> SPDK implementation could be polling on shared memory rings and potentially run
> more efficiently.
>
> >
> > Since we will not have a real bus to signal for the driver to probe for new
> > devices we can use a sysfs interface for the application to notify the driver
> > of a new socket and eventfd pair to setup a new virtio-scsi instance.
> > Otherwise the design simply moves the vhost-user driver from the QEMU
> > application into the Host kernel itself.
> >
> > It's my understanding that this will avoid a lot more system calls and copies
> > compared to exposing an iSCSI device or NBD device as we're currently
> > discussing. Does this seem feasible?
>
> What you really want is a "block device in user space" solution that's higher
> performance than NBD, and while that's been tried many, many times in the past I
> do think there is a great opportunity here for someone. I'm not sure that the
> interface between the block device process and the kernel is best done as a
> modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
> to throw in a third option to consider - use NVMe queues in shared memory as the
> interface instead. The NVMe queues are going to be much more efficient than
> virtqueues for storage commands.
>
> >
> > Thanks,
> > Brian
> >
> > On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> >     Hi Paul.
> >
> >     Thanks for investigating it.
> >
> >     We have one more idea floating around. Brian is going to send you a
> > proposal shortly. If other proposal seems feasible to you that we can evaluate
> > the work required in both the proposals.
> >
> >     Thanks
> >     Rishabh Mittal
> >
> >     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> >
> >         Hi,
> >
> >         So I was able to perform the same steps here and I think one of the
> > keys to really seeing what's going on is to start perf top like this:
> >
> >          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
> > sorting on command, shared object and symbol
> >
> >         Attached are 2 snapshots, one with a NULL back end for nbd and one
> > with libaio/nvme.  Some notes after chatting with Ben a bit, please read
> > through and let us know what you think:
> >
> >         * in both cases the vast majority of the highest overhead activities
> > are kernel
> >         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
> > other as well but you have to scroll way down to see it) is the
> > user/kernel space copy; nothing SPDK can do about that
> >         * the syscalls that dominate in both cases are likely something that
> > can be improved on by changing how SPDK interacts with nbd. Ben had a couple
> > of ideas including (a) using libaio to interact with the nbd fd as opposed to
> > interacting with the nbd socket, (b) "batching" wherever possible, for example
> > on writes to nbd investigate not ack'ing them until some number have completed
> >         * the kernel slab* commands are likely nbd kernel driver
> > allocations/frees in the IO path, one possibility would be to look at
> > optimizing the nbd kernel driver for this one
> >         * the libc item on the NULL chart also shows up on the libaio profile
> > however is again way down the scroll so it didn't make the screenshot :)  This
> > could be a zeroing of something somewhere in the SPDK nbd driver
> >
> >         It looks like this data supports what Ben had suspected a while back,
> > much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
> > you think and if you want to explore any of the ideas above any further or see
> > something else in the data that looks worthy to note.
> >
> >         Thx
> >         Paul
> >
> >
> >
> >         -----Original Message-----
> >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
> > E
> >         Sent: Wednesday, September 4, 2019 4:27 PM
> >         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
> > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
> > spdk(a)lists.01.org
> >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
> > Kadayam, Hari <hkadayam(a)ebay.com>
> >         Subject: Re: [SPDK] NBD with SPDK
> >
> >         Cool, thanks for sending this.  I will try and repro tomorrow here and
> > see what kind of results I get
> >
> >         Thx
> >         Paul
> >
> >         -----Original Message-----
> >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> >         Sent: Wednesday, September 4, 2019 4:23 PM
> >         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
> > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>;
> > spdk(a)lists.01.org
> >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
> > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
> >         Subject: Re: [SPDK] NBD with SPDK
> >
> >         Avg CPU utilization is very low when I am running this.
> >
> >         09/04/2019 04:21:40 PM
> >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >                    2.59    0.00    2.57    0.00    0.00   94.84
> >
> >         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
> > rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> >         sda              0.00    0.20      0.00      0.80     0.00     0.00
> > 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
> >         sdb              0.00    0.00      0.00      0.00     0.00     0.00
> > 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
> >         sdc              0.00 28846.80      0.00 191555.20     0.00
> > 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
> >         nb0              0.00 47297.00      0.00
> > 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00
> > 4.05   0
> >
> >
> >
> >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> >
> >             I am using this command
> >
> >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
> > rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
> > runtime 120 --time_based --group_reporting
> >
> >             I have created the device by using these commands
> >               1.  ./root/spdk/app/vhost
> >               2.  ./rpc.py bdev_aio_create /dev/sdc aio0
> >               3. ./rpc.py start_nbd_disk aio0 /dev/nbd0
> >
> >             I am using  "perf top"  to get the performance
> >
> >             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
> >
> >                 Hi Rishabh,
> >
> >                 Maybe it would help (me at least) if you described the
> > complete & exact steps for your test - both setup of the env & test and
> > command to profile.  Can you send that out?
> >
> >                 Thx
> >                 Paul
> >
> >                 -----Original Message-----
> >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
> >                 Sent: Wednesday, September 4, 2019 2:45 PM
> >                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
> > James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
> > paul.e.luse(a)intel.com>
> >                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
> > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
> >                 Subject: Re: [SPDK] NBD with SPDK
> >
> >                 Yes, I am using 64 q depth with one thread in fio. I am using
> > AIO. This profiling is for the entire system. I don't know why spdk threads
> > are idle.
> >
> >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
> > benjamin.walker(a)intel.com> wrote:
> >
> >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> >                     > I got the run again. It is with 4k write.
> >                     >
> >                     > 13.16%  vhost                       [.]
> >                     >
> > spdk_ring_dequeue
> >                     >
> >                     >    6.08%  vhost                       [.]
> >                     >
> > rte_rdtsc
> >                     >
> >                     >    4.77%  vhost                       [.]
> >                     >
> > spdk_thread_poll
> >                     >
> >                     >    2.85%  vhost                       [.]
> >                     >
> > _spdk_reactor_run
> >                     >
> >
> >                     You're doing high queue depth for at least 30 seconds
> > while the trace runs,
> >                     right? Using fio with the libaio engine on the NBD device
> > is probably the way to
> >                     go. Are you limiting the profiling to just the core where
> > the main SPDK process
> >                     is pinned? I'm asking because SPDK still appears to be
> > mostly idle, and I
> >                     suspect the time is being spent in some other thread (in
> > the kernel). Consider
> >                     capturing a profile for the entire system. It will have
> > fio stuff in it, but the
> >                     expensive stuff still should generally bubble up to the
> > top.
> >
> >                     Thanks,
> >                     Ben
> >
> >
> >                     >
> >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
> > rimittal(a)ebay.com> wrote:
> >                     >
> >                     >     I got the profile with first run.
> >                     >
> >                     >       27.91%  vhost                       [.]
> >                     >
> > spdk_ring_dequeue
> >                     >
> >                     >       12.94%  vhost                       [.]
> >                     >
> > rte_rdtsc
> >                     >
> >                     >       11.00%  vhost                       [.]
> >                     >
> > spdk_thread_poll
> >                     >
> >                     >        6.15%  vhost                       [.]
> >                     >
> > _spdk_reactor_run
> >                     >
> >                     >        4.35%  [kernel]                    [k]
> >                     >
> > syscall_return_via_sysret
> >                     >
> >                     >        3.91%  vhost                       [.]
> >                     >
> > _spdk_msg_queue_run_batch
> >                     >
> >                     >        3.38%  vhost                       [.]
> >                     >
> > _spdk_event_queue_run_batch
> >                     >
> >                     >        2.83%  [unknown]                   [k]
> >                     >
> > 0xfffffe000000601b
> >                     >
> >                     >        1.45%  vhost                       [.]
> >                     >
> > spdk_thread_get_from_ctx
> >                     >
> >                     >        1.20%  [kernel]                    [k]
> >                     >
> > __fget
> >                     >
> >                     >        1.14%  libpthread-2.27.so          [.]
> >                     >
> > __libc_read
> >                     >
> >                     >        1.00%  libc-2.27.so                [.]
> >                     >
> > 0x000000000018ef76
> >                     >
> >                     >        0.99%  libc-2.27.so                [.]
> > 0x000000000018ef79
> >                     >
> >                     >     Thanks
> >                     >     Rishabh Mittal
> >                     >
> >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
> > paul.e.luse(a)intel.com> wrote:
> >                     >
> >                     >         That's great.  Keep an eye out for the items
> > Ben mentions below - at
> >                     > least the first one should be quick to implement and
> > compare both profile data
> >                     > and measured performance.
> >                     >
> >                     >         Don't forget about the community meetings
> > either, great place to chat
> >                     > about these kinds of things.
> >                     >
> > https://spdk.io/community/
> >                     >   Next one is tomorrow morn US time.
> >                     >
> >                     >         Thx
> >                     >         Paul
> >                     >
> >                     >         -----Original Message-----
> >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
> > Behalf Of Mittal,
> >                     > Rishabh via SPDK
> >                     >         Sent: Thursday, August 15, 2019 6:50 PM
> >                     >         To: Harris, James R <james.r.harris(a)intel.com>;
> > Walker, Benjamin <
> >                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
> >                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
> > Xiaoxi <
> >                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
> > Kadayam, Hari <
> >                     > hkadayam(a)ebay.com>
> >                     >         Subject: Re: [SPDK] NBD with SPDK
> >                     >
> >                     >         Thanks. I will get the profiling by next week.
> >                     >
> >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
> > james.r.harris(a)intel.com>
> >                     > wrote:
> >                     >
> >                     >
> >                     >
> >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
> > rimittal(a)ebay.com> wrote:
> >                     >
> >                     >                 Hi Jim
> >                     >
> >                     >                 What tool do you use for profiling?
> >                     >
> >                     >             Hi Rishabh,
> >                     >
> >                     >             Mostly I just use "perf top".
> >                     >
> >                     >             -Jim
> >                     >
> >                     >
> >                     >                 Thanks
> >                     >                 Rishabh Mittal
> >                     >
> >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
> >                     > james.r.harris(a)intel.com> wrote:
> >                     >
> >                     >
> >                     >
> >                     >                     On 8/14/19, 9:18 AM, "Walker,
> > Benjamin" <
> >                     > benjamin.walker(a)intel.com> wrote:
> >                     >
> >                     >                     <trim>
> >                     >
> >                     >                         When an I/O is performed in the
> > process initiating the
> >                     > I/O to a file, the data
> >                     >                         goes into the OS page cache
> > buffers at a layer far
> >                     > above the bio stack
> >                     >                         (somewhere up in VFS). If SPDK
> > were to reserve some
> >                     > memory and hand it off to
> >                     >                         your kernel driver, your kernel
> > driver would still
> >                     > need to copy it to that
> >                     >                         location out of the page cache
> > buffers. We can't
> >                     > safely share the page cache
> >                     >                         buffers with a user space
> > process.
> >                     >
> >                     >                     I think Rishabh was suggesting the
> > SPDK reserve the
> >                     > virtual address space only.
> >                     >                     Then the kernel could map the page
> > cache buffers into that
> >                     > virtual address space.
> >                     >                     That would not require a data copy,
> > but would require the
> >                     > mapping operations.
> >                     >
> >                     >                     I think the profiling data would be
> > really helpful - to
> >                     > quantify how much of the 50us
> >                     >                     Is due to copying the 4KB of
> > data.  That can help drive
> >                     > next steps on how to optimize
> >                     >                     the SPDK NBD module.
> >                     >
> >                     >                     Thanks,
> >                     >
> >                     >                     -Jim
> >                     >
> >                     >
> >                     >                         As Paul said, I'm skeptical that
> > the memcpy is
> >                     > significant in the overall
> >                     >                         performance you're measuring. I
> > encourage you to go
> >                     > look at some profiling data
> >                     >                         and confirm that the memcpy is
> > really showing up. I
> >                     > suspect the overhead is
> >                     >                         instead primarily in these
> > spots:
> >                     >
> >                     >                         1) Dynamic buffer allocation in
> > the SPDK NBD backend.
> >                     >
> >                     >                         As Paul indicated, the NBD
> > target is dynamically
> >                     > allocating memory for each I/O.
> >                     >                         The NBD backend wasn't designed
> > to be fast - it was
> >                     > designed to be simple.
> >                     >                         Pooling would be a lot faster
> > and is something fairly
> >                     > easy to implement.
> >                     >
> >                     >                         2) The way SPDK does the
> > syscalls when it implements
> >                     > the NBD backend.
> >                     >
> >                     >                         Again, the code was designed to
> > be simple, not high
> >                     > performance. It simply calls
> >                     >                         read() and write() on the socket
> > for each command.
> >                     > There are much higher
> >                     >                         performance ways of doing this,
> > they're just more
> >                     > complex to implement.
> >                     >
> >                     >                         3) The lack of multi-queue
> > support in NBD
> >                     >
> >                     >                         Every I/O is funneled through a
> > single sockpair up to
> >                     > user space. That means
> >                     >                         there is locking going on. I
> > believe this is just a
> >                     > limitation of NBD today - it
> >                     >                         doesn't plug into the block-mq
> > stuff in the kernel and
> >                     > expose multiple
> >                     >                         sockpairs. But someone more
> > knowledgeable on the
> >                     > kernel stack would need to take
> >                     >                         a look.
> >                     >
> >                     >                         Thanks,
> >                     >                         Ben
> >                     >
> >                     >                         >
> >                     >                         > Couple of things that I am not
> > really sure in this
> >                     > flow is :- 1. How memory
> >                     >                         > registration is going to work
> > with RDMA driver.
> >                     >                         > 2. What changes are required
> > in spdk memory
> >                     > management
> >                     >                         >
> >                     >                         > Thanks
> >                     >                         > Rishabh Mittal
> >                     >
>
> >
> >
> >
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk



-- 
Regards
Huang Zhiteng

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-05 22:00 Szmyd, Brian
  0 siblings, 0 replies; 32+ messages in thread
From: Szmyd, Brian @ 2019-09-05 22:00 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 30989 bytes --]

> What memory is it registering with the SPDK process?

Only the kernel and the SPDK application would share memory. Writes from the application will most likely be going through the VFS, so I don't think it's feasible to share the buffers directly with the SPDK app. Yes, there would be a copy from the application's write buffers into the shared memory region by the kernel. This is how it already works with virtio-pci under QEMU, right? I'm not trying to optimize that path; as you say, it's a removal of a copy on the other side.

> If you still have a kernel block device that routes messages up to the SPDK process, the application process is making the same syscalls because it's still
> interacting with a block device in the kernel.

Correct, there is no intention to remove this syscall from the application into the kernel, since the write will also be accompanied by VFS operations on the block device that only the kernel can provide. My impression is that most of the added latency we are concerned with comes from transforming each write/read to and from NBD messages forwarded to the SPDK application over a normal socket.

> use NVMe queues in shared memory as the interface instead

You could be correct that this is more efficient. It would involve implementing something I assumed would end up quite similar to the virtio spec since we don't want to use TCP messages (NVMf) or act as a PCIe device.  
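
As a rough illustration of that idea, here is a minimal C sketch of a shared-memory queue pair, loosely modeled on an NVMe submission/completion queue pair. All names and the exact layout are hypothetical; this is not an existing SPDK or kernel interface.

    /* Hypothetical shared-memory queue pair between a kernel block driver
     * (command producer) and the SPDK app (command consumer). Illustration
     * only; not an existing SPDK or kernel interface. */
    #include <stdint.h>

    #define QUEUE_DEPTH 128

    struct io_cmd {                /* loosely modeled on an NVMe command */
        uint16_t cid;              /* command identifier */
        uint8_t  opc;              /* opcode: read, write, flush, ... */
        uint8_t  rsvd;
        uint32_t nlb;              /* number of logical blocks */
        uint64_t slba;             /* starting LBA */
        uint64_t buf_offset;       /* data buffer location as an offset into
                                    * the shared region (an offset, not a
                                    * pointer, since each process maps the
                                    * region at a different address) */
    };

    struct io_cpl {
        uint16_t cid;              /* identifies the completed command */
        uint16_t status;
    };

    struct shared_queue_pair {
        volatile uint32_t sq_tail; /* producer index, written by the kernel */
        volatile uint32_t sq_head; /* consumer index, written by SPDK */
        volatile uint32_t cq_tail; /* completion producer, written by SPDK */
        volatile uint32_t cq_head; /* completion consumer, written by kernel */
        struct io_cmd sq[QUEUE_DEPTH];
        struct io_cpl cq[QUEUE_DEPTH];
    };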

On 9/5/19, 3:22 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

    On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
    > Hi Paul,
    > 
    > Rather than put the effort into a formalized document here is a brief
    > description of the solution I have been investigating just to get an opinion
    > of feasibility or even workability. 
    > 
    > Some background and a reiteration of the problem to set things up. I apologize
    > for reiterating anything and for including details that some may already know.
    > 
    > We are looking for a solution that allows us to write a custom bdev for the
    > SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
    > have attached and then present that to our application as either a raw block
    > device or filesystem mountpoint.
    > 
    > This is normally (as I understand it) done by exposing a device via QEMU to
    > a VM using the vhost target. This SPDK target has implemented the virtio-scsi
    > (among others) device according to this spec:
    > 
    > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
    > 
    > The VM kernel then uses a virtio-scsi module to attach said device into its
    > SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
    > 
    > The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
    > pci driver to discover the virtio devices and bind them to the virtio-scsi
    > driver. There really is no other way (other than platform MMIO type devices)
    > to attach a device to the virtio-scsi device.
    > 
    > SPDK exposes the virtio device to the VM via QEMU which has written a "user
    > space" version of the vhost bus. This driver then translates the API into the
    > virtio-pci specification:
    > 
    > https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
    > 
    > This uses an eventfd descriptor for interrupting the non-polling side of the
    > queue and a UNIX domain socket to setup (and control) the shared memory which
    > contains the I/O buffers and virtio queues. This is documented in SPDKs own
    > documentation and diagramed here:
    > 
    > https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
    > 
    > If we could implement this vhost-user QEMU target as a virtio driver in the
    > kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
    > into the host kernel as a virtio device and enumerated in the /dev/sd[a-z]+
    > tree for our containers to bind. Attached is a draft block diagram.
    
    If you think of QEMU as just another user-space process, and the SPDK vhost
    target as a user-space process, then it's clear that vhost-user is simply a
    cross-process IPC mechanism based on shared memory. The "shared memory" part is
    the critical part of that description - QEMU pre-registers all of the memory
    that will be used for I/O buffers (in fact, all of the memory that is mapped
    into the guest) with the SPDK process by sending fds across a Unix domain
    socket.
    
    If you move this code into the kernel, you have to solve two issues:
    
    1) What memory is it registering with the SPDK process? The kernel driver has no
    idea which application process may route I/O to it - in fact the application
    process may not even exist yet - so it isn't memory allocated to the application
    process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
    process, and when the application process performs I/O the kernel copies into
    those buffers prior to telling SPDK about them? That would work, but now you're
    back to doing a data copy. I do think you can get it down to 1 data copy instead
    of 2 with a scheme like this.
    
    2) One of the big performance problems you're seeing is syscall overhead in NBD.
    If you still have a kernel block device that routes messages up to the SPDK
    process, the application process is making the same syscalls because it's still
    interacting with a block device in the kernel, but you're right that the backend
    SPDK implementation could be polling on shared memory rings and potentially run
    more efficiently.
    
    > 
    > Since we will not have a real bus to signal for the driver to probe for new
    > devices we can use a sysfs interface for the application to notify the driver
    > of a new socket and eventfd pair to setup a new virtio-scsi instance.
    > Otherwise the design simply moves the vhost-user driver from the QEMU
    > application into the Host kernel itself.
    > 
    > It's my understanding that this will avoid a lot more system calls and copies
    > compared to exposing an iSCSI device or NBD device as we're currently
    > discussing. Does this seem feasible?
    
    What you really want is a "block device in user space" solution that's higher
    performance than NBD, and while that's been tried many, many times in the past I
    do think there is a great opportunity here for someone. I'm not sure that the
    interface between the block device process and the kernel is best done as a
    modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
    to throw in a third option to consider - use NVMe queues in shared memory as the
    interface instead. The NVMe queues are going to be much more efficient than
    virtqueues for storage commands.
    
    > 
    > Thanks,
    > Brian
    > 
    > On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    > 
    >     Hi Paul.
    >     
    >     Thanks for investigating it. 
    >     
    >     We have one more idea floating around. Brian is going to send you a
    > proposal shortly. If the other proposal seems feasible to you, then we can evaluate
    > the work required for both proposals.
    >     
    >     Thanks
    >     Rishabh Mittal
    >     
    >     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    >     
    >         Hi,
    >         
    >         So I was able to perform the same steps here and I think one of the
    > keys to really seeing what's going on is to start perf top like this:
    >         
    >          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
    > sorting on command, shared object and symbol
    >         
    >         Attached are 2 snapshots, one with a NULL back end for nbd and one
    > with libaio/nvme.  Some notes after chatting with Ben a bit, please read
    > through and let us know what you think:
    >         
    >         * in both cases the vast majority of the highest overhead activities
    > are kernel
    >         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
    > other as well but you have to scroll way down to see it) is the
    > user/kernel space copy; nothing SPDK can do about that
    >         * the syscalls that dominate in both cases are likely something that
    > can be improved on by changing how SPDK interacts with nbd. Ben had a couple
    > of ideas including (a) using libaio to interact with the nbd fd as opposed to
    > interacting with the nbd socket, (b) "batching" wherever possible, for example
    > on writes to nbd investigate not ack'ing them until some number have completed
    >         * the kernel slab* commands are likely nbd kernel driver
    > allocations/frees in the IO path, one possibility would be to look at
    > optimizing the nbd kernel driver for this one
    >         * the libc item on the NULL chart also shows up on the libaio profile
    > however is again way down the scroll so it didn't make the screenshot :)  This
    > could be a zeroing of something somewhere in the SPDK nbd driver
    >         
    >         It looks like this data supports what Ben had suspected a while back,
    > much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
    > you think and if you want to explore any of the ideas above any further or see
    > something else in the data that looks worthy to note.
    >         
    >         Thx
    >         Paul
    >         
    >         
    >         
    >         -----Original Message-----
    >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
    > E
    >         Sent: Wednesday, September 4, 2019 4:27 PM
    >         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
    > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; 
    > spdk(a)lists.01.org
    >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
    > Kadayam, Hari <hkadayam(a)ebay.com>
    >         Subject: Re: [SPDK] NBD with SPDK
    >         
    >         Cool, thanks for sending this.  I will try and repro tomorrow here and
    > see what kind of results I get
    >         
    >         Thx
    >         Paul
    >         
    >         -----Original Message-----
    >         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
    >         Sent: Wednesday, September 4, 2019 4:23 PM
    >         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
    > benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; 
    > spdk(a)lists.01.org
    >         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
    > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
    >         Subject: Re: [SPDK] NBD with SPDK
    >         
    >         Avg CPU utilization is very low when I am running this.
    >         
    >         09/04/2019 04:21:40 PM
    >         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    >                    2.59    0.00    2.57    0.00    0.00   94.84
    >         
    >         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
    > rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
    >         sda              0.00    0.20      0.00      0.80     0.00     0.00   
    > 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
    >         sdb              0.00    0.00      0.00      0.00     0.00     0.00   
    > 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
    >         sdc              0.00 28846.80      0.00 191555.20     0.00
    > 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
    >         nb0              0.00 47297.00      0.00
    > 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00     
    > 4.05   0
    >         
    >         
    >         
    >         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    >         
    >             I am using this command
    >             
    >             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
    > rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
    > runtime 120 --time_based --group_reporting
    >             
    >             I have created the device by using these commands
    >             	1.  ./root/spdk/app/vhost
    >             	2.  ./rpc.py bdev_aio_create /dev/sdc aio0
    >             	3. /rpc.py start_nbd_disk aio0 /dev/nbd0
    >             
    >             I am using  "perf top"  to get the performance 
    >             
    >             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    >             
    >                 Hi Rishabh,
    >                 
    >                 Maybe it would help (me at least) if you described the
    > complete & exact steps for your test - both setup of the env & test and
    > command to profile.  Can you send that out?
    >                 
    >                 Thx
    >                 Paul
    >                 
    >                 -----Original Message-----
    >                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
    >                 Sent: Wednesday, September 4, 2019 2:45 PM
    >                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
    > James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
    > paul.e.luse(a)intel.com>
    >                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
    > hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
    >                 Subject: Re: [SPDK] NBD with SPDK
    >                 
    >                 Yes, I am using 64 q depth with one thread in fio. I am using
    > AIO. This profiling is for the entire system. I don't know why spdk threads
    > are idle.
    >                 
    >                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
    > benjamin.walker(a)intel.com> wrote:
    >                 
    >                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
    >                     > I got the run again. It is with 4k write.
    >                     > 
    >                     > 13.16%  vhost                       [.]
    >                     >
    > spdk_ring_dequeue                                                             
    >                     >              
    >                     >    6.08%  vhost                       [.]
    >                     >
    > rte_rdtsc                                                                     
    >                     >              
    >                     >    4.77%  vhost                       [.]
    >                     >
    > spdk_thread_poll                                                              
    >                     >              
    >                     >    2.85%  vhost                       [.]
    >                     >
    > _spdk_reactor_run                                                             
    >                     >  
    >                     
    >                     You're doing high queue depth for at least 30 seconds
    > while the trace runs,
    >                     right? Using fio with the libaio engine on the NBD device
    > is probably the way to
    >                     go. Are you limiting the profiling to just the core where
    > the main SPDK process
    >                     is pinned? I'm asking because SPDK still appears to be
    > mostly idle, and I
    >                     suspect the time is being spent in some other thread (in
    > the kernel). Consider
    >                     capturing a profile for the entire system. It will have
    > fio stuff in it, but the
    >                     expensive stuff still should generally bubble up to the
    > top.
    >                     
    >                     Thanks,
    >                     Ben
    >                     
    >                     
    >                     > 
    >                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
    > rimittal(a)ebay.com> wrote:
    >                     > 
    >                     >     I got the profile with first run. 
    >                     >     
    >                     >       27.91%  vhost                       [.]
    >                     >
    > spdk_ring_dequeue                                                             
    >                     >              
    >                     >       12.94%  vhost                       [.]
    >                     >
    > rte_rdtsc                                                                     
    >                     >              
    >                     >       11.00%  vhost                       [.]
    >                     >
    > spdk_thread_poll                                                              
    >                     >              
    >                     >        6.15%  vhost                       [.]
    >                     >
    > _spdk_reactor_run                                                             
    >                     >              
    >                     >        4.35%  [kernel]                    [k]
    >                     >
    > syscall_return_via_sysret                                                     
    >                     >              
    >                     >        3.91%  vhost                       [.]
    >                     >
    > _spdk_msg_queue_run_batch                                                     
    >                     >              
    >                     >        3.38%  vhost                       [.]
    >                     >
    > _spdk_event_queue_run_batch                                                   
    >                     >              
    >                     >        2.83%  [unknown]                   [k]
    >                     >
    > 0xfffffe000000601b                                                            
    >                     >              
    >                     >        1.45%  vhost                       [.]
    >                     >
    > spdk_thread_get_from_ctx                                                      
    >                     >              
    >                     >        1.20%  [kernel]                    [k]
    >                     >
    > __fget                                                                        
    >                     >              
    >                     >        1.14%  libpthread-2.27.so          [.]
    >                     >
    > __libc_read                                                                   
    >                     >              
    >                     >        1.00%  libc-2.27.so                [.]
    >                     >
    > 0x000000000018ef76                                                            
    >                     >              
    >                     >        0.99%  libc-2.27.so                [.]
    > 0x000000000018ef79          
    >                     >     
    >                     >     Thanks
    >                     >     Rishabh Mittal                         
    >                     >     
    >                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
    > paul.e.luse(a)intel.com> wrote:
    >                     >     
    >                     >         That's great.  Keep any eye out for the items
    > Ben mentions below - at
    >                     > least the first one should be quick to implement and
    > compare both profile data
    >                     > and measured performance.
    >                     >         
    >                     >         Don’t' forget about the community meetings
    > either, great place to chat
    >                     > about these kinds of things.  
    >                     > 
> https://spdk.io/community/
    >                     >   Next one is tomorrow morn US time.
    >                     >         
    >                     >         Thx
    >                     >         Paul
    >                     >         
    >                     >         -----Original Message-----
    >                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
    > Behalf Of Mittal,
    >                     > Rishabh via SPDK
    >                     >         Sent: Thursday, August 15, 2019 6:50 PM
    >                     >         To: Harris, James R <james.r.harris(a)intel.com>;
    > Walker, Benjamin <
    >                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
    >                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
    > Xiaoxi <
    >                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
    > Kadayam, Hari <
    >                     > hkadayam(a)ebay.com>
    >                     >         Subject: Re: [SPDK] NBD with SPDK
    >                     >         
    >                     >         Thanks. I will get the profiling by next week. 
    >                     >         
    >                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
    > james.r.harris(a)intel.com>
    >                     > wrote:
    >                     >         
    >                     >             
    >                     >             
    >                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
    > rimittal(a)ebay.com> wrote:
    >                     >             
    >                     >                 Hi Jim
    >                     >                 
    >                     >                 What tool you use to take profiling. 
    >                     >             
    >                     >             Hi Rishabh,
    >                     >             
    >                     >             Mostly I just use "perf top".
    >                     >             
    >                     >             -Jim
    >                     >             
    >                     >                 
    >                     >                 Thanks
    >                     >                 Rishabh Mittal
    >                     >                 
    >                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
    >                     > james.r.harris(a)intel.com> wrote:
    >                     >                 
    >                     >                     
    >                     >                     
    >                     >                     On 8/14/19, 9:18 AM, "Walker,
    > Benjamin" <
    >                     > benjamin.walker(a)intel.com> wrote:
    >                     >                     
    >                     >                     <trim>
    >                     >                         
    >                     >                         When an I/O is performed in the
    > process initiating the
    >                     > I/O to a file, the data
    >                     >                         goes into the OS page cache
    > buffers at a layer far
    >                     > above the bio stack
    >                     >                         (somewhere up in VFS). If SPDK
    > were to reserve some
    >                     > memory and hand it off to
    >                     >                         your kernel driver, your kernel
    > driver would still
    >                     > need to copy it to that
    >                     >                         location out of the page cache
    > buffers. We can't
    >                     > safely share the page cache
    >                     >                         buffers with a user space
    > process.
    >                     >                        
    >                     >                     I think Rishabh was suggesting the
    > SPDK reserve the
    >                     > virtual address space only.
    >                     >                     Then the kernel could map the page
    > cache buffers into that
    >                     > virtual address space.
    >                     >                     That would not require a data copy,
    > but would require the
    >                     > mapping operations.
    >                     >                     
    >                     >                     I think the profiling data would be
    > really helpful - to
    >                     > quantify how much of the 50us
    >                     >                     Is due to copying the 4KB of
    > data.  That can help drive
    >                     > next steps on how to optimize
    >                     >                     the SPDK NBD module.
    >                     >                     
    >                     >                     Thanks,
    >                     >                     
    >                     >                     -Jim
    >                     >                     
    >                     >                     
    >                     >                         As Paul said, I'm skeptical that
    > the memcpy is
    >                     > significant in the overall
    >                     >                         performance you're measuring. I
    > encourage you to go
    >                     > look at some profiling data
    >                     >                         and confirm that the memcpy is
    > really showing up. I
    >                     > suspect the overhead is
    >                     >                         instead primarily in these
    > spots:
    >                     >                         
    >                     >                         1) Dynamic buffer allocation in
    > the SPDK NBD backend.
    >                     >                         
    >                     >                         As Paul indicated, the NBD
    > target is dynamically
    >                     > allocating memory for each I/O.
    >                     >                         The NBD backend wasn't designed
    > to be fast - it was
    >                     > designed to be simple.
    >                     >                         Pooling would be a lot faster
    > and is something fairly
    >                     > easy to implement.
    >                     >                         
    >                     >                         2) The way SPDK does the
    > syscalls when it implements
    >                     > the NBD backend.
    >                     >                         
    >                     >                         Again, the code was designed to
    > be simple, not high
    >                     > performance. It simply calls
    >                     >                         read() and write() on the socket
    > for each command.
    >                     > There are much higher
    >                     >                         performance ways of doing this,
    > they're just more
    >                     > complex to implement.
    >                     >                         
    >                     >                         3) The lack of multi-queue
    > support in NBD
    >                     >                         
    >                     >                         Every I/O is funneled through a
    > single sockpair up to
    >                     > user space. That means
    >                     >                         there is locking going on. I
    > believe this is just a
    >                     > limitation of NBD today - it
    >                     >                         doesn't plug into the block-mq
    > stuff in the kernel and
    >                     > expose multiple
    >                     >                         sockpairs. But someone more
    > knowledgeable on the
    >                     > kernel stack would need to take
    >                     >                         a look.
    >                     >                         
    >                     >                         Thanks,
    >                     >                         Ben
    >                     >                         
    >                     >                         > 
    >                     >                         > Couple of things that I am not
    > really sure in this
    >                     > flow is :- 1. How memory
    >                     >                         > registration is going to work
    > with RDMA driver.
    >                     >                         > 2. What changes are required
    > in spdk memory
    >                     > management
    >                     >                         > 
    >                     >                         > Thanks
    >                     >                         > Rishabh Mittal
    >                     >                         
    
    >     
    >     
    > 
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-05 21:22 Walker, Benjamin
  0 siblings, 0 replies; 32+ messages in thread
From: Walker, Benjamin @ 2019-09-05 21:22 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 26499 bytes --]

On Thu, 2019-09-05 at 19:47 +0000, Szmyd, Brian wrote:
> Hi Paul,
> 
> Rather than put the effort into a formalized document here is a brief
> description of the solution I have been investigating just to get an opinion
> of feasibility or even workability. 
> 
> Some background and a reiteration of the problem to set things up. I apologize
> for reiterating anything and for including details that some may already know.
> 
> We are looking for a solution that allows us to write a custom bdev for the
> SPDK bdev layer that distributes I/O between different NVMe-oF targets that we
> have attached and then present that to our application as either a raw block
> device or filesystem mountpoint.
> 
> This is normally (as I understand it) done by exposing a device via QEMU to
> a VM using the vhost target. This SPDK target has implemented the virtio-scsi
> (among others) device according to this spec:
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021
> 
> The VM kernel then uses a virtio-scsi module to attach said device into its
> SCSI mid-layer and then have the device enumerated as a /dev/sd[a-z]+ device.
> 
> The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-
> pci driver to discover the virtio devices and bind them to the virtio-scsi
> driver. There really is no other way (other than platform MMIO type devices)
> to attach a device to the virtio-scsi device.
> 
> SPDK exposes the virtio device to the VM via QEMU which has written a "user
> space" version of the vhost bus. This driver then translates the API into the
> virtio-pci specification:
> 
> https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst
> 
> This uses an eventfd descriptor for interrupting the non-polling side of the
> queue and a UNIX domain socket to setup (and control) the shared memory which
> contains the I/O buffers and virtio queues. This is documented in SPDKs own
> documentation and diagramed here:
> 
> https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
> 
> If we could implement this vhost-user QEMU target as a virtio driver in the
> kernel as an alternative to the virtio-pci driver, it could bind a SPDK vhost
> into the host kernel as a virtio device and enumerated in the /dev/sd[a-z]+
> tree for our containers to bind. Attached is a draft block diagram.

If you think of QEMU as just another user-space process, and the SPDK vhost
target as a user-space process, then it's clear that vhost-user is simply a
cross-process IPC mechanism based on shared memory. The "shared memory" part is
the critical part of that description - QEMU pre-registers all of the memory
that will be used for I/O buffers (in fact, all of the memory that is mapped
into the guest) with the SPDK process by sending fds across a Unix domain
socket.
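
For reference, the fd-passing half of that registration is ordinary SCM_RIGHTS ancillary data over the Unix domain socket. A minimal sketch in C, with error handling trimmed and the function name invented for illustration:

    /* Minimal sketch of handing a memory-region fd to another process over a
     * Unix domain socket via SCM_RIGHTS, the same mechanism vhost-user uses
     * to share guest memory with the vhost target. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_region_fd(int sock, int region_fd)
    {
        char byte = 0;                      /* must send at least one byte */
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { 0 };
        struct cmsghdr *cmsg;

        memset(ctrl, 0, sizeof(ctrl));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;          /* transfer the descriptor */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &region_fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }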

If you move this code into the kernel, you have to solve two issues:

1) What memory is it registering with the SPDK process? The kernel driver has no
idea which application process may route I/O to it - in fact the application
process may not even exist yet - so it isn't memory allocated to the application
process. Maybe you have a pool of kernel buffers that get mapped into the SPDK
process, and when the application process performs I/O the kernel copies into
those buffers prior to telling SPDK about them? That would work, but now you're
back to doing a data copy. I do think you can get it down to 1 data copy instead
of 2 with a scheme like this.
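
A minimal sketch of what the SPDK-side setup for such a kernel buffer pool could look like, assuming a hypothetical character device (the /dev/spdk_blk_shm name and the pool size are invented here) that exports the pool for mmap:

    /* Sketch of the SPDK-side view of a kernel-owned buffer pool. The target
     * maps the pool once at startup; the kernel would then describe each I/O
     * by its offset within the pool. */
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define POOL_SIZE (64UL * 1024 * 1024)        /* assumed 64 MiB pool */

    static void *map_kernel_pool(void)
    {
        void *pool;
        int fd = open("/dev/spdk_blk_shm", O_RDWR);   /* hypothetical device */

        if (fd < 0) {
            return NULL;
        }

        pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                                    /* mapping stays valid */
        return pool == MAP_FAILED ? NULL : pool;
    }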

2) One of the big performance problems you're seeing is syscall overhead in NBD.
If you still have a kernel block device that routes messages up to the SPDK
process, the application process is making the same syscalls because it's still
interacting with a block device in the kernel, but you're right that the backend
SPDK implementation could be polling on shared memory rings and potentially run
more efficiently.
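
A minimal sketch of what that polling could look like on the SPDK side, assuming an invented shared-memory ring layout and using spdk_poller_register() from spdk/thread.h in place of a blocking read() per command:

    /* Sketch of an SPDK poller servicing a shared-memory ring instead of
     * blocking in read(). spdk_poller_register() is real SPDK API; the ring
     * layout here is invented for illustration. */
    #include <stdint.h>
    #include "spdk/thread.h"

    #define RING_DEPTH 128

    struct shm_ring {
        volatile uint32_t head;     /* consumer index, advanced by SPDK */
        volatile uint32_t tail;     /* producer index, advanced by the kernel */
        /* RING_DEPTH fixed-size command slots would follow here */
    };

    static int poll_shm_ring(void *ctx)
    {
        struct shm_ring *ring = ctx;
        int processed = 0;

        while (ring->head != ring->tail) {
            /* translate slot ring->head % RING_DEPTH into a bdev I/O here */
            ring->head++;
            processed++;
        }

        return processed;   /* non-zero tells the reactor useful work was done */
    }

    /* Registered once per ring from the app init path, e.g.:
     *     struct spdk_poller *poller = spdk_poller_register(poll_shm_ring, ring, 0);
     */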

> 
> Since we will not have a real bus to signal for the driver to probe for new
> devices we can use a sysfs interface for the application to notify the driver
> of a new socket and eventfd pair to setup a new virtio-scsi instance.
> Otherwise the design simply moves the vhost-user driver from the QEMU
> application into the Host kernel itself.
> 
> It's my understanding that this will avoid a lot more system calls and copies
> compared to exposing an iSCSI device or NBD device as we're currently
> discussing. Does this seem feasible?

What you really want is a "block device in user space" solution that's higher
performance than NBD, and while that's been tried many, many times in the past I
do think there is a great opportunity here for someone. I'm not sure that the
interface between the block device process and the kernel is best done as a
modification of NBD or a wholesale replacement by vhost-user-scsi, but I'd like
to throw in a third option to consider - use NVMe queues in shared memory as the
interface instead. The NVMe queues are going to be much more efficient than
virtqueues for storage commands.

> 
> Thanks,
> Brian
> 
> On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> 
>     Hi Paul.
>     
>     Thanks for investigating it. 
>     
>     We have one more idea floating around. Brian is going to send you a
> proposal shortly. If the other proposal seems feasible to you, then we can evaluate
> the work required for both proposals.
>     
>     Thanks
>     Rishabh Mittal
>     
>     On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>     
>         Hi,
>         
>         So I was able to perform the same steps here and I think one of the
> keys to really seeing what's going on is to start perf top like this:
>         
>          “perf top --sort comm,dso,symbol -C 0” to get a more focused view by
> sorting on command, shared object and symbol
>         
>         Attached are 2 snapshots, one with a NULL back end for nbd and one
> with libaio/nvme.  Some notes after chatting with Ben a bit, please read
> through and let us know what you think:
>         
>         * in both cases the vast majority of the highest overhead activities
> are kernel
>         * the "copy_user_enhanced" symbol on the NULL case (it shows up on the
> other as well but you have to scroll way down to see it) is the
> user/kernel space copy; nothing SPDK can do about that
>         * the syscalls that dominate in both cases are likely something that
> can be improved on by changing how SPDK interacts with nbd. Ben had a couple
> of ideas including (a) using libaio to interact with the nbd fd as opposed to
> interacting with the nbd socket, (b) "batching" wherever possible, for example
> on writes to nbd investigate not ack'ing them until some number have completed
>         * the kernel slab* commands are likely nbd kernel driver
> allocations/frees in the IO path, one possibility would be to look at
> optimizing the nbd kernel driver for this one
>         * the libc item on the NULL chart also shows up on the libaio profile
> however is again way down the scroll so it didn't make the screenshot :)  This
> could be a zeroing of something somewhere in the SPDK nbd driver
>         
>         It looks like this data supports what Ben had suspected a while back,
> much of the overhead we're looking at is kernel nbd.  Anyway, let us know what
> you think and if you want to explore any of the ideas above any further or see
> something else in the data that looks worthy to note.
>         
>         Thx
>         Paul
>         
>         
>         
>         -----Original Message-----
>         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul
> E
>         Sent: Wednesday, September 4, 2019 4:27 PM
>         To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <
> benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; 
> spdk(a)lists.01.org
>         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
> Kadayam, Hari <hkadayam(a)ebay.com>
>         Subject: Re: [SPDK] NBD with SPDK
>         
>         Cool, thanks for sending this.  I will try and repro tomorrow here and
> see what kind of results I get
>         
>         Thx
>         Paul
>         
>         -----Original Message-----
>         From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
>         Sent: Wednesday, September 4, 2019 4:23 PM
>         To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <
> benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; 
> spdk(a)lists.01.org
>         Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
> hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
>         Subject: Re: [SPDK] NBD with SPDK
>         
>         Avg CPU utilization is very low when I am running this.
>         
>         09/04/2019 04:21:40 PM
>         avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>                    2.59    0.00    2.57    0.00    0.00   94.84
>         
>         Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %
> rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
>         sda              0.00    0.20      0.00      0.80     0.00     0.00   
> 0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
>         sdb              0.00    0.00      0.00      0.00     0.00     0.00   
> 0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
>         sdc              0.00 28846.80      0.00 191555.20     0.00
> 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
>         nb0              0.00 47297.00      0.00
> 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00     
> 4.05   0
>         
>         
>         
>         On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>         
>             I am using this command
>             
>             fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --
> rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --
> runtime 120 --time_based --group_reporting
>             
>             I have created the device by using these commands
>             	1.  ./root/spdk/app/vhost
>             	2.  ./rpc.py bdev_aio_create /dev/sdc aio0
>             	3. /rpc.py start_nbd_disk aio0 /dev/nbd0
>             
>             I am using  "perf top"  to get the performance 
>             
>             On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>             
>                 Hi Rishabh,
>                 
>                 Maybe it would help (me at least) if you described the
> complete & exact steps for your test - both setup of the env & test and
> command to profile.  Can you send that out?
>                 
>                 Thx
>                 Paul
>                 
>                 -----Original Message-----
>                 From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
>                 Sent: Wednesday, September 4, 2019 2:45 PM
>                 To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris,
> James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <
> paul.e.luse(a)intel.com>
>                 Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <
> hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
>                 Subject: Re: [SPDK] NBD with SPDK
>                 
>                 Yes, I am using 64 q depth with one thread in fio. I am using
> AIO. This profiling is for the entire system. I don't know why spdk threads
> are idle.
>                 
>                 On 9/4/19, 11:08 AM, "Walker, Benjamin" <
> benjamin.walker(a)intel.com> wrote:
>                 
>                     On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
>                     > I got the run again. It is with 4k write.
>                     > 
>                     > 13.16%  vhost                       [.]
>                     >
> spdk_ring_dequeue                                                             
>                     >              
>                     >    6.08%  vhost                       [.]
>                     >
> rte_rdtsc                                                                     
>                     >              
>                     >    4.77%  vhost                       [.]
>                     >
> spdk_thread_poll                                                              
>                     >              
>                     >    2.85%  vhost                       [.]
>                     >
> _spdk_reactor_run                                                             
>                     >  
>                     
>                     You're doing high queue depth for at least 30 seconds
> while the trace runs,
>                     right? Using fio with the libaio engine on the NBD device
> is probably the way to
>                     go. Are you limiting the profiling to just the core where
> the main SPDK process
>                     is pinned? I'm asking because SPDK still appears to be
> mostly idle, and I
>                     suspect the time is being spent in some other thread (in
> the kernel). Consider
>                     capturing a profile for the entire system. It will have
> fio stuff in it, but the
>                     expensive stuff still should generally bubble up to the
> top.
>                     
>                     Thanks,
>                     Ben
>                     
>                     
>                     > 
>                     > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <
> rimittal(a)ebay.com> wrote:
>                     > 
>                     >     I got the profile with first run. 
>                     >     
>                     >       27.91%  vhost                       [.]
>                     >
> spdk_ring_dequeue                                                             
>                     >              
>                     >       12.94%  vhost                       [.]
>                     >
> rte_rdtsc                                                                     
>                     >              
>                     >       11.00%  vhost                       [.]
>                     >
> spdk_thread_poll                                                              
>                     >              
>                     >        6.15%  vhost                       [.]
>                     >
> _spdk_reactor_run                                                             
>                     >              
>                     >        4.35%  [kernel]                    [k]
>                     >
> syscall_return_via_sysret                                                     
>                     >              
>                     >        3.91%  vhost                       [.]
>                     >
> _spdk_msg_queue_run_batch                                                     
>                     >              
>                     >        3.38%  vhost                       [.]
>                     >
> _spdk_event_queue_run_batch                                                   
>                     >              
>                     >        2.83%  [unknown]                   [k]
>                     >
> 0xfffffe000000601b                                                            
>                     >              
>                     >        1.45%  vhost                       [.]
>                     >
> spdk_thread_get_from_ctx                                                      
>                     >              
>                     >        1.20%  [kernel]                    [k]
>                     >
> __fget                                                                        
>                     >              
>                     >        1.14%  libpthread-2.27.so          [.]
>                     >
> __libc_read                                                                   
>                     >              
>                     >        1.00%  libc-2.27.so                [.]
>                     >
> 0x000000000018ef76                                                            
>                     >              
>                     >        0.99%  libc-2.27.so                [.]
> 0x000000000018ef79          
>                     >     
>                     >     Thanks
>                     >     Rishabh Mittal                         
>                     >     
>                     >     On 8/19/19, 7:42 AM, "Luse, Paul E" <
> paul.e.luse(a)intel.com> wrote:
>                     >     
>                     >         That's great.  Keep any eye out for the items
> Ben mentions below - at
>                     > least the first one should be quick to implement and
> compare both profile data
>                     > and measured performance.
>                     >         
>                     >         Don’t' forget about the community meetings
> either, great place to chat
>                     > about these kinds of things.  
>                     > 
> https://spdk.io/community/
>                     >   Next one is tomorrow morn US time.
>                     >         
>                     >         Thx
>                     >         Paul
>                     >         
>                     >         -----Original Message-----
>                     >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On
> Behalf Of Mittal,
>                     > Rishabh via SPDK
>                     >         Sent: Thursday, August 15, 2019 6:50 PM
>                     >         To: Harris, James R <james.r.harris(a)intel.com>;
> Walker, Benjamin <
>                     > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
>                     >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen,
> Xiaoxi <
>                     > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>;
> Kadayam, Hari <
>                     > hkadayam(a)ebay.com>
>                     >         Subject: Re: [SPDK] NBD with SPDK
>                     >         
>                     >         Thanks. I will get the profiling by next week. 
>                     >         
>                     >         On 8/15/19, 6:26 PM, "Harris, James R" <
> james.r.harris(a)intel.com>
>                     > wrote:
>                     >         
>                     >             
>                     >             
>                     >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <
> rimittal(a)ebay.com> wrote:
>                     >             
>                     >                 Hi Jim
>                     >                 
>                     >                 What tool you use to take profiling. 
>                     >             
>                     >             Hi Rishabh,
>                     >             
>                     >             Mostly I just use "perf top".
>                     >             
>                     >             -Jim
>                     >             
>                     >                 
>                     >                 Thanks
>                     >                 Rishabh Mittal
>                     >                 
>                     >                 On 8/14/19, 9:54 AM, "Harris, James R" <
>                     > james.r.harris(a)intel.com> wrote:
>                     >                 
>                     >                     
>                     >                     
>                     >                     On 8/14/19, 9:18 AM, "Walker,
> Benjamin" <
>                     > benjamin.walker(a)intel.com> wrote:
>                     >                     
>                     >                     <trim>
>                     >                         
>                     >                         When an I/O is performed in the
> process initiating the
>                     > I/O to a file, the data
>                     >                         goes into the OS page cache
> buffers at a layer far
>                     > above the bio stack
>                     >                         (somewhere up in VFS). If SPDK
> were to reserve some
>                     > memory and hand it off to
>                     >                         your kernel driver, your kernel
> driver would still
>                     > need to copy it to that
>                     >                         location out of the page cache
> buffers. We can't
>                     > safely share the page cache
>                     >                         buffers with a user space
> process.
>                     >                        
>                     >                     I think Rishabh was suggesting the
> SPDK reserve the
>                     > virtual address space only.
>                     >                     Then the kernel could map the page
> cache buffers into that
>                     > virtual address space.
>                     >                     That would not require a data copy,
> but would require the
>                     > mapping operations.
>                     >                     
>                     >                     I think the profiling data would be
> really helpful - to
>                     > quantify how much of the 50us
>                     >                     Is due to copying the 4KB of
> data.  That can help drive
>                     > next steps on how to optimize
>                     >                     the SPDK NBD module.
>                     >                     
>                     >                     Thanks,
>                     >                     
>                     >                     -Jim
>                     >                     
>                     >                     
>                     >                         As Paul said, I'm skeptical that
> the memcpy is
>                     > significant in the overall
>                     >                         performance you're measuring. I
> encourage you to go
>                     > look at some profiling data
>                     >                         and confirm that the memcpy is
> really showing up. I
>                     > suspect the overhead is
>                     >                         instead primarily in these
> spots:
>                     >                         
>                     >                         1) Dynamic buffer allocation in
> the SPDK NBD backend.
>                     >                         
>                     >                         As Paul indicated, the NBD
> target is dynamically
>                     > allocating memory for each I/O.
>                     >                         The NBD backend wasn't designed
> to be fast - it was
>                     > designed to be simple.
>                     >                         Pooling would be a lot faster
> and is something fairly
>                     > easy to implement.
>                     >                         
>                     >                         2) The way SPDK does the
> syscalls when it implements
>                     > the NBD backend.
>                     >                         
>                     >                         Again, the code was designed to
> be simple, not high
>                     > performance. It simply calls
>                     >                         read() and write() on the socket
> for each command.
>                     > There are much higher
>                     >                         performance ways of doing this,
> they're just more
>                     > complex to implement.
>                     >                         
>                     >                         3) The lack of multi-queue
> support in NBD
>                     >                         
>                     >                         Every I/O is funneled through a
> single sockpair up to
>                     > user space. That means
>                     >                         there is locking going on. I
> believe this is just a
>                     > limitation of NBD today - it
>                     >                         doesn't plug into the block-mq
> stuff in the kernel and
>                     > expose multiple
>                     >                         sockpairs. But someone more
> knowledgeable on the
>                     > kernel stack would need to take
>                     >                         a look.
>                     >                         
>                     >                         Thanks,
>                     >                         Ben
>                     >                         
>                     >                         > 
>                     >                         > Couple of things that I am not
> really sure in this
>                     > flow is :- 1. How memory
>                     >                         > registration is going to work
> with RDMA driver.
>                     >                         > 2. What changes are required
> in spdk memory
>                     > management
>                     >                         > 
>                     >                         > Thanks
>                     >                         > Rishabh Mittal
>                     >                         

>     
>     
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-05 20:11 Luse, Paul E
  0 siblings, 0 replies; 32+ messages in thread
From: Luse, Paul E @ 2019-09-05 20:11 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 25308 bytes --]

Thanks Brian, great write-up, and definitely no need for anything formalized :) I'm going to defer the response to the various vhost experts in the community though...

Thx
Paul

-----Original Message-----
From: Szmyd, Brian [mailto:bszmyd(a)ebay.com] 
Sent: Thursday, September 5, 2019 12:48 PM
To: Mittal, Rishabh <rimittal(a)ebay.com>; Luse, Paul E <paul.e.luse(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>
Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
Subject: Re: [SPDK] NBD with SPDK

Hi Paul,

Rather than put the effort into a formalized document, here is a brief description of the solution I have been investigating, just to get an opinion on its feasibility or even workability. 

Some background and a reiteration of the problem to set things up. I apologize for reiterating anything or including details that some may already know.

We are looking for a solution that allows us to write a custom bdev for the SPDK bdev layer that distributes I/O between the different NVMe-oF targets we have attached, and then presents that to our application as either a raw block device or a filesystem mountpoint.

This is normally (as I understand it) done by exposing a device via QEMU to a VM using the vhost target. This SPDK target implements the virtio-scsi device (among others) according to this spec:

https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-8300021

The VM kernel then uses a virtio-scsi module to attach said device into its SCSI mid-layer and has the device enumerated as a /dev/sd[a-z]+ device.

The problem is that QEMU virtualizes a PCIe bus for the guest kernel virtio-pci driver to discover the virtio devices and bind them to the virtio-scsi driver. There really is no other way (other than platform MMIO type devices) to attach a device to the virtio-scsi driver.

SPDK exposes the virtio device to the VM via QEMU, which has written a "user space" version of the vhost bus (vhost-user). This driver then translates between that API and the virtio-pci device presented to the guest; the protocol is specified here:

https://github.com/qemu/qemu/blob/5d0e5694470d2952b4f257bc985cac8c89b4fd92/docs/interop/vhost-user.rst

This uses an eventfd descriptor for interrupting the non-polling side of the queue and a UNIX domain socket to set up (and control) the shared memory, which contains the I/O buffers and virtio queues. This is documented in SPDK's own documentation and diagrammed here:

https://github.com/spdk/spdk/blob/01103b2e4dfdcf23cc2125164aa116394c8185e8/doc/vhost_processing.md
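
As an illustration of the moving parts described above (this is not QEMU or SPDK code, just a minimal sketch, and /var/tmp/vhost.0 is only an example socket path), the control plane amounts to an eventfd for kicking the non-polling side plus an AF_UNIX connection to the vhost target's socket, over which the vhost-user messages flow:

    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void)
    {
        /* eventfd used to kick/interrupt the non-polling side of a virtqueue. */
        int kick_fd = eventfd(0, 0);

        /* UNIX domain socket over which vhost-user messages are exchanged
         * (memory regions, virtqueue addresses, and the eventfds themselves,
         * passed as SCM_RIGHTS ancillary data). */
        int ctrl_fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, "/var/tmp/vhost.0", sizeof(addr.sun_path) - 1);

        if (kick_fd < 0 || ctrl_fd < 0 ||
            connect(ctrl_fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
            perror("vhost-user control setup");
            return 1;
        }
        /* A real initiator (QEMU today, or the proposed kernel driver) would
         * now negotiate VHOST_USER_* messages over ctrl_fd. */
        close(ctrl_fd);
        close(kick_fd);
        return 0;
    }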

If we could implement this vhost-user QEMU target as a virtio driver in the kernel, as an alternative to the virtio-pci driver, it could bind an SPDK vhost into the host kernel as a virtio device and have it enumerated in the /dev/sd[a-z]+ tree for our containers to bind. Attached is a draft block diagram.

Since we will not have a real bus to signal the driver to probe for new devices, we can use a sysfs interface for the application to notify the driver of a new socket and eventfd pair with which to set up a new virtio-scsi instance (a hypothetical sketch of that interface follows). Otherwise the design simply moves the vhost-user driver from the QEMU application into the host kernel itself.
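
Purely as a sketch of that sysfs hand-off: the class and attribute names below do not exist anywhere today, they are placeholders for the idea, and only the socket path is passed on the assumption that the eventfds would be exchanged over the vhost-user socket itself:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Placeholder attribute that the proposed kernel driver would expose. */
    #define ADD_DEVICE_ATTR "/sys/class/vhost_user_scsi/add_device"

    /* Tell the driver about a new SPDK vhost socket; the driver would then
     * connect to it and negotiate vhost-user exactly as QEMU does today. */
    static int register_vhost_socket(const char *socket_path)
    {
        int fd = open(ADD_DEVICE_ATTR, O_WRONLY);
        ssize_t rc;

        if (fd < 0) {
            perror("open");
            return -1;
        }
        rc = write(fd, socket_path, strlen(socket_path));
        close(fd);
        return rc < 0 ? -1 : 0;
    }

    int main(void)
    {
        /* Example socket path only. */
        return register_vhost_socket("/var/tmp/vhost.0") ? 1 : 0;
    }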

It's my understanding that this would avoid many of the system calls and copies involved in exposing an iSCSI device or NBD device as we're currently discussing. Does this seem feasible?

Thanks,
Brian

On 9/5/19, 12:32 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

    Hi Paul.
    
    Thanks for investigating it. 
    
    We have one more idea floating around. Brian is going to send you a proposal shortly. If the other proposal seems feasible to you, then we can evaluate the work required for both proposals.
    
    Thanks
    Rishabh Mittal
    
    On 9/5/19, 11:09 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    
        Hi,
        
        So I was able to perform the same steps here and I think one of the keys to really seeing what's going on is to start perf top like this:
        
         “perf top --sort comm,dso,symbol -C 0” to get a more focused view by sorting on command, shared object and symbol
        
        Attached are 2 snapshots, one with a NULL back end for nbd and one with libaio/nvme.  Some notes after chatting with Ben a bit; please read through and let us know what you think:
        
        * in both cases the vast majority of the highest overhead activities are in the kernel
        * the "copy_user_enhanced" symbol on the NULL case (it shows up on the other as well, but you have to scroll way down to see it) is the user/kernel space copy; nothing SPDK can do about that
        * the syscalls that dominate in both cases are likely something that can be improved on by changing how SPDK interacts with nbd. Ben had a couple of ideas including (a) using libaio to interact with the nbd fd as opposed to interacting with the nbd socket, (b) "batching" wherever possible, for example on writes to nbd investigate not ack'ing them until some number have completed (a rough sketch of (b) follows these notes)
        * the kernel slab* commands are likely nbd kernel driver allocations/frees in the IO path; one possibility would be to look at optimizing the nbd kernel driver for this one
        * the libc item on the NULL chart also shows up on the libaio profile, however it is again way down the scroll so it didn't make the screenshot :)  This could be a zeroing of something somewhere in the SPDK nbd driver
        
        It looks like this data supports what Ben had suspected a while back: much of the overhead we're looking at is in kernel nbd.  Anyway, let us know what you think, whether you want to explore any of the ideas above further, or whether you see something else in the data that looks worthy of note.
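        
        To make idea (b) a bit more concrete, here is a rough sketch (not SPDK's actual nbd module code) of what batching completions could look like: instead of one write() per completed command, replies are queued and flushed with a single writev().  struct nbd_reply and NBD_REPLY_MAGIC come from <linux/nbd.h>; error handling, partial writes, and the data payload for read commands are all ignored here.
        
            /* Sketch only: batch NBD replies and flush them with one syscall. */
            #include <arpa/inet.h>
            #include <linux/nbd.h>
            #include <stdint.h>
            #include <string.h>
            #include <sys/uio.h>
            
            #define BATCH_MAX 32
            
            struct reply_batch {
                    struct nbd_reply replies[BATCH_MAX];
                    struct iovec iov[BATCH_MAX];
                    int count;
            };
            
            /* Queue one completion instead of write()'ing it immediately. */
            static int batch_add(struct reply_batch *b, const char *handle, uint32_t error)
            {
                    struct nbd_reply *r;
            
                    if (b->count == BATCH_MAX) {
                            return -1; /* caller should batch_flush() first */
                    }
                    r = &b->replies[b->count];
                    r->magic = htonl(NBD_REPLY_MAGIC);
                    r->error = htonl(error);
                    memcpy(r->handle, handle, sizeof(r->handle));
                    b->iov[b->count].iov_base = r;
                    b->iov[b->count].iov_len = sizeof(*r);
                    b->count++;
                    return 0;
            }
            
            /* Flush all queued replies to the nbd socket with a single writev(). */
            static int batch_flush(struct reply_batch *b, int nbd_sock)
            {
                    ssize_t rc = 0;
            
                    if (b->count > 0) {
                            rc = writev(nbd_sock, b->iov, b->count);
                            b->count = 0;
                    }
                    return rc < 0 ? -1 : 0;
            }
        
        Whether this actually helps would of course need to be measured; the win is fewer syscalls per I/O, at the cost of a little extra per-command latency while a batch fills.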
        
        Thx
        Paul
        
        
        
        -----Original Message-----
        From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Luse, Paul E
        Sent: Wednesday, September 4, 2019 4:27 PM
        To: Mittal, Rishabh <rimittal(a)ebay.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org
        Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
        Subject: Re: [SPDK] NBD with SPDK
        
        Cool, thanks for sending this.  I will try and repro tomorrow here and see what kind of results I get
        
        Thx
        Paul
        
        -----Original Message-----
        From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
        Sent: Wednesday, September 4, 2019 4:23 PM
        To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org
        Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
        Subject: Re: [SPDK] NBD with SPDK
        
        Avg CPU utilization is very low when I am running this.
        
        09/04/2019 04:21:40 PM
        avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                   2.59    0.00    2.57    0.00    0.00   94.84
        
        Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
        sda              0.00    0.20      0.00      0.80     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
        sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
        sdc              0.00 28846.80      0.00 191555.20     0.00 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
        nb0              0.00 47297.00      0.00 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00     4.05   0
        
        
        
        On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
        
            I am using this command
            
            fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting
            
            I have created the device by using these commands
            	1.  ./root/spdk/app/vhost
            	2.  ./rpc.py bdev_aio_create /dev/sdc aio0
            	3. /rpc.py start_nbd_disk aio0 /dev/nbd0
            
            I am using  "perf top"  to get the performance 
            
            On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
            
                Hi Rishabh,
                
                Maybe it would help (me at least) if you described the complete & exact steps for your test - both setup of the env & test and command to profile.  Can you send that out?
                
                Thx
                Paul
                
                -----Original Message-----
                From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
                Sent: Wednesday, September 4, 2019 2:45 PM
                To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
                Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
                Subject: Re: [SPDK] NBD with SPDK
                
                Yes, I am using 64 q depth with one thread in fio. I am using AIO. This profiling is for the entire system. I don't know why spdk threads are idle.
                
                On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
                
                    On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
                    > I got the run again. It is with 4k write.
                    > 
                    > 13.16%  vhost                       [.]
                    > spdk_ring_dequeue                                                             
                    >              
                    >    6.08%  vhost                       [.]
                    > rte_rdtsc                                                                     
                    >              
                    >    4.77%  vhost                       [.]
                    > spdk_thread_poll                                                              
                    >              
                    >    2.85%  vhost                       [.]
                    > _spdk_reactor_run                                                             
                    >  
                    
                    You're doing high queue depth for at least 30 seconds while the trace runs,
                    right? Using fio with the libaio engine on the NBD device is probably the way to
                    go. Are you limiting the profiling to just the core where the main SPDK process
                    is pinned? I'm asking because SPDK still appears to be mostly idle, and I
                    suspect the time is being spent in some other thread (in the kernel). Consider
                    capturing a profile for the entire system. It will have fio stuff in it, but the
                    expensive stuff still should generally bubble up to the top.
                    
                    Thanks,
                    Ben
                    
                    
                    > 
                    > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
                    > 
                    >     I got the profile with first run. 
                    >     
                    >       27.91%  vhost                       [.]
                    > spdk_ring_dequeue                                                             
                    >              
                    >       12.94%  vhost                       [.]
                    > rte_rdtsc                                                                     
                    >              
                    >       11.00%  vhost                       [.]
                    > spdk_thread_poll                                                              
                    >              
                    >        6.15%  vhost                       [.]
                    > _spdk_reactor_run                                                             
                    >              
                    >        4.35%  [kernel]                    [k]
                    > syscall_return_via_sysret                                                     
                    >              
                    >        3.91%  vhost                       [.]
                    > _spdk_msg_queue_run_batch                                                     
                    >              
                    >        3.38%  vhost                       [.]
                    > _spdk_event_queue_run_batch                                                   
                    >              
                    >        2.83%  [unknown]                   [k]
                    > 0xfffffe000000601b                                                            
                    >              
                    >        1.45%  vhost                       [.]
                    > spdk_thread_get_from_ctx                                                      
                    >              
                    >        1.20%  [kernel]                    [k]
                    > __fget                                                                        
                    >              
                    >        1.14%  libpthread-2.27.so          [.]
                    > __libc_read                                                                   
                    >              
                    >        1.00%  libc-2.27.so                [.]
                    > 0x000000000018ef76                                                            
                    >              
                    >        0.99%  libc-2.27.so                [.] 0x000000000018ef79          
                    >     
                    >     Thanks
                    >     Rishabh Mittal                         
                    >     
                    >     On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
                    >     
                    >         That's great.  Keep any eye out for the items Ben mentions below - at
                    > least the first one should be quick to implement and compare both profile data
                    > and measured performance.
                    >         
                    >         Don’t' forget about the community meetings either, great place to chat
                    > about these kinds of things.  
                    > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Cbszmyd%40ebay.com%7C52847d18df514b39d8cf08d7322f74ea%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033051721021295&amp;sdata=heRt%2FhB5SPeqWNw44VoCIrt5W9N%2B0ExCXIVFNtzi2Zg%3D&amp;reserved=0
                    >   Next one is tomorrow morn US time.
                    >         
                    >         Thx
                    >         Paul
                    >         
                    >         -----Original Message-----
                    >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
                    > Rishabh via SPDK
                    >         Sent: Thursday, August 15, 2019 6:50 PM
                    >         To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <
                    > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
                    >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <
                    > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <
                    > hkadayam(a)ebay.com>
                    >         Subject: Re: [SPDK] NBD with SPDK
                    >         
                    >         Thanks. I will get the profiling by next week. 
                    >         
                    >         On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com>
                    > wrote:
                    >         
                    >             
                    >             
                    >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
                    >             
                    >                 Hi Jim
                    >                 
                    >                 What tool you use to take profiling. 
                    >             
                    >             Hi Rishabh,
                    >             
                    >             Mostly I just use "perf top".
                    >             
                    >             -Jim
                    >             
                    >                 
                    >                 Thanks
                    >                 Rishabh Mittal
                    >                 
                    >                 On 8/14/19, 9:54 AM, "Harris, James R" <
                    > james.r.harris(a)intel.com> wrote:
                    >                 
                    >                     
                    >                     
                    >                     On 8/14/19, 9:18 AM, "Walker, Benjamin" <
                    > benjamin.walker(a)intel.com> wrote:
                    >                     
                    >                     <trim>
                    >                         
                    >                         When an I/O is performed in the process initiating the
                    > I/O to a file, the data
                    >                         goes into the OS page cache buffers at a layer far
                    > above the bio stack
                    >                         (somewhere up in VFS). If SPDK were to reserve some
                    > memory and hand it off to
                    >                         your kernel driver, your kernel driver would still
                    > need to copy it to that
                    >                         location out of the page cache buffers. We can't
                    > safely share the page cache
                    >                         buffers with a user space process.
                    >                        
                    >                     I think Rishabh was suggesting the SPDK reserve the
                    > virtual address space only.
                    >                     Then the kernel could map the page cache buffers into that
                    > virtual address space.
                    >                     That would not require a data copy, but would require the
                    > mapping operations.
                    >                     
                    >                     I think the profiling data would be really helpful - to
                    > quantify how much of the 50us
                    >                     Is due to copying the 4KB of data.  That can help drive
                    > next steps on how to optimize
                    >                     the SPDK NBD module.
                    >                     
                    >                     Thanks,
                    >                     
                    >                     -Jim
                    >                     
                    >                     
                    >                         As Paul said, I'm skeptical that the memcpy is
                    > significant in the overall
                    >                         performance you're measuring. I encourage you to go
                    > look at some profiling data
                    >                         and confirm that the memcpy is really showing up. I
                    > suspect the overhead is
                    >                         instead primarily in these spots:
                    >                         
                    >                         1) Dynamic buffer allocation in the SPDK NBD backend.
                    >                         
                    >                         As Paul indicated, the NBD target is dynamically
                    > allocating memory for each I/O.
                    >                         The NBD backend wasn't designed to be fast - it was
                    > designed to be simple.
                    >                         Pooling would be a lot faster and is something fairly
                    > easy to implement.
                    >                         
                    >                         2) The way SPDK does the syscalls when it implements
                    > the NBD backend.
                    >                         
                    >                         Again, the code was designed to be simple, not high
                    > performance. It simply calls
                    >                         read() and write() on the socket for each command.
                    > There are much higher
                    >                         performance ways of doing this, they're just more
                    > complex to implement.
                    >                         
                    >                         3) The lack of multi-queue support in NBD
                    >                         
                    >                         Every I/O is funneled through a single sockpair up to
                    > user space. That means
                    >                         there is locking going on. I believe this is just a
                    > limitation of NBD today - it
                    >                         doesn't plug into the block-mq stuff in the kernel and
                    > expose multiple
                    >                         sockpairs. But someone more knowledgeable on the
                    > kernel stack would need to take
                    >                         a look.
                    >                         
                    >                         Thanks,
                    >                         Ben
                    >                         
                    >                         > 
                    >                         > Couple of things that I am not really sure in this
                    > flow is :- 1. How memory
                    >                         > registration is going to work with RDMA driver.
                    >                         > 2. What changes are required in spdk memory
                    > management
                    >                         > 
                    >                         > Thanks
                    >                         > Rishabh Mittal
                    >                         
                    >                     
                    >                     
                    >                 
                    >                 
                    >             
                    >             
                    >         
                    >         _______________________________________________
                    >         SPDK mailing list
                    >         SPDK(a)lists.01.org
                    >         
                    > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cbszmyd%40ebay.com%7C52847d18df514b39d8cf08d7322f74ea%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033051721021295&amp;sdata=JS0ej%2B6QBpawCofiD%2FktD%2Bmzpu3pc1YpsKw5CKVzVBw%3D&amp;reserved=0
                    >         
                    >     
                    >     
                    > 
                    
                    
                
                
            
            
        
        _______________________________________________
        SPDK mailing list
        SPDK(a)lists.01.org
        https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Cbszmyd%40ebay.com%7C52847d18df514b39d8cf08d7322f74ea%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637033051721021295&amp;sdata=JS0ej%2B6QBpawCofiD%2FktD%2Bmzpu3pc1YpsKw5CKVzVBw%3D&amp;reserved=0
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-04 23:27 Luse, Paul E
  0 siblings, 0 replies; 32+ messages in thread
From: Luse, Paul E @ 2019-09-04 23:27 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 16013 bytes --]

Cool, thanks for sending this.  I will try and repro tomorrow here and see what kind of results I get

Thx
Paul

-----Original Message-----
From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
Sent: Wednesday, September 4, 2019 4:23 PM
To: Luse, Paul E <paul.e.luse(a)intel.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org
Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
Subject: Re: [SPDK] NBD with SPDK

Avg CPU utilization is very low when I am running this.

09/04/2019 04:21:40 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.59    0.00    2.57    0.00    0.00   94.84

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda              0.00    0.20      0.00      0.80     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sdc              0.00 28846.80      0.00 191555.20     0.00 18211.00   0.00  38.70    0.00    1.03  29.64     0.00     6.64   0.03 100.00
nb0              0.00 47297.00      0.00 191562.40     0.00   593.60   0.00   1.24    0.00    1.32  61.83     0.00     4.05   0



On 9/4/19, 4:19 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

    I am using this command
    
    fio --name=randwrite --ioengine=libaio --iodepth=8 --rw=write --rwmixread=0  --bsrange=4k-4k --direct=1 -filename=/dev/nbd0 --numjobs=8 --runtime 120 --time_based --group_reporting
    
    I have created the device by using these commands
    	1.  ./root/spdk/app/vhost
    	2.  ./rpc.py bdev_aio_create /dev/sdc aio0
    	3. /rpc.py start_nbd_disk aio0 /dev/nbd0
    
    I am using  "perf top"  to get the performance 
    
    On 9/4/19, 4:03 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    
        Hi Rishabh,
        
        Maybe it would help (me at least) if you described the complete & exact steps for your test - both setup of the env & test and command to profile.  Can you send that out?
        
        Thx
        Paul
        
        -----Original Message-----
        From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
        Sent: Wednesday, September 4, 2019 2:45 PM
        To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
        Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
        Subject: Re: [SPDK] NBD with SPDK
        
        Yes, I am using 64 q depth with one thread in fio. I am using AIO. This profiling is for the entire system. I don't know why spdk threads are idle.
        
        On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
        
            On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
            > I got the run again. It is with 4k write.
            > 
            > 13.16%  vhost                       [.]
            > spdk_ring_dequeue                                                             
            >              
            >    6.08%  vhost                       [.]
            > rte_rdtsc                                                                     
            >              
            >    4.77%  vhost                       [.]
            > spdk_thread_poll                                                              
            >              
            >    2.85%  vhost                       [.]
            > _spdk_reactor_run                                                             
            >  
            
            You're doing high queue depth for at least 30 seconds while the trace runs,
            right? Using fio with the libaio engine on the NBD device is probably the way to
            go. Are you limiting the profiling to just the core where the main SPDK process
            is pinned? I'm asking because SPDK still appears to be mostly idle, and I
            suspect the time is being spent in some other thread (in the kernel). Consider
            capturing a profile for the entire system. It will have fio stuff in it, but the
            expensive stuff still should generally bubble up to the top.
            
            Thanks,
            Ben
            
            
            > 
            > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
            > 
            >     I got the profile with first run. 
            >     
            >       27.91%  vhost                       [.]
            > spdk_ring_dequeue                                                             
            >              
            >       12.94%  vhost                       [.]
            > rte_rdtsc                                                                     
            >              
            >       11.00%  vhost                       [.]
            > spdk_thread_poll                                                              
            >              
            >        6.15%  vhost                       [.]
            > _spdk_reactor_run                                                             
            >              
            >        4.35%  [kernel]                    [k]
            > syscall_return_via_sysret                                                     
            >              
            >        3.91%  vhost                       [.]
            > _spdk_msg_queue_run_batch                                                     
            >              
            >        3.38%  vhost                       [.]
            > _spdk_event_queue_run_batch                                                   
            >              
            >        2.83%  [unknown]                   [k]
            > 0xfffffe000000601b                                                            
            >              
            >        1.45%  vhost                       [.]
            > spdk_thread_get_from_ctx                                                      
            >              
            >        1.20%  [kernel]                    [k]
            > __fget                                                                        
            >              
            >        1.14%  libpthread-2.27.so          [.]
            > __libc_read                                                                   
            >              
            >        1.00%  libc-2.27.so                [.]
            > 0x000000000018ef76                                                            
            >              
            >        0.99%  libc-2.27.so                [.] 0x000000000018ef79          
            >     
            >     Thanks
            >     Rishabh Mittal                         
            >     
            >     On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
            >     
            >         That's great.  Keep any eye out for the items Ben mentions below - at
            > least the first one should be quick to implement and compare both profile data
            > and measured performance.
            >         
            >         Don’t' forget about the community meetings either, great place to chat
            > about these kinds of things.  
            > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Crimittal%40ebay.com%7C1bba5013016a4b69435908d7318c0e68%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637032349916269231&amp;sdata=FI1RsbUF1gKjlQIyKu317GeF9QuQVGspLdI7MkF5zZE%3D&amp;reserved=0
            >   Next one is tomorrow morn US time.
            >         
            >         Thx
            >         Paul
            >         
            >         -----Original Message-----
            >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
            > Rishabh via SPDK
            >         Sent: Thursday, August 15, 2019 6:50 PM
            >         To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <
            > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
            >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <
            > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <
            > hkadayam(a)ebay.com>
            >         Subject: Re: [SPDK] NBD with SPDK
            >         
            >         Thanks. I will get the profiling by next week. 
            >         
            >         On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com>
            > wrote:
            >         
            >             
            >             
            >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
            >             
            >                 Hi Jim
            >                 
            >                 What tool you use to take profiling. 
            >             
            >             Hi Rishabh,
            >             
            >             Mostly I just use "perf top".
            >             
            >             -Jim
            >             
            >                 
            >                 Thanks
            >                 Rishabh Mittal
            >                 
            >                 On 8/14/19, 9:54 AM, "Harris, James R" <
            > james.r.harris(a)intel.com> wrote:
            >                 
            >                     
            >                     
            >                     On 8/14/19, 9:18 AM, "Walker, Benjamin" <
            > benjamin.walker(a)intel.com> wrote:
            >                     
            >                     <trim>
            >                         
            >                         When an I/O is performed in the process initiating the
            > I/O to a file, the data
            >                         goes into the OS page cache buffers at a layer far
            > above the bio stack
            >                         (somewhere up in VFS). If SPDK were to reserve some
            > memory and hand it off to
            >                         your kernel driver, your kernel driver would still
            > need to copy it to that
            >                         location out of the page cache buffers. We can't
            > safely share the page cache
            >                         buffers with a user space process.
            >                        
            >                     I think Rishabh was suggesting the SPDK reserve the
            > virtual address space only.
            >                     Then the kernel could map the page cache buffers into that
            > virtual address space.
            >                     That would not require a data copy, but would require the
            > mapping operations.
            >                     
            >                     I think the profiling data would be really helpful - to
            > quantify how much of the 50us
            >                     Is due to copying the 4KB of data.  That can help drive
            > next steps on how to optimize
            >                     the SPDK NBD module.
            >                     
            >                     Thanks,
            >                     
            >                     -Jim
            >                     
            >                     
            >                         As Paul said, I'm skeptical that the memcpy is
            > significant in the overall
            >                         performance you're measuring. I encourage you to go
            > look at some profiling data
            >                         and confirm that the memcpy is really showing up. I
            > suspect the overhead is
            >                         instead primarily in these spots:
            >                         
            >                         1) Dynamic buffer allocation in the SPDK NBD backend.
            >                         
            >                         As Paul indicated, the NBD target is dynamically
            > allocating memory for each I/O.
            >                         The NBD backend wasn't designed to be fast - it was
            > designed to be simple.
            >                         Pooling would be a lot faster and is something fairly
            > easy to implement.
            >                         
            >                         2) The way SPDK does the syscalls when it implements
            > the NBD backend.
            >                         
            >                         Again, the code was designed to be simple, not high
            > performance. It simply calls
            >                         read() and write() on the socket for each command.
            > There are much higher
            >                         performance ways of doing this, they're just more
            > complex to implement.
            >                         
            >                         3) The lack of multi-queue support in NBD
            >                         
            >                         Every I/O is funneled through a single sockpair up to
            > user space. That means
            >                         there is locking going on. I believe this is just a
            > limitation of NBD today - it
            >                         doesn't plug into the block-mq stuff in the kernel and
            > expose multiple
            >                         sockpairs. But someone more knowledgeable on the
            > kernel stack would need to take
            >                         a look.
            >                         
            >                         Thanks,
            >                         Ben
            >                         
            >                         > 
            >                         > Couple of things that I am not really sure in this
            > flow is :- 1. How memory
            >                         > registration is going to work with RDMA driver.
            >                         > 2. What changes are required in spdk memory
            > management
            >                         > 
            >                         > Thanks
            >                         > Rishabh Mittal
            >                         
            >                     
            >                     
            >                 
            >                 
            >             
            >             
            >         
            >         _______________________________________________
            >         SPDK mailing list
            >         SPDK(a)lists.01.org
            >         
            > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Crimittal%40ebay.com%7C1bba5013016a4b69435908d7318c0e68%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637032349916269231&amp;sdata=zNZpYAyUoBjjAvBzT7PH2uaw60CTuL1tql27a3jRRKs%3D&amp;reserved=0
            >         
            >     
            >     
            > 
            
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-04 23:03 Luse, Paul E
  0 siblings, 0 replies; 32+ messages in thread
From: Luse, Paul E @ 2019-09-04 23:03 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 11925 bytes --]

Hi Rishabh,

Maybe it would help (me at least) if you described the complete & exact steps for your test - both setup of the env & test and command to profile.  Can you send that out?

Thx
Paul

-----Original Message-----
From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
Sent: Wednesday, September 4, 2019 2:45 PM
To: Walker, Benjamin <benjamin.walker(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; spdk(a)lists.01.org; Luse, Paul E <paul.e.luse(a)intel.com>
Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
Subject: Re: [SPDK] NBD with SPDK

Yes, I am using 64 q depth with one thread in fio. I am using AIO. This profiling is for the entire system. I don't know why spdk threads are idle.

On 9/4/19, 11:08 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

    On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
    > I got the run again. It is with 4k write.
    > 
    > 13.16%  vhost                       [.]
    > spdk_ring_dequeue                                                             
    >              
    >    6.08%  vhost                       [.]
    > rte_rdtsc                                                                     
    >              
    >    4.77%  vhost                       [.]
    > spdk_thread_poll                                                              
    >              
    >    2.85%  vhost                       [.]
    > _spdk_reactor_run                                                             
    >  
    
    You're doing high queue depth for at least 30 seconds while the trace runs,
    right? Using fio with the libaio engine on the NBD device is probably the way to
    go. Are you limiting the profiling to just the core where the main SPDK process
    is pinned? I'm asking because SPDK still appears to be mostly idle, and I
    suspect the time is being spent in some other thread (in the kernel). Consider
    capturing a profile for the entire system. It will have fio stuff in it, but the
    expensive stuff still should generally bubble up to the top.
    
    Thanks,
    Ben
    
    
    > 
    > On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    > 
    >     I got the profile with first run. 
    >     
    >       27.91%  vhost                       [.]
    > spdk_ring_dequeue                                                             
    >              
    >       12.94%  vhost                       [.]
    > rte_rdtsc                                                                     
    >              
    >       11.00%  vhost                       [.]
    > spdk_thread_poll                                                              
    >              
    >        6.15%  vhost                       [.]
    > _spdk_reactor_run                                                             
    >              
    >        4.35%  [kernel]                    [k]
    > syscall_return_via_sysret                                                     
    >              
    >        3.91%  vhost                       [.]
    > _spdk_msg_queue_run_batch                                                     
    >              
    >        3.38%  vhost                       [.]
    > _spdk_event_queue_run_batch                                                   
    >              
    >        2.83%  [unknown]                   [k]
    > 0xfffffe000000601b                                                            
    >              
    >        1.45%  vhost                       [.]
    > spdk_thread_get_from_ctx                                                      
    >              
    >        1.20%  [kernel]                    [k]
    > __fget                                                                        
    >              
    >        1.14%  libpthread-2.27.so          [.]
    > __libc_read                                                                   
    >              
    >        1.00%  libc-2.27.so                [.]
    > 0x000000000018ef76                                                            
    >              
    >        0.99%  libc-2.27.so                [.] 0x000000000018ef79          
    >     
    >     Thanks
    >     Rishabh Mittal                         
    >     
    >     On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    >     
    >         That's great.  Keep any eye out for the items Ben mentions below - at
    > least the first one should be quick to implement and compare both profile data
    > and measured performance.
    >         
    >         Don’t' forget about the community meetings either, great place to chat
    > about these kinds of things.  
    > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Crimittal%40ebay.com%7C3d01fd5e4702408e4c1108d73162e234%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637032173088111750&amp;sdata=GnZbN7PFkn04M%2Bs4lok0YSGkPzEzYWdUjngVELJ6VDA%3D&amp;reserved=0
    >   Next one is tomorrow morn US time.
    >         
    >         Thx
    >         Paul
    >         
    >         -----Original Message-----
    >         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
    > Rishabh via SPDK
    >         Sent: Thursday, August 15, 2019 6:50 PM
    >         To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <
    > benjamin.walker(a)intel.com>; spdk(a)lists.01.org
    >         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <
    > xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <
    > hkadayam(a)ebay.com>
    >         Subject: Re: [SPDK] NBD with SPDK
    >         
    >         Thanks. I will get the profiling by next week. 
    >         
    >         On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com>
    > wrote:
    >         
    >             
    >             
    >             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    >             
    >                 Hi Jim
    >                 
    >                 What tool you use to take profiling. 
    >             
    >             Hi Rishabh,
    >             
    >             Mostly I just use "perf top".
    >             
    >             -Jim
    >             
    >                 
    >                 Thanks
    >                 Rishabh Mittal
    >                 
    >                 On 8/14/19, 9:54 AM, "Harris, James R" <
    > james.r.harris(a)intel.com> wrote:
    >                 
    >                     
    >                     
    >                     On 8/14/19, 9:18 AM, "Walker, Benjamin" <
    > benjamin.walker(a)intel.com> wrote:
    >                     
    >                     <trim>
    >                         
    >                         When an I/O is performed in the process initiating the
    > I/O to a file, the data
    >                         goes into the OS page cache buffers at a layer far
    > above the bio stack
    >                         (somewhere up in VFS). If SPDK were to reserve some
    > memory and hand it off to
    >                         your kernel driver, your kernel driver would still
    > need to copy it to that
    >                         location out of the page cache buffers. We can't
    > safely share the page cache
    >                         buffers with a user space process.
    >                        
    >                     I think Rishabh was suggesting the SPDK reserve the
    > virtual address space only.
    >                     Then the kernel could map the page cache buffers into that
    > virtual address space.
    >                     That would not require a data copy, but would require the
    > mapping operations.
    >                     
    >                     I think the profiling data would be really helpful - to
    > quantify how much of the 50us
    >                     Is due to copying the 4KB of data.  That can help drive
    > next steps on how to optimize
    >                     the SPDK NBD module.
    >                     
    >                     Thanks,
    >                     
    >                     -Jim
    >                     
    >                     
    >                         As Paul said, I'm skeptical that the memcpy is
    > significant in the overall
    >                         performance you're measuring. I encourage you to go
    > look at some profiling data
    >                         and confirm that the memcpy is really showing up. I
    > suspect the overhead is
    >                         instead primarily in these spots:
    >                         
    >                         1) Dynamic buffer allocation in the SPDK NBD backend.
    >                         
    >                         As Paul indicated, the NBD target is dynamically
    > allocating memory for each I/O.
    >                         The NBD backend wasn't designed to be fast - it was
    > designed to be simple.
    >                         Pooling would be a lot faster and is something fairly
    > easy to implement.
    >                         
    >                         2) The way SPDK does the syscalls when it implements
    > the NBD backend.
    >                         
    >                         Again, the code was designed to be simple, not high
    > performance. It simply calls
    >                         read() and write() on the socket for each command.
    > There are much higher
    >                         performance ways of doing this, they're just more
    > complex to implement.
    >                         
    >                         3) The lack of multi-queue support in NBD
    >                         
    >                         Every I/O is funneled through a single sockpair up to
    > user space. That means
    >                         there is locking going on. I believe this is just a
    > limitation of NBD today - it
    >                         doesn't plug into the block-mq stuff in the kernel and
    > expose multiple
    >                         sockpairs. But someone more knowledgeable on the
    > kernel stack would need to take
    >                         a look.
    >                         
    >                         Thanks,
    >                         Ben
    >                         
    >                         > 
    >                         > Couple of things that I am not really sure in this
    > flow is :- 1. How memory
    >                         > registration is going to work with RDMA driver.
    >                         > 2. What changes are required in spdk memory
    > management
    >                         > 
    >                         > Thanks
    >                         > Rishabh Mittal
    >                         
    >                     
    >                     
    >                 
    >                 
    >             
    >             
    >         
    >         _______________________________________________
    >         SPDK mailing list
    >         SPDK(a)lists.01.org
    >         
    > https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Crimittal%40ebay.com%7C3d01fd5e4702408e4c1108d73162e234%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637032173088111750&amp;sdata=IcBOJKqWOr9cKgXAulpR%2FSVd1BU%2FU9pDk2baxevpv8Q%3D&amp;reserved=0
    >         
    >     
    >     
    > 
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-09-04 18:08 Walker, Benjamin
  0 siblings, 0 replies; 32+ messages in thread
From: Walker, Benjamin @ 2019-09-04 18:08 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10123 bytes --]

On Fri, 2019-08-30 at 22:28 +0000, Mittal, Rishabh wrote:
> I got the run again. It is with 4k write.
> 
> 13.16%  vhost                       [.] spdk_ring_dequeue
>  6.08%  vhost                       [.] rte_rdtsc
>  4.77%  vhost                       [.] spdk_thread_poll
>  2.85%  vhost                       [.] _spdk_reactor_run

You're doing high queue depth for at least 30 seconds while the trace runs,
right? Using fio with the libaio engine on the NBD device is probably the way to
go. Are you limiting the profiling to just the core where the main SPDK process
is pinned? I'm asking because SPDK still appears to be mostly idle, and I
suspect the time is being spent in some other thread (in the kernel). Consider
capturing a profile for the entire system. It will have fio stuff in it, but the
expensive stuff still should generally bubble up to the top.

Thanks,
Ben


> 
> On 8/29/19, 6:05 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
> 
>     I got the profile with first run. 
>     
>       27.91%  vhost                       [.] spdk_ring_dequeue
>       12.94%  vhost                       [.] rte_rdtsc
>       11.00%  vhost                       [.] spdk_thread_poll
>        6.15%  vhost                       [.] _spdk_reactor_run
>        4.35%  [kernel]                    [k] syscall_return_via_sysret
>        3.91%  vhost                       [.] _spdk_msg_queue_run_batch
>        3.38%  vhost                       [.] _spdk_event_queue_run_batch
>        2.83%  [unknown]                   [k] 0xfffffe000000601b
>        1.45%  vhost                       [.] spdk_thread_get_from_ctx
>        1.20%  [kernel]                    [k] __fget
>        1.14%  libpthread-2.27.so          [.] __libc_read
>        1.00%  libc-2.27.so                [.] 0x000000000018ef76
>        0.99%  libc-2.27.so                [.] 0x000000000018ef79
>     
>     Thanks
>     Rishabh Mittal                         
>     
>     On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
>     
>         That's great.  Keep any eye out for the items Ben mentions below - at
> least the first one should be quick to implement and compare both profile data
> and measured performance.
>         
>         Don’t' forget about the community meetings either, great place to chat
> about these kinds of things.  
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Crimittal%40ebay.com%7Cd5c75891ea414963501c08d724b36248%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637018225183900855&amp;sdata=wEMi40AMPeGVt3XX3bHfneHqM0LFEB8Jt%2F9dQl6cIBE%3D&amp;reserved=0
>   Next one is tomorrow morn US time.
>         
>         Thx
>         Paul
>         
>         -----Original Message-----
>         From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal,
> Rishabh via SPDK
>         Sent: Thursday, August 15, 2019 6:50 PM
>         To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <
> benjamin.walker(a)intel.com>; spdk(a)lists.01.org
>         Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <
> xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <
> hkadayam(a)ebay.com>
>         Subject: Re: [SPDK] NBD with SPDK
>         
>         Thanks. I will get the profiling by next week. 
>         
>         On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com>
> wrote:
>         
>             
>             
>             On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
>             
>                 Hi Jim
>                 
>                 What tool you use to take profiling. 
>             
>             Hi Rishabh,
>             
>             Mostly I just use "perf top".
>             
>             -Jim
>             
>                 
>                 Thanks
>                 Rishabh Mittal
>                 
>                 On 8/14/19, 9:54 AM, "Harris, James R" <
> james.r.harris(a)intel.com> wrote:
>                 
>                     
>                     
>                     On 8/14/19, 9:18 AM, "Walker, Benjamin" <
> benjamin.walker(a)intel.com> wrote:
>                     
>                     <trim>
>                         
>                         When an I/O is performed in the process initiating the
> I/O to a file, the data
>                         goes into the OS page cache buffers at a layer far
> above the bio stack
>                         (somewhere up in VFS). If SPDK were to reserve some
> memory and hand it off to
>                         your kernel driver, your kernel driver would still
> need to copy it to that
>                         location out of the page cache buffers. We can't
> safely share the page cache
>                         buffers with a user space process.
>                        
>                     I think Rishabh was suggesting the SPDK reserve the
> virtual address space only.
>                     Then the kernel could map the page cache buffers into that
> virtual address space.
>                     That would not require a data copy, but would require the
> mapping operations.
>                     
>                     I think the profiling data would be really helpful - to
> quantify how much of the 50us
>                     Is due to copying the 4KB of data.  That can help drive
> next steps on how to optimize
>                     the SPDK NBD module.
>                     
>                     Thanks,
>                     
>                     -Jim
>                     
>                     
>                         As Paul said, I'm skeptical that the memcpy is
> significant in the overall
>                         performance you're measuring. I encourage you to go
> look at some profiling data
>                         and confirm that the memcpy is really showing up. I
> suspect the overhead is
>                         instead primarily in these spots:
>                         
>                         1) Dynamic buffer allocation in the SPDK NBD backend.
>                         
>                         As Paul indicated, the NBD target is dynamically
> allocating memory for each I/O.
>                         The NBD backend wasn't designed to be fast - it was
> designed to be simple.
>                         Pooling would be a lot faster and is something fairly
> easy to implement.
>                         
>                         2) The way SPDK does the syscalls when it implements
> the NBD backend.
>                         
>                         Again, the code was designed to be simple, not high
> performance. It simply calls
>                         read() and write() on the socket for each command.
> There are much higher
>                         performance ways of doing this, they're just more
> complex to implement.
>                         
>                         3) The lack of multi-queue support in NBD
>                         
>                         Every I/O is funneled through a single sockpair up to
> user space. That means
>                         there is locking going on. I believe this is just a
> limitation of NBD today - it
>                         doesn't plug into the block-mq stuff in the kernel and
> expose multiple
>                         sockpairs. But someone more knowledgeable on the
> kernel stack would need to take
>                         a look.
>                         
>                         Thanks,
>                         Ben
>                         
>                         > 
>                         > Couple of things that I am not really sure in this
> flow is :- 1. How memory
>                         > registration is going to work with RDMA driver.
>                         > 2. What changes are required in spdk memory
> management
>                         > 
>                         > Thanks
>                         > Rishabh Mittal
>                         
>                     
>                     
>                 
>                 
>             
>             
>         
>         _______________________________________________
>         SPDK mailing list
>         SPDK(a)lists.01.org
>         
> https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Crimittal%40ebay.com%7Cd5c75891ea414963501c08d724b36248%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637018225183900855&amp;sdata=9QDXP2O4MWvrQmKitBJONSkZZHXrRqfFXPrDqltPYjM%3D&amp;reserved=0
>         
>     
>     
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-30 17:06 Walker, Benjamin
  0 siblings, 0 replies; 32+ messages in thread
From: Walker, Benjamin @ 2019-08-30 17:06 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8086 bytes --]

Hi Rishabh,

This looks like what I'd expect the profile to show if the system was idle. What workload was running while you did your profiling? Was the workload active the entire time of the profile?

Thanks,
Ben

On Fri, 2019-08-30 at 01:05 +0000, Mittal, Rishabh wrote:

I got the profile with first run.

  27.91%  vhost                       [.] spdk_ring_dequeue
  12.94%  vhost                       [.] rte_rdtsc
  11.00%  vhost                       [.] spdk_thread_poll
   6.15%  vhost                       [.] _spdk_reactor_run
   4.35%  [kernel]                    [k] syscall_return_via_sysret
   3.91%  vhost                       [.] _spdk_msg_queue_run_batch
   3.38%  vhost                       [.] _spdk_event_queue_run_batch
   2.83%  [unknown]                   [k] 0xfffffe000000601b
   1.45%  vhost                       [.] spdk_thread_get_from_ctx
   1.20%  [kernel]                    [k] __fget
   1.14%  libpthread-2.27.so          [.] __libc_read
   1.00%  libc-2.27.so                [.] 0x000000000018ef76
   0.99%  libc-2.27.so                [.] 0x000000000018ef79

Thanks
Rishabh Mittal

On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:

    That's great.  Keep any eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.

    Don’t' forget about the community meetings either, great place to chat about these kinds of things.  https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Crimittal%40ebay.com%7Cd5c75891ea414963501c08d724b36248%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637018225183900855&amp;sdata=wEMi40AMPeGVt3XX3bHfneHqM0LFEB8Jt%2F9dQl6cIBE%3D&amp;reserved=0  Next one is tomorrow morn US time.

    Thx
    Paul

    -----Original Message-----
    From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
    Sent: Thursday, August 15, 2019 6:50 PM
    To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
    Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
    Subject: Re: [SPDK] NBD with SPDK

    Thanks. I will get the profiling by next week.

    On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

        On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

            Hi Jim

            What tool you use to take profiling.

        Hi Rishabh,

        Mostly I just use "perf top".

        -Jim

            Thanks
            Rishabh Mittal

            On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

                On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

                <trim>

                    When an I/O is performed in the process initiating the I/O to a file, the data
                    goes into the OS page cache buffers at a layer far above the bio stack
                    (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
                    your kernel driver, your kernel driver would still need to copy it to that
                    location out of the page cache buffers. We can't safely share the page cache
                    buffers with a user space process.

                I think Rishabh was suggesting the SPDK reserve the virtual address space only.
                Then the kernel could map the page cache buffers into that virtual address space.
                That would not require a data copy, but would require the mapping operations.

                I think the profiling data would be really helpful - to quantify how much of the 50us
                Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
                the SPDK NBD module.

                Thanks,

                -Jim

                    As Paul said, I'm skeptical that the memcpy is significant in the overall
                    performance you're measuring. I encourage you to go look at some profiling data
                    and confirm that the memcpy is really showing up. I suspect the overhead is
                    instead primarily in these spots:

                    1) Dynamic buffer allocation in the SPDK NBD backend.

                    As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
                    The NBD backend wasn't designed to be fast - it was designed to be simple.
                    Pooling would be a lot faster and is something fairly easy to implement.

                    2) The way SPDK does the syscalls when it implements the NBD backend.

                    Again, the code was designed to be simple, not high performance. It simply calls
                    read() and write() on the socket for each command. There are much higher
                    performance ways of doing this, they're just more complex to implement.

                    3) The lack of multi-queue support in NBD

                    Every I/O is funneled through a single sockpair up to user space. That means
                    there is locking going on. I believe this is just a limitation of NBD today - it
                    doesn't plug into the block-mq stuff in the kernel and expose multiple
                    sockpairs. But someone more knowledgeable on the kernel stack would need to take
                    a look.

                    Thanks,
                    Ben

                    > 
                    > Couple of things that I am not really sure in this flow is :- 1. How memory
                    > registration is going to work with RDMA driver.
                    > 2. What changes are required in spdk memory management
                    > 
                    > Thanks
                    > Rishabh Mittal

    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Crimittal%40ebay.com%7Cd5c75891ea414963501c08d724b36248%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637018225183900855&amp;sdata=9QDXP2O4MWvrQmKitBJONSkZZHXrRqfFXPrDqltPYjM%3D&amp;reserved=0

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-30  1:05 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-30  1:05 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7786 bytes --]

I got the profile with first run. 

  27.91%  vhost                       [.] spdk_ring_dequeue                                                                          
  12.94%  vhost                       [.] rte_rdtsc                                                                                  
  11.00%  vhost                       [.] spdk_thread_poll                                                                           
   6.15%  vhost                       [.] _spdk_reactor_run                                                                          
   4.35%  [kernel]                    [k] syscall_return_via_sysret                                                                  
   3.91%  vhost                       [.] _spdk_msg_queue_run_batch                                                                  
   3.38%  vhost                       [.] _spdk_event_queue_run_batch                                                                
   2.83%  [unknown]                   [k] 0xfffffe000000601b                                                                         
   1.45%  vhost                       [.] spdk_thread_get_from_ctx                                                                   
   1.20%  [kernel]                    [k] __fget                                                                                     
   1.14%  libpthread-2.27.so          [.] __libc_read                                                                                
   1.00%  libc-2.27.so                [.] 0x000000000018ef76                                                                         
   0.99%  libc-2.27.so                [.] 0x000000000018ef79          

Thanks
Rishabh Mittal                         

On 8/19/19, 7:42 AM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:

    That's great.  Keep any eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.
    
    Don’t' forget about the community meetings either, great place to chat about these kinds of things.  https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fspdk.io%2Fcommunity%2F&amp;data=02%7C01%7Crimittal%40ebay.com%7Cd5c75891ea414963501c08d724b36248%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637018225183900855&amp;sdata=wEMi40AMPeGVt3XX3bHfneHqM0LFEB8Jt%2F9dQl6cIBE%3D&amp;reserved=0  Next one is tomorrow morn US time.
    
    Thx
    Paul
    
    -----Original Message-----
    From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
    Sent: Thursday, August 15, 2019 6:50 PM
    To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
    Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
    Subject: Re: [SPDK] NBD with SPDK
    
    Thanks. I will get the profiling by next week. 
    
    On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
    
        
        
        On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
        
            Hi Jim
            
            What tool you use to take profiling. 
        
        Hi Rishabh,
        
        Mostly I just use "perf top".
        
        -Jim
        
            
            Thanks
            Rishabh Mittal
            
            On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
            
                
                
                On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
                
                <trim>
                    
                    When an I/O is performed in the process initiating the I/O to a file, the data
                    goes into the OS page cache buffers at a layer far above the bio stack
                    (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
                    your kernel driver, your kernel driver would still need to copy it to that
                    location out of the page cache buffers. We can't safely share the page cache
                    buffers with a user space process.
                   
                I think Rishabh was suggesting the SPDK reserve the virtual address space only.
                Then the kernel could map the page cache buffers into that virtual address space.
                That would not require a data copy, but would require the mapping operations.
                
                I think the profiling data would be really helpful - to quantify how much of the 50us
                Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
                the SPDK NBD module.
                
                Thanks,
                
                -Jim
                
                
                    As Paul said, I'm skeptical that the memcpy is significant in the overall
                    performance you're measuring. I encourage you to go look at some profiling data
                    and confirm that the memcpy is really showing up. I suspect the overhead is
                    instead primarily in these spots:
                    
                    1) Dynamic buffer allocation in the SPDK NBD backend.
                    
                    As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
                    The NBD backend wasn't designed to be fast - it was designed to be simple.
                    Pooling would be a lot faster and is something fairly easy to implement.
                    
                    2) The way SPDK does the syscalls when it implements the NBD backend.
                    
                    Again, the code was designed to be simple, not high performance. It simply calls
                    read() and write() on the socket for each command. There are much higher
                    performance ways of doing this, they're just more complex to implement.
                    
                    3) The lack of multi-queue support in NBD
                    
                    Every I/O is funneled through a single sockpair up to user space. That means
                    there is locking going on. I believe this is just a limitation of NBD today - it
                    doesn't plug into the block-mq stuff in the kernel and expose multiple
                    sockpairs. But someone more knowledgeable on the kernel stack would need to take
                    a look.
                    
                    Thanks,
                    Ben
                    
                    > 
                    > Couple of things that I am not really sure in this flow is :- 1. How memory
                    > registration is going to work with RDMA driver.
                    > 2. What changes are required in spdk memory management
                    > 
                    > Thanks
                    > Rishabh Mittal
                    
                
                
            
            
        
        
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://nam01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.01.org%2Fmailman%2Flistinfo%2Fspdk&amp;data=02%7C01%7Crimittal%40ebay.com%7Cd5c75891ea414963501c08d724b36248%7C46326bff992841a0baca17c16c94ea99%7C0%7C0%7C637018225183900855&amp;sdata=9QDXP2O4MWvrQmKitBJONSkZZHXrRqfFXPrDqltPYjM%3D&amp;reserved=0
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-19 14:41 Luse, Paul E
  0 siblings, 0 replies; 32+ messages in thread
From: Luse, Paul E @ 2019-08-19 14:41 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4931 bytes --]

That's great.  Keep an eye out for the items Ben mentions below - at least the first one should be quick to implement and compare both profile data and measured performance.

Don't forget about the community meetings either; they're a great place to chat about these kinds of things.  https://spdk.io/community/  Next one is tomorrow morn US time.

Thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Mittal, Rishabh via SPDK
Sent: Thursday, August 15, 2019 6:50 PM
To: Harris, James R <james.r.harris(a)intel.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; spdk(a)lists.01.org
Cc: Mittal, Rishabh <rimittal(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
Subject: Re: [SPDK] NBD with SPDK

Thanks. I will get the profiling by next week. 

On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    
    
    On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    
        Hi Jim
        
        What tool you use to take profiling. 
    
    Hi Rishabh,
    
    Mostly I just use "perf top".
    
    -Jim
    
        
        Thanks
        Rishabh Mittal
        
        On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
        
            
            
            On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
            
            <trim>
                
                When an I/O is performed in the process initiating the I/O to a file, the data
                goes into the OS page cache buffers at a layer far above the bio stack
                (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
                your kernel driver, your kernel driver would still need to copy it to that
                location out of the page cache buffers. We can't safely share the page cache
                buffers with a user space process.
               
            I think Rishabh was suggesting the SPDK reserve the virtual address space only.
            Then the kernel could map the page cache buffers into that virtual address space.
            That would not require a data copy, but would require the mapping operations.
            
            I think the profiling data would be really helpful - to quantify how much of the 50us
            Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
            the SPDK NBD module.
            
            Thanks,
            
            -Jim
            
            
                As Paul said, I'm skeptical that the memcpy is significant in the overall
                performance you're measuring. I encourage you to go look at some profiling data
                and confirm that the memcpy is really showing up. I suspect the overhead is
                instead primarily in these spots:
                
                1) Dynamic buffer allocation in the SPDK NBD backend.
                
                As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
                The NBD backend wasn't designed to be fast - it was designed to be simple.
                Pooling would be a lot faster and is something fairly easy to implement.
                
                2) The way SPDK does the syscalls when it implements the NBD backend.
                
                Again, the code was designed to be simple, not high performance. It simply calls
                read() and write() on the socket for each command. There are much higher
                performance ways of doing this, they're just more complex to implement.
                
                3) The lack of multi-queue support in NBD
                
                Every I/O is funneled through a single sockpair up to user space. That means
                there is locking going on. I believe this is just a limitation of NBD today - it
                doesn't plug into the block-mq stuff in the kernel and expose multiple
                sockpairs. But someone more knowledgeable on the kernel stack would need to take
                a look.
                
                Thanks,
                Ben
                
                > 
                > Couple of things that I am not really sure in this flow is :- 1. How memory
                > registration is going to work with RDMA driver.
                > 2. What changes are required in spdk memory management
                > 
                > Thanks
                > Rishabh Mittal
                
            
            
        
        
    
    

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-16  1:50 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-16  1:50 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3975 bytes --]

Thanks. I will get the profiling by next week. 

On 8/15/19, 6:26 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    
    
    On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    
        Hi Jim
        
        What tool you use to take profiling. 
    
    Hi Rishabh,
    
    Mostly I just use "perf top".
    
    -Jim
    
        
        Thanks
        Rishabh Mittal
        
        On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
        
            
            
            On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
            
            <trim>
                
                When an I/O is performed in the process initiating the I/O to a file, the data
                goes into the OS page cache buffers at a layer far above the bio stack
                (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
                your kernel driver, your kernel driver would still need to copy it to that
                location out of the page cache buffers. We can't safely share the page cache
                buffers with a user space process.
               
            I think Rishabh was suggesting the SPDK reserve the virtual address space only.
            Then the kernel could map the page cache buffers into that virtual address space.
            That would not require a data copy, but would require the mapping operations.
            
            I think the profiling data would be really helpful - to quantify how much of the 50us
            Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
            the SPDK NBD module.
            
            Thanks,
            
            -Jim
            
            
                As Paul said, I'm skeptical that the memcpy is significant in the overall
                performance you're measuring. I encourage you to go look at some profiling data
                and confirm that the memcpy is really showing up. I suspect the overhead is
                instead primarily in these spots:
                
                1) Dynamic buffer allocation in the SPDK NBD backend.
                
                As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
                The NBD backend wasn't designed to be fast - it was designed to be simple.
                Pooling would be a lot faster and is something fairly easy to implement.
                
                2) The way SPDK does the syscalls when it implements the NBD backend.
                
                Again, the code was designed to be simple, not high performance. It simply calls
                read() and write() on the socket for each command. There are much higher
                performance ways of doing this, they're just more complex to implement.
                
                3) The lack of multi-queue support in NBD
                
                Every I/O is funneled through a single sockpair up to user space. That means
                there is locking going on. I believe this is just a limitation of NBD today - it
                doesn't plug into the block-mq stuff in the kernel and expose multiple
                sockpairs. But someone more knowledgeable on the kernel stack would need to take
                a look.
                
                Thanks,
                Ben
                
                > 
                > Couple of things that I am not really sure in this flow is :- 1. How memory
                > registration is going to work with RDMA driver.
                > 2. What changes are required in spdk memory management
                > 
                > Thanks
                > Rishabh Mittal
                
            
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-16  1:26 Harris, James R
  0 siblings, 0 replies; 32+ messages in thread
From: Harris, James R @ 2019-08-16  1:26 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3490 bytes --]



On 8/15/19, 4:34 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

    Hi Jim
    
    What tool you use to take profiling. 

Hi Rishabh,

Mostly I just use "perf top".

-Jim

    
    Thanks
    Rishabh Mittal
    
    On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
    
        
        
        On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
        
        <trim>
            
            When an I/O is performed in the process initiating the I/O to a file, the data
            goes into the OS page cache buffers at a layer far above the bio stack
            (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
            your kernel driver, your kernel driver would still need to copy it to that
            location out of the page cache buffers. We can't safely share the page cache
            buffers with a user space process.
           
        I think Rishabh was suggesting the SPDK reserve the virtual address space only.
        Then the kernel could map the page cache buffers into that virtual address space.
        That would not require a data copy, but would require the mapping operations.
        
        I think the profiling data would be really helpful - to quantify how much of the 50us
        Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
        the SPDK NBD module.
        
        Thanks,
        
        -Jim
        
        
            As Paul said, I'm skeptical that the memcpy is significant in the overall
            performance you're measuring. I encourage you to go look at some profiling data
            and confirm that the memcpy is really showing up. I suspect the overhead is
            instead primarily in these spots:
            
            1) Dynamic buffer allocation in the SPDK NBD backend.
            
            As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
            The NBD backend wasn't designed to be fast - it was designed to be simple.
            Pooling would be a lot faster and is something fairly easy to implement.
            
            2) The way SPDK does the syscalls when it implements the NBD backend.
            
            Again, the code was designed to be simple, not high performance. It simply calls
            read() and write() on the socket for each command. There are much higher
            performance ways of doing this, they're just more complex to implement.
            
            3) The lack of multi-queue support in NBD
            
            Every I/O is funneled through a single sockpair up to user space. That means
            there is locking going on. I believe this is just a limitation of NBD today - it
            doesn't plug into the block-mq stuff in the kernel and expose multiple
            sockpairs. But someone more knowledgeable on the kernel stack would need to take
            a look.
            
            Thanks,
            Ben
            
            > 
            > Couple of things that I am not really sure in this flow is :- 1. How memory
            > registration is going to work with RDMA driver.
            > 2. What changes are required in spdk memory management
            > 
            > Thanks
            > Rishabh Mittal
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-15 23:34 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-15 23:34 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3053 bytes --]

Hi Jim

What tool do you use for profiling?

Thanks
Rishabh Mittal

On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    
    
    On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
    
    <trim>
        
        When an I/O is performed in the process initiating the I/O to a file, the data
        goes into the OS page cache buffers at a layer far above the bio stack
        (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
        your kernel driver, your kernel driver would still need to copy it to that
        location out of the page cache buffers. We can't safely share the page cache
        buffers with a user space process.
       
    I think Rishabh was suggesting the SPDK reserve the virtual address space only.
    Then the kernel could map the page cache buffers into that virtual address space.
    That would not require a data copy, but would require the mapping operations.
    
    I think the profiling data would be really helpful - to quantify how much of the 50us
    Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
    the SPDK NBD module.
    
    Thanks,
    
    -Jim
    
    
        As Paul said, I'm skeptical that the memcpy is significant in the overall
        performance you're measuring. I encourage you to go look at some profiling data
        and confirm that the memcpy is really showing up. I suspect the overhead is
        instead primarily in these spots:
        
        1) Dynamic buffer allocation in the SPDK NBD backend.
        
        As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
        The NBD backend wasn't designed to be fast - it was designed to be simple.
        Pooling would be a lot faster and is something fairly easy to implement.
        
        2) The way SPDK does the syscalls when it implements the NBD backend.
        
        Again, the code was designed to be simple, not high performance. It simply calls
        read() and write() on the socket for each command. There are much higher
        performance ways of doing this, they're just more complex to implement.
        
        3) The lack of multi-queue support in NBD
        
        Every I/O is funneled through a single sockpair up to user space. That means
        there is locking going on. I believe this is just a limitation of NBD today - it
        doesn't plug into the block-mq stuff in the kernel and expose multiple
        sockpairs. But someone more knowledgeable on the kernel stack would need to take
        a look.
        
        Thanks,
        Ben
        
        > 
        > Couple of things that I am not really sure in this flow is :- 1. How memory
        > registration is going to work with RDMA driver.
        > 2. What changes are required in spdk memory management
        > 
        > Thanks
        > Rishabh Mittal
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-14 17:55 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-14 17:55 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3282 bytes --]

That’s right.  I am thinking of using the kernel function remap_page_range only for the buffers that are currently in use. I don't think there will be much cost in mapping the physical addresses to virtual addresses.

Xiaoxi,

What data size are you using in your testing?


Thanks
Rishabh Mittal

On 8/14/19, 9:54 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    
    
    On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
    
    <trim>
        
        When an I/O is performed in the process initiating the I/O to a file, the data
        goes into the OS page cache buffers at a layer far above the bio stack
        (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
        your kernel driver, your kernel driver would still need to copy it to that
        location out of the page cache buffers. We can't safely share the page cache
        buffers with a user space process.
       
    I think Rishabh was suggesting the SPDK reserve the virtual address space only.
    Then the kernel could map the page cache buffers into that virtual address space.
    That would not require a data copy, but would require the mapping operations.
    
    I think the profiling data would be really helpful - to quantify how much of the 50us
    Is due to copying the 4KB of data.  That can help drive next steps on how to optimize
    the SPDK NBD module.
    
    Thanks,
    
    -Jim
    
    
        As Paul said, I'm skeptical that the memcpy is significant in the overall
        performance you're measuring. I encourage you to go look at some profiling data
        and confirm that the memcpy is really showing up. I suspect the overhead is
        instead primarily in these spots:
        
        1) Dynamic buffer allocation in the SPDK NBD backend.
        
        As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
        The NBD backend wasn't designed to be fast - it was designed to be simple.
        Pooling would be a lot faster and is something fairly easy to implement.
        
        2) The way SPDK does the syscalls when it implements the NBD backend.
        
        Again, the code was designed to be simple, not high performance. It simply calls
        read() and write() on the socket for each command. There are much higher
        performance ways of doing this, they're just more complex to implement.
        
        3) The lack of multi-queue support in NBD
        
        Every I/O is funneled through a single sockpair up to user space. That means
        there is locking going on. I believe this is just a limitation of NBD today - it
        doesn't plug into the block-mq stuff in the kernel and expose multiple
        sockpairs. But someone more knowledgeable on the kernel stack would need to take
        a look.
        
        Thanks,
        Ben
        
        > 
        > Couple of things that I am not really sure in this flow is :- 1. How memory
        > registration is going to work with RDMA driver.
        > 2. What changes are required in spdk memory management
        > 
        > Thanks
        > Rishabh Mittal
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-14 17:05 Kadayam, Hari
  0 siblings, 0 replies; 32+ messages in thread
From: Kadayam, Hari @ 2019-08-14 17:05 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4516 bytes --]

Hi Ben,

I agree we need to profile this and improve wherever we are seeing bottlenecks. The possible improvements you suggested are certainly very useful to look into and a good place to start. Having said that, for large writes won't the memcpy surely add latency and cost more CPU? 

Regarding your comment:
> We can't safely share the page cache buffers with a user space process.

The thought process here is that the driver, in the SPDK thread context, does a remap using something like phys_to_virt() or mmaps the pages, which means the page cache buffer(s) can be accessed from the user space process. Of course, we have concerns too regarding the safety of a user space process accessing the page cache. 

Regards,
Hari

On 8/14/19, 9:19 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

    On Wed, 2019-08-14 at 14:28 +0000, Luse, Paul E wrote:
    > So I think there's still a feeling amongst most involved in the discussion
    > that eliminating the memcpy is likely not worth it, especially without
    > profiling data to prove it.  Ben and I were talking about some other much
    > simpler things that might be worth experimenting with).  One example would be
    > in spdk_nbd_io_recv_internal(), look at how spdk_malloc(), is called for every
    > IO/  Creating a pre-allocated pool and pulling from there would be a quick
    > change and may yield some positive results. Again though, profiling will
    > actually tell you where the most time is being spent and where the best bang
    > for your buck is in terms of making changes.
    > 
    > Thx
    > Paul
    > 
    > From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
    > 
    > Back end device is malloc0 which is a memory device running in the “vhost”
    > application address space.  It is not over NVMe-oF.
    > 
    > I guess that bio pages are already pinned because same buffers are sent to
    > lower layers to do DMA.  Lets say we have written a lightweight ebay block
    > driver in kernel. This would be the flow
    > 
    > 1.  SPDK reserve the virtual space and pass it to ebay block driver to do
    > mmap. This step happens once during startup. 
    > 2.  For every IO, ebay block driver map buffers to virtual memory and pass a
    > IO information to SPDK through shared queues.
    > 3.  SPDK read it from the shared queue and pass the same virtual address to do
    > RDMA.
    
    When an I/O is performed in the process initiating the I/O to a file, the data
    goes into the OS page cache buffers at a layer far above the bio stack
    (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
    your kernel driver, your kernel driver would still need to copy it to that
    location out of the page cache buffers. We can't safely share the page cache
    buffers with a user space process.
    
    As Paul said, I'm skeptical that the memcpy is significant in the overall
    performance you're measuring. I encourage you to go look at some profiling data
    and confirm that the memcpy is really showing up. I suspect the overhead is
    instead primarily in these spots:
    
    1) Dynamic buffer allocation in the SPDK NBD backend.
    
    As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
    The NBD backend wasn't designed to be fast - it was designed to be simple.
    Pooling would be a lot faster and is something fairly easy to implement.
    
    2) The way SPDK does the syscalls when it implements the NBD backend.
    
    Again, the code was designed to be simple, not high performance. It simply calls
    read() and write() on the socket for each command. There are much higher
    performance ways of doing this, they're just more complex to implement.
    
    3) The lack of multi-queue support in NBD
    
    Every I/O is funneled through a single sockpair up to user space. That means
    there is locking going on. I believe this is just a limitation of NBD today - it
    doesn't plug into the block-mq stuff in the kernel and expose multiple
    sockpairs. But someone more knowledgeable on the kernel stack would need to take
    a look.
    
    Thanks,
    Ben
    
    > 
    > Couple of things that I am not really sure in this flow is :- 1. How memory
    > registration is going to work with RDMA driver.
    > 2. What changes are required in spdk memory management
    > 
    > Thanks
    > Rishabh Mittal
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-14 16:54 Harris, James R
  0 siblings, 0 replies; 32+ messages in thread
From: Harris, James R @ 2019-08-14 16:54 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2638 bytes --]



On 8/14/19, 9:18 AM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

<trim>
    
    When an I/O is performed in the process initiating the I/O to a file, the data
    goes into the OS page cache buffers at a layer far above the bio stack
    (somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
    your kernel driver, your kernel driver would still need to copy it to that
    location out of the page cache buffers. We can't safely share the page cache
    buffers with a user space process.
   
I think Rishabh was suggesting the SPDK reserve the virtual address space only.
Then the kernel could map the page cache buffers into that virtual address space.
That would not require a data copy, but would require the mapping operations.
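
For illustration, the user-space half of that idea is cheap on its own: a process can pre-reserve a large virtual address range with no physical backing using plain POSIX mmap(). This is only a minimal sketch of the reservation step; the kernel-side work of mapping page cache pages into that range is not shown and would need a custom driver.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    /* Reserve 1 GiB of virtual address space with no physical pages behind
     * it. PROT_NONE + MAP_NORESERVE means nothing is allocated or charged;
     * a kernel driver could later map buffers into this range. */
    size_t len = 1ULL << 30;
    void *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    if (base == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }
    printf("reserved %zu bytes at %p\n", len, base);
    munmap(base, len);
    return 0;
}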

I think the profiling data would be really helpful - to quantify how much of the 50us
is due to copying the 4KB of data.  That can help drive next steps on how to optimize
the SPDK NBD module.

Thanks,

-Jim


    As Paul said, I'm skeptical that the memcpy is significant in the overall
    performance you're measuring. I encourage you to go look at some profiling data
    and confirm that the memcpy is really showing up. I suspect the overhead is
    instead primarily in these spots:
    
    1) Dynamic buffer allocation in the SPDK NBD backend.
    
    As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
    The NBD backend wasn't designed to be fast - it was designed to be simple.
    Pooling would be a lot faster and is something fairly easy to implement.
    
    2) The way SPDK does the syscalls when it implements the NBD backend.
    
    Again, the code was designed to be simple, not high performance. It simply calls
    read() and write() on the socket for each command. There are much higher
    performance ways of doing this, they're just more complex to implement.
    
    3) The lack of multi-queue support in NBD
    
    Every I/O is funneled through a single sockpair up to user space. That means
    there is locking going on. I believe this is just a limitation of NBD today - it
    doesn't plug into the block-mq stuff in the kernel and expose multiple
    sockpairs. But someone more knowledgeable on the kernel stack would need to take
    a look.
    
    Thanks,
    Ben
    
    > 
    > Couple of things that I am not really sure in this flow is :- 1. How memory
    > registration is going to work with RDMA driver.
    > 2. What changes are required in spdk memory management
    > 
    > Thanks
    > Rishabh Mittal
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-14 16:18 Walker, Benjamin
  0 siblings, 0 replies; 32+ messages in thread
From: Walker, Benjamin @ 2019-08-14 16:18 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3403 bytes --]

On Wed, 2019-08-14 at 14:28 +0000, Luse, Paul E wrote:
> So I think there's still a feeling amongst most involved in the discussion
> that eliminating the memcpy is likely not worth it, especially without
> profiling data to prove it.  Ben and I were talking about some other much
> simpler things that might be worth experimenting with).  One example would be
> in spdk_nbd_io_recv_internal(), look at how spdk_malloc(), is called for every
> IO/  Creating a pre-allocated pool and pulling from there would be a quick
> change and may yield some positive results. Again though, profiling will
> actually tell you where the most time is being spent and where the best bang
> for your buck is in terms of making changes.
> 
> Thx
> Paul
> 
> From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
> 
> Back end device is malloc0 which is a memory device running in the “vhost”
> application address space.  It is not over NVMe-oF.
> 
> I guess that bio pages are already pinned because same buffers are sent to
> lower layers to do DMA.  Lets say we have written a lightweight ebay block
> driver in kernel. This would be the flow
> 
> 1.  SPDK reserve the virtual space and pass it to ebay block driver to do
> mmap. This step happens once during startup. 
> 2.  For every IO, ebay block driver map buffers to virtual memory and pass a
> IO information to SPDK through shared queues.
> 3.  SPDK read it from the shared queue and pass the same virtual address to do
> RDMA.

When an I/O is performed in the process initiating the I/O to a file, the data
goes into the OS page cache buffers at a layer far above the bio stack
(somewhere up in VFS). If SPDK were to reserve some memory and hand it off to
your kernel driver, your kernel driver would still need to copy it to that
location out of the page cache buffers. We can't safely share the page cache
buffers with a user space process.

As Paul said, I'm skeptical that the memcpy is significant in the overall
performance you're measuring. I encourage you to go look at some profiling data
and confirm that the memcpy is really showing up. I suspect the overhead is
instead primarily in these spots:

1) Dynamic buffer allocation in the SPDK NBD backend.

As Paul indicated, the NBD target is dynamically allocating memory for each I/O.
The NBD backend wasn't designed to be fast - it was designed to be simple.
Pooling would be a lot faster and is something fairly easy to implement.

2) The way SPDK does the syscalls when it implements the NBD backend.

Again, the code was designed to be simple, not high performance. It simply calls
read() and write() on the socket for each command. There are much higher
performance ways of doing this, they're just more complex to implement (one such
approach is sketched below, after item 3).

3) The lack of multi-queue support in NBD

Every I/O is funneled through a single sockpair up to user space. That means
there is locking going on. I believe this is just a limitation of NBD today - it
doesn't plug into the block-mq stuff in the kernel and expose multiple
sockpairs. But someone more knowledgeable on the kernel stack would need to take
a look.
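
On item 2, one higher performance approach - sketched here only as an illustration, not what the SPDK nbd module actually does - is to send each reply header and its payload with a single writev() instead of two separate write() calls. The nbd_io struct below is a hypothetical stand-in for the backend's per-command state.

#include <sys/types.h>
#include <sys/uio.h>
#include <linux/nbd.h>

/* Hypothetical per-command state kept by an NBD backend. */
struct nbd_io {
    struct nbd_reply resp;         /* reply header for this command */
    void            *payload;      /* data to return for reads, NULL otherwise */
    size_t           payload_len;
};

/* Coalesce the reply header and payload into one syscall. */
static ssize_t nbd_send_reply(int sock, struct nbd_io *io)
{
    struct iovec iov[2];
    int iovcnt = 1;

    iov[0].iov_base = &io->resp;
    iov[0].iov_len  = sizeof(io->resp);
    if (io->payload != NULL && io->payload_len > 0) {
        iov[1].iov_base = io->payload;
        iov[1].iov_len  = io->payload_len;
        iovcnt = 2;
    }
    /* A real implementation must also handle short writes and EAGAIN on a
     * non-blocking socket; that is omitted to keep the sketch small. */
    return writev(sock, iov, iovcnt);
}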

Thanks,
Ben

> 
> Couple of things that I am not really sure in this flow is :- 1. How memory
> registration is going to work with RDMA driver.
> 2. What changes are required in spdk memory management
> 
> Thanks
> Rishabh Mittal

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-14 14:28 Luse, Paul E
  0 siblings, 0 replies; 32+ messages in thread
From: Luse, Paul E @ 2019-08-14 14:28 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10011 bytes --]

So I think there's still a feeling amongst most involved in the discussion that eliminating the memcpy is likely not worth it, especially without profiling data to prove it.  Ben and I were talking about some other much simpler things that might be worth experimenting with.  One example would be in spdk_nbd_io_recv_internal(): look at how spdk_malloc() is called for every IO.  Creating a pre-allocated pool and pulling from there would be a quick change and may yield some positive results. Again though, profiling will actually tell you where the most time is being spent and where the best bang for your buck is in terms of making changes.

Thx
Paul

-----Original Message-----
From: Mittal, Rishabh [mailto:rimittal(a)ebay.com] 
Sent: Tuesday, August 13, 2019 3:09 PM
To: Harris, James R <james.r.harris(a)intel.com>; Storage Performance Development Kit <spdk(a)lists.01.org>; Luse, Paul E <paul.e.luse(a)intel.com>
Cc: Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>; Kadayam, Hari <hkadayam(a)ebay.com>
Subject: Re: [SPDK] NBD with SPDK

Back end device is malloc0 which is a memory device running in the “vhost” application address space.  It is not over NVMe-oF.

I guess that bio pages are already pinned because same buffers are sent to lower layers to do DMA.  Lets say we have written a lightweight ebay block driver in kernel. This would be the flow

1.  SPDK reserve the virtual space and pass it to ebay block driver to do mmap. This step happens once during startup. 
2.  For every IO, ebay block driver map buffers to virtual memory and pass a IO information to SPDK through shared queues.
3.  SPDK read it from the shared queue and pass the same virtual address to do RDMA.

Couple of things that I am not really sure in this flow is :- 1. How memory registration is going to work with RDMA driver.
2. What changes are required in spdk memory management

Thanks
Rishabh Mittal

On 8/13/19, 2:45 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    Hi Rishabh,
    
    The idea is technically feasible, but I think you would find the cost of pinning the pages plus mapping them into the SPDK process would far exceed the cost of the kernel/user copy.
    
    From your original e-mail - could you clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?
    
    Thanks,
    
    -Jim
    
    
    On 8/13/19, 12:55 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    
        I don't have any profiling data. I am not really worried about system calls because I think we could find a way to optimize it. I am really worried about bcopy. How can we avoid bcopying from kernel to user space.
        
        Other idea we have is to map the physical address of a buffer in bio to spdk virtual memory. We have to modify nbd driver or write a new light weight driver for this.  Do you think is It something feasible to do in SPDK.
        
        
        Thanks
        Rishabh Mittal
        
        On 8/12/19, 11:42 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
        
            
            
            On 8/12/19, 11:20 AM, "SPDK on behalf of Harris, James R" <spdk-bounces(a)lists.01.org on behalf of james.r.harris(a)intel.com> wrote:
            
                
                
                On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:
                
                    <<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>
                    
                    I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
                    
                    SPDK won’t be running in VM.
                
                
                Hi Rishabh,
                
                SPDK relies on data buffers being mapped into the SPDK application's address space, and are passed as virtual addresses throughout the SPDK stack.  Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address.  Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses.  The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
                
                So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough.  Those physical address regions would also need to get dynamically mapped into the SPDK address space.
                
                Do you have any profiling data that shows the relative cost of the data copy v. the system calls themselves on your system?  There may be some optimization opportunities on the system calls to look at as well.
                
                Regards,
                
                -Jim
            
            Hi Rishabh,
            
            Could you also clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?
            
            Thanks,
            
            -Jim
            
                
                
                
                
                    From: "Luse, Paul E" <paul.e.luse(a)intel.com>
                    Date: Sunday, August 11, 2019 at 12:53 PM
                    To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
                    Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
                    Subject: RE: NBD with SPDK
                    
                    Hi Rishabh,
                    
                    Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?
                    
                    Thx
                    Paul
                    
                    From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
                    Sent: Saturday, August 10, 2019 6:09 PM
                    To: spdk(a)lists.01.org
                    Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
                    Subject: NBD with SPDK
                    
                    Hi,
                    
                    We are trying to use NBD and SPDK on client side.  Data path looks like this
                    
                    File System ----> NBD client ------>SPDK------->NVMEoF
                    
                    
                    Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.
                    
                    I think that there could be two ways to prevent data copy .
                    
                    
                      1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
                      2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.
                    
                    Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk
                    
                    Thanks
                    Rishabh Mittal
                    
                    _______________________________________________
                    SPDK mailing list
                    SPDK(a)lists.01.org
                    https://lists.01.org/mailman/listinfo/spdk
                    
                
                _______________________________________________
                SPDK mailing list
                SPDK(a)lists.01.org
                https://lists.01.org/mailman/listinfo/spdk
                
            
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-13 22:08 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-13 22:08 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 8930 bytes --]

The back-end device is malloc0, which is a memory device running in the “vhost” application address space.  It is not over NVMe-oF.

I guess that the bio pages are already pinned because the same buffers are sent to lower layers to do DMA.  Let's say we have written a lightweight ebay block driver in the kernel. This would be the flow:

1.  SPDK reserves the virtual space and passes it to the ebay block driver to mmap. This step happens once during startup.
2.  For every IO, the ebay block driver maps the buffers into that virtual memory and passes the IO information to SPDK through shared queues (a sketch of a possible queue entry is below).
3.  SPDK reads it from the shared queue and passes the same virtual address to do RDMA.

A couple of things that I am not really sure about in this flow are:
1. How memory registration is going to work with the RDMA driver.
2. What changes are required in spdk memory management.
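
Purely to make the shared-queue idea in step 2 concrete, a hypothetical queue entry might carry something like the following (every name and field here is invented for illustration):

#include <stdint.h>

/* Hypothetical descriptor the ebay block driver would place on the shared queue
 * for each IO; SPDK would consume it and submit the IO using 'vaddr'. */
struct ebay_shared_io {
	uint64_t vaddr;       /* where the driver mapped the bio pages inside SPDK's reserved range */
	uint64_t lba;         /* starting logical block */
	uint32_t num_blocks;  /* IO length in blocks */
	uint8_t  is_write;    /* 1 = write, 0 = read */
	uint64_t cookie;      /* opaque tag echoed back on completion */
};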

Thanks
Rishabh Mittal

On 8/13/19, 2:45 PM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    Hi Rishabh,
    
    The idea is technically feasible, but I think you would find the cost of pinning the pages plus mapping them into the SPDK process would far exceed the cost of the kernel/user copy.
    
    From your original e-mail - could you clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?
    
    Thanks,
    
    -Jim
    
    
    On 8/13/19, 12:55 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:
    
        I don't have any profiling data. I am not really worried about system calls because I think we could find a way to optimize it. I am really worried about bcopy. How can we avoid bcopying from kernel to user space.
        
        Other idea we have is to map the physical address of a buffer in bio to spdk virtual memory. We have to modify nbd driver or write a new light weight driver for this.  Do you think is It something feasible to do in SPDK.
        
        
        Thanks
        Rishabh Mittal
        
        On 8/12/19, 11:42 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
        
            
            
            On 8/12/19, 11:20 AM, "SPDK on behalf of Harris, James R" <spdk-bounces(a)lists.01.org on behalf of james.r.harris(a)intel.com> wrote:
            
                
                
                On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:
                
                    <<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>
                    
                    I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
                    
                    SPDK won’t be running in VM.
                
                
                Hi Rishabh,
                
                SPDK relies on data buffers being mapped into the SPDK application's address space, and are passed as virtual addresses throughout the SPDK stack.  Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address.  Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses.  The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
                
                So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough.  Those physical address regions would also need to get dynamically mapped into the SPDK address space.
                
                Do you have any profiling data that shows the relative cost of the data copy v. the system calls themselves on your system?  There may be some optimization opportunities on the system calls to look at as well.
                
                Regards,
                
                -Jim
            
            Hi Rishabh,
            
            Could you also clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?
            
            Thanks,
            
            -Jim
            
                
                
                
                
                    From: "Luse, Paul E" <paul.e.luse(a)intel.com>
                    Date: Sunday, August 11, 2019 at 12:53 PM
                    To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
                    Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
                    Subject: RE: NBD with SPDK
                    
                    Hi Rishabh,
                    
                    Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?
                    
                    Thx
                    Paul
                    
                    From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
                    Sent: Saturday, August 10, 2019 6:09 PM
                    To: spdk(a)lists.01.org
                    Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
                    Subject: NBD with SPDK
                    
                    Hi,
                    
                    We are trying to use NBD and SPDK on client side.  Data path looks like this
                    
                    File System ----> NBD client ------>SPDK------->NVMEoF
                    
                    
                    Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.
                    
                    I think that there could be two ways to prevent data copy .
                    
                    
                      1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
                      2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.
                    
                    Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk
                    
                    Thanks
                    Rishabh Mittal
                    
                    _______________________________________________
                    SPDK mailing list
                    SPDK(a)lists.01.org
                    https://lists.01.org/mailman/listinfo/spdk
                    
                
                _______________________________________________
                SPDK mailing list
                SPDK(a)lists.01.org
                https://lists.01.org/mailman/listinfo/spdk
                
            
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-13 21:45 Harris, James R
  0 siblings, 0 replies; 32+ messages in thread
From: Harris, James R @ 2019-08-13 21:45 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7513 bytes --]

Hi Rishabh,

The idea is technically feasible, but I think you would find the cost of pinning the pages plus mapping them into the SPDK process would far exceed the cost of the kernel/user copy.

From your original e-mail - could you clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?

Thanks,

-Jim


On 8/13/19, 12:55 PM, "Mittal, Rishabh" <rimittal(a)ebay.com> wrote:

    I don't have any profiling data. I am not really worried about system calls because I think we could find a way to optimize it. I am really worried about bcopy. How can we avoid bcopying from kernel to user space.
    
    Other idea we have is to map the physical address of a buffer in bio to spdk virtual memory. We have to modify nbd driver or write a new light weight driver for this.  Do you think is It something feasible to do in SPDK.
    
    
    Thanks
    Rishabh Mittal
    
    On 8/12/19, 11:42 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
    
        
        
        On 8/12/19, 11:20 AM, "SPDK on behalf of Harris, James R" <spdk-bounces(a)lists.01.org on behalf of james.r.harris(a)intel.com> wrote:
        
            
            
            On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:
            
                <<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>
                
                I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
                
                SPDK won’t be running in VM.
            
            
            Hi Rishabh,
            
            SPDK relies on data buffers being mapped into the SPDK application's address space, and are passed as virtual addresses throughout the SPDK stack.  Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address.  Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses.  The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
            
            So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough.  Those physical address regions would also need to get dynamically mapped into the SPDK address space.
            
            Do you have any profiling data that shows the relative cost of the data copy v. the system calls themselves on your system?  There may be some optimization opportunities on the system calls to look at as well.
            
            Regards,
            
            -Jim
        
        Hi Rishabh,
        
        Could you also clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?
        
        Thanks,
        
        -Jim
        
            
            
            
            
                From: "Luse, Paul E" <paul.e.luse(a)intel.com>
                Date: Sunday, August 11, 2019 at 12:53 PM
                To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
                Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
                Subject: RE: NBD with SPDK
                
                Hi Rishabh,
                
                Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?
                
                Thx
                Paul
                
                From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
                Sent: Saturday, August 10, 2019 6:09 PM
                To: spdk(a)lists.01.org
                Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
                Subject: NBD with SPDK
                
                Hi,
                
                We are trying to use NBD and SPDK on client side.  Data path looks like this
                
                File System ----> NBD client ------>SPDK------->NVMEoF
                
                
                Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.
                
                I think that there could be two ways to prevent data copy .
                
                
                  1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
                  2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.
                
                Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk
                
                Thanks
                Rishabh Mittal
                
                _______________________________________________
                SPDK mailing list
                SPDK(a)lists.01.org
                https://lists.01.org/mailman/listinfo/spdk
                
            
            _______________________________________________
            SPDK mailing list
            SPDK(a)lists.01.org
            https://lists.01.org/mailman/listinfo/spdk
            
        
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-13 19:55 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-13 19:55 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 6594 bytes --]

I don't have any profiling data. I am not really worried about the system calls because I think we could find a way to optimize them. I am really worried about the bcopy. How can we avoid bcopying from kernel to user space?

The other idea we have is to map the physical address of a buffer in the bio to spdk virtual memory. We would have to modify the nbd driver or write a new lightweight driver for this.  Do you think that is feasible to do in SPDK?


Thanks
Rishabh Mittal

On 8/12/19, 11:42 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    
    
    On 8/12/19, 11:20 AM, "SPDK on behalf of Harris, James R" <spdk-bounces(a)lists.01.org on behalf of james.r.harris(a)intel.com> wrote:
    
        
        
        On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:
        
            <<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>
            
            I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
            
            SPDK won’t be running in VM.
        
        
        Hi Rishabh,
        
        SPDK relies on data buffers being mapped into the SPDK application's address space, and are passed as virtual addresses throughout the SPDK stack.  Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address.  Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses.  The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
        
        So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough.  Those physical address regions would also need to get dynamically mapped into the SPDK address space.
        
        Do you have any profiling data that shows the relative cost of the data copy v. the system calls themselves on your system?  There may be some optimization opportunities on the system calls to look at as well.
        
        Regards,
        
        -Jim
    
    Hi Rishabh,
    
    Could you also clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?
    
    Thanks,
    
    -Jim
    
        
        
        
        
            From: "Luse, Paul E" <paul.e.luse(a)intel.com>
            Date: Sunday, August 11, 2019 at 12:53 PM
            To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
            Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
            Subject: RE: NBD with SPDK
            
            Hi Rishabh,
            
            Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?
            
            Thx
            Paul
            
            From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
            Sent: Saturday, August 10, 2019 6:09 PM
            To: spdk(a)lists.01.org
            Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
            Subject: NBD with SPDK
            
            Hi,
            
            We are trying to use NBD and SPDK on client side.  Data path looks like this
            
            File System ----> NBD client ------>SPDK------->NVMEoF
            
            
            Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.
            
            I think that there could be two ways to prevent data copy .
            
            
              1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
              2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.
            
            Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk
            
            Thanks
            Rishabh Mittal
            
            _______________________________________________
            SPDK mailing list
            SPDK(a)lists.01.org
            https://lists.01.org/mailman/listinfo/spdk
            
        
        _______________________________________________
        SPDK mailing list
        SPDK(a)lists.01.org
        https://lists.01.org/mailman/listinfo/spdk
        
    
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-12 18:41 Harris, James R
  0 siblings, 0 replies; 32+ messages in thread
From: Harris, James R @ 2019-08-12 18:41 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5114 bytes --]



On 8/12/19, 11:20 AM, "SPDK on behalf of Harris, James R" <spdk-bounces(a)lists.01.org on behalf of james.r.harris(a)intel.com> wrote:

    
    
    On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:
    
        <<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>
        
        I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
        
        SPDK won’t be running in VM.
    
    
    Hi Rishabh,
    
    SPDK relies on data buffers being mapped into the SPDK application's address space, and are passed as virtual addresses throughout the SPDK stack.  Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address.  Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses.  The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
    
    So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough.  Those physical address regions would also need to get dynamically mapped into the SPDK address space.
    
    Do you have any profiling data that shows the relative cost of the data copy v. the system calls themselves on your system?  There may be some optimization opportunities on the system calls to look at as well.
    
    Regards,
    
    -Jim

Hi Rishabh,

Could you also clarify what the 50us is measuring?  For example, does this include the NVMe-oF round trip?  And if so, what is the backing device for the namespace on the target side?

Thanks,

-Jim

    
    
    
    
        From: "Luse, Paul E" <paul.e.luse(a)intel.com>
        Date: Sunday, August 11, 2019 at 12:53 PM
        To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
        Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
        Subject: RE: NBD with SPDK
        
        Hi Rishabh,
        
        Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?
        
        Thx
        Paul
        
        From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
        Sent: Saturday, August 10, 2019 6:09 PM
        To: spdk(a)lists.01.org
        Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
        Subject: NBD with SPDK
        
        Hi,
        
        We are trying to use NBD and SPDK on client side.  Data path looks like this
        
        File System ----> NBD client ------>SPDK------->NVMEoF
        
        
        Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.
        
        I think that there could be two ways to prevent data copy .
        
        
          1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
          2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.
        
        Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk
        
        Thanks
        Rishabh Mittal
        
        _______________________________________________
        SPDK mailing list
        SPDK(a)lists.01.org
        https://lists.01.org/mailman/listinfo/spdk
        
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-12 18:11 Harris, James R
  0 siblings, 0 replies; 32+ messages in thread
From: Harris, James R @ 2019-08-12 18:11 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4308 bytes --]



On 8/12/19, 9:20 AM, "SPDK on behalf of Mittal, Rishabh via SPDK" <spdk-bounces(a)lists.01.org on behalf of spdk(a)lists.01.org> wrote:

    <<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>
    
    I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.
    
    SPDK won’t be running in VM.


Hi Rishabh,

SPDK relies on data buffers being mapped into the SPDK application's address space, and are passed as virtual addresses throughout the SPDK stack.  Once the buffer reaches a module that requires a physical address (such as the NVMe driver for a PCIe-attached device), SPDK translates the virtual address to a physical address.  Note that the NVMe fabrics transports (RDMA and TCP) both deal with virtual addresses, not physical addresses.  The RDMA transport is built on top of ibverbs, where we register virtual address areas as memory regions for describing data transfers.
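As a small illustration of that path (a sketch only; spdk_vtophys()'s exact signature has changed across SPDK releases, and the two-argument form is assumed here):

#include <inttypes.h>
#include <stdio.h>
#include "spdk/env.h"

int
main(void)
{
	struct spdk_env_opts opts;
	void *buf;
	uint64_t size = 4096;
	uint64_t paddr;

	spdk_env_opts_init(&opts);
	if (spdk_env_init(&opts) < 0) {
		return 1;
	}

	/* Hugepage-backed, effectively pinned allocation. */
	buf = spdk_malloc(4096, 0x1000, NULL, SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);

	/* The translation happens only where a physical address is actually needed
	 * (e.g. the PCIe NVMe driver); the RDMA/TCP transports keep using 'buf'. */
	paddr = spdk_vtophys(buf, &size);
	printf("vaddr %p -> paddr 0x%" PRIx64 "\n", buf, paddr);

	spdk_free(buf);
	return 0;
}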

So for nbd, pinning the buffers and getting the physical address(es) to SPDK wouldn't be enough.  Those physical address regions would also need to get dynamically mapped into the SPDK address space.

Do you have any profiling data that shows the relative cost of the data copy v. the system calls themselves on your system?  There may be some optimization opportunities on the system calls to look at as well.

Regards,

-Jim




    From: "Luse, Paul E" <paul.e.luse(a)intel.com>
    Date: Sunday, August 11, 2019 at 12:53 PM
    To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
    Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
    Subject: RE: NBD with SPDK
    
    Hi Rishabh,
    
    Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?
    
    Thx
    Paul
    
    From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
    Sent: Saturday, August 10, 2019 6:09 PM
    To: spdk(a)lists.01.org
    Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
    Subject: NBD with SPDK
    
    Hi,
    
    We are trying to use NBD and SPDK on client side.  Data path looks like this
    
    File System ----> NBD client ------>SPDK------->NVMEoF
    
    
    Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.
    
    I think that there could be two ways to prevent data copy .
    
    
      1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
      2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.
    
    Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk
    
    Thanks
    Rishabh Mittal
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-11 23:33 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-11 23:33 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 3633 bytes --]

Looking at the structure, it seems that it is already pinned. Now the only question that remains is: what is the behavior if it doesn’t conform to the alignment requirements of NVMe?


struct bio_vec {
        /* pointer to the physical page on which this buffer resides */
        struct page     *bv_page;

        /* the length in bytes of this buffer */
        unsigned int    bv_len;

        /* the byte offset within the page where the buffer resides */
        unsigned int    bv_offset;
};
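
For what it's worth, a kernel-side sketch (hypothetical code, not from the nbd driver) of walking a bio and computing the physical address of each segment, which is what any alignment check would be applied to:

#include <linux/bio.h>
#include <linux/io.h>
#include <linux/kernel.h>

/* Walk every segment of a bio and report its physical address. The dword check
 * below is only an example; the real NVMe PRP/SGL rules are device-specific. */
static void
check_bio_alignment(struct bio *bio)
{
	struct bio_vec bvec;
	struct bvec_iter iter;

	bio_for_each_segment(bvec, bio, iter) {
		phys_addr_t paddr = page_to_phys(bvec.bv_page) + bvec.bv_offset;

		pr_info("segment: paddr=%pa len=%u dword_aligned=%d\n",
			&paddr, bvec.bv_len, (int)IS_ALIGNED(paddr, 4));
	}
}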


From: "Mittal, Rishabh" <rimittal(a)ebay.com>
Date: Sunday, August 11, 2019 at 3:51 PM
To: "Luse, Paul E" <paul.e.luse(a)intel.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
Subject: Re: NBD with SPDK

<<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>

I am thinking of passing the physical address of the buffers in bio to spdk.  I don’t know if it is already pinned by the kernel or do we need to explicitly do it. And also, spdk has some requirements on the alignment of physical address. I don’t know if address in bio conforms to those requirements.

SPDK won’t be running in VM.

From: "Luse, Paul E" <paul.e.luse(a)intel.com>
Date: Sunday, August 11, 2019 at 12:53 PM
To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
Subject: RE: NBD with SPDK

Hi Rishabh,

Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?

Thx
Paul

From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
Sent: Saturday, August 10, 2019 6:09 PM
To: spdk(a)lists.01.org
Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
Subject: NBD with SPDK

Hi,

We are trying to use NBD and SPDK on client side.  Data path looks like this

File System ----> NBD client ------>SPDK------->NVMEoF


Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.

I think that there could be two ways to prevent data copy .


  1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
  2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.

Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk

Thanks
Rishabh Mittal


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-11 22:51 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-11 22:51 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 2791 bytes --]

<<As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this <<way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?>>

I am thinking of passing the physical address of the buffers in the bio to spdk.  I don’t know if it is already pinned by the kernel or whether we need to pin it explicitly. Also, spdk has some requirements on the alignment of the physical address. I don’t know if the address in the bio conforms to those requirements.

SPDK won’t be running in VM.

From: "Luse, Paul E" <paul.e.luse(a)intel.com>
Date: Sunday, August 11, 2019 at 12:53 PM
To: "Mittal, Rishabh" <rimittal(a)ebay.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Cc: "Kadayam, Hari" <hkadayam(a)ebay.com>, "Chen, Xiaoxi" <xiaoxchen(a)ebay.com>, "Szmyd, Brian" <bszmyd(a)ebay.com>
Subject: RE: NBD with SPDK

Hi Rishabh,

Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?

Thx
Paul

From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
Sent: Saturday, August 10, 2019 6:09 PM
To: spdk(a)lists.01.org
Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
Subject: NBD with SPDK

Hi,

We are trying to use NBD and SPDK on client side.  Data path looks like this

File System ----> NBD client ------>SPDK------->NVMEoF


Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.

I think that there could be two ways to prevent data copy .


  1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
  2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.

Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk

Thanks
Rishabh Mittal


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [SPDK] NBD with SPDK
@ 2019-08-11 19:53 Luse, Paul E
  0 siblings, 0 replies; 32+ messages in thread
From: Luse, Paul E @ 2019-08-11 19:53 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 1781 bytes --]

Hi Rishabh,

Thanks for the question. I was talking to Jim and Ben about this a bit, one of them may want to elaborate but we’re thinking the cost of mmap and also making sure the memory is pinned is probably prohibitive. As I’m sure you’re aware, SPDK apps use spdk_alloc() with the SPDK_MALLOC_DMA which is backed by huge pages that are effectively pinned already. SPDK does virt to phy transition on memory allocated this way very efficiently using spdk_vtophys().  It would be an interesting experiment though.  Your app is not in a VM right?

Thx
Paul

From: Mittal, Rishabh [mailto:rimittal(a)ebay.com]
Sent: Saturday, August 10, 2019 6:09 PM
To: spdk(a)lists.01.org
Cc: Luse, Paul E <paul.e.luse(a)intel.com>; Kadayam, Hari <hkadayam(a)ebay.com>; Chen, Xiaoxi <xiaoxchen(a)ebay.com>; Szmyd, Brian <bszmyd(a)ebay.com>
Subject: NBD with SPDK

Hi,

We are trying to use NBD and SPDK on client side.  Data path looks like this

File System ----> NBD client ------>SPDK------->NVMEoF


Currently we are seeing a high latency in the order of 50 us by using this path. It seems like there is data buffer copy happening for write commands from kernel to user space when spdk nbd read data from the nbd socket.

I think that there could be two ways to prevent data copy .


  1.  Memory mapped the kernel buffers to spdk virtual space.  I am not sure if it is possible to mmap a buffer. And what is the impact to call mmap for each IO.
  2.  If NBD kernel give the physical address of a buffer and SPDK use that to DMA it to NVMEoF. I think spdk must also be changing a virtual address to physical address before sending it to nvmeof.

Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk

Thanks
Rishabh Mittal


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [SPDK] NBD with SPDK
@ 2019-08-11  1:08 Mittal, Rishabh
  0 siblings, 0 replies; 32+ messages in thread
From: Mittal, Rishabh @ 2019-08-11  1:08 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 914 bytes --]

Hi,

We are trying to use NBD and SPDK on the client side.  The data path looks like this:

File System ----> NBD client ------>SPDK------->NVMEoF


Currently we are seeing high latency, on the order of 50 us, by using this path. It seems like there is a data buffer copy happening for write commands from kernel to user space when spdk nbd reads data from the nbd socket.

I think that there could be two ways to prevent the data copy:


  1.  Memory map the kernel buffers into spdk virtual space.  I am not sure if it is possible to mmap a buffer, and what the impact of calling mmap for each IO would be.
  2.  Have the NBD kernel driver give the physical address of a buffer and let SPDK use that to DMA it to NVMe-oF. I think spdk must also be converting a virtual address to a physical address before sending it to nvmeof.

Option 2 makes more sense to me. Please let me know if option 2 is feasible in spdk.

Thanks
Rishabh Mittal


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2019-09-23  1:03 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-30 22:28 [SPDK] NBD with SPDK Mittal, Rishabh
  -- strict thread matches above, loose matches on Subject: below --
2019-09-23  1:03 Huang Zhiteng
2019-09-06 20:31 Kadayam, Hari
2019-09-06 17:13 Mittal, Rishabh
2019-09-06  2:14 Szmyd, Brian
2019-09-06  2:08 Huang Zhiteng
2019-09-05 22:00 Szmyd, Brian
2019-09-05 21:22 Walker, Benjamin
2019-09-05 20:11 Luse, Paul E
2019-09-04 23:27 Luse, Paul E
2019-09-04 23:03 Luse, Paul E
2019-09-04 18:08 Walker, Benjamin
2019-08-30 17:06 Walker, Benjamin
2019-08-30  1:05 Mittal, Rishabh
2019-08-19 14:41 Luse, Paul E
2019-08-16  1:50 Mittal, Rishabh
2019-08-16  1:26 Harris, James R
2019-08-15 23:34 Mittal, Rishabh
2019-08-14 17:55 Mittal, Rishabh
2019-08-14 17:05 Kadayam, Hari
2019-08-14 16:54 Harris, James R
2019-08-14 16:18 Walker, Benjamin
2019-08-14 14:28 Luse, Paul E
2019-08-13 22:08 Mittal, Rishabh
2019-08-13 21:45 Harris, James R
2019-08-13 19:55 Mittal, Rishabh
2019-08-12 18:41 Harris, James R
2019-08-12 18:11 Harris, James R
2019-08-11 23:33 Mittal, Rishabh
2019-08-11 22:51 Mittal, Rishabh
2019-08-11 19:53 Luse, Paul E
2019-08-11  1:08 Mittal, Rishabh
