All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-12  5:42 Harris, James R
  0 siblings, 0 replies; 7+ messages in thread
From: Harris, James R @ 2018-07-12  5:42 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 37777 bytes --]

Personally, I do not think they suck.  I think they are great.  (

Seriously though – this API is inconsistent with the rest of the bdev API.  Nowhere else does a bdev I/O submit function return a bdev_io and then require the caller to submit the bdev_io.

I also think it’s a good idea to differentiate between READ (where the caller provides the buffers) and ZCOPY (where the bdev module provides the buffers).

Earlier, you and Srikanth had expressed concern about the global bdev_io pool allocated inside of the bdev layer.  You wanted control over that, so a mempool doesn’t get allocated if it’s not going to be used.  We can modify the bdev layer to allow specifying an external allocator.  We may even require it, and then move the current bdev_io mempool out of bdev.c and into the application initialization code.

Based on the above, are they any functional reasons why the zcopy APIs wouldn’t work for the SPDK NVMe-oF target?

Thanks,

-Jim


On 7/11/18, 9:30 PM, "SPDK on behalf of Meneghini, John" <spdk-bounces(a)lists.01.org on behalf of John.Meneghini(a)netapp.com> wrote:

    I'm sorry Ben, but I think the spdk_bdev_zcopy_start() and spdk_bdev_zcopy_end() APIs suck, and I don't want to use them in our BDEV.
    
    Here is are some APIs that I think could possibly work: 
    
    struct spdk_bdev_io *spdk_bdev_dma_start(struct spdk_bdev_desc *desc,
        			                                      struct spdk_io_channel *ch,
        			                                      uint64_t offset,                     /* Make this bytes instead of blocks because we may need to DMA something other than blocks */
                                                                                     uint64_t length,                     /* length in bytes of whole transfer */
                                                                                     uint32_t block_size,              /* block size in bytes. When set to 1 we are transferring bytes.*/
        			                                      enum spdk_bdev_io_type.  /* Everything else I need to know... LIVE, COMMIT, POPULATE, etc. is known by the io_type);
    
    void spdk_bdev_dma_end(struct spdk_bdev_io *bdev_io);                         /* Rundown the bdev_io and clean up any iovec or dma resources */
    
    int spdk_bdev_io_submit(struct spdk_bdev_io *bdev_io);
    
    These APIs belong in the bdev->fn_table, and the flow through the transport would be:
    
    spdk_nvmf_ctrlr_process_io_cmd
          nvmf_bdev_ctrlr_write_cmd
                 bdev_io = spdk_bdev_dma_start()
                 spdk_bdev_io_submit(bdev_io)
    
    Etc.
    
    This is the direction Srikanth was heading and I'd like to understand why that direction does not work.
    
    /John
    
    
    On 7/11/18, 9:17 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:
    
        This is a great explanation Ben, thanks.  I'll start looking through the patches as well.  A few quick questions for additional context and to make sure I understand your terminology:
        
        * Can you expand on " The populate flag determines whether the "live" data is present in the given buffer" maybe with an example or something (what do you mean by "live", you use it a few times)
        * You mention " I believe I've addressed all known use cases with this design" but in this email chain, at least, only cover the NVMe-oF use case.  Can you list the known cases you mention and maybe just say a few words about them?
        
        Thanks
        Paul
        
        -----Original Message-----
        From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
        Sent: Wednesday, July 11, 2018 2:53 PM
        To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
        Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
        Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
        
        I've done a bunch of work here to formalize the zero copy API in the bdev layer.
        The patch series begins here:
        
        https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
        
        There has been some good discussion on the patches themselves, but I wanted to circle back to the mailing list for a bit. The basic idea is to add two new functions to the bdev layer:
        
        int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
        			  struct spdk_io_channel *ch,
        			  uint64_t offset_blocks, uint64_t num_blocks,
        			  bool populate,
        			  spdk_bdev_io_completion_cb cb, void *cb_arg); int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
        			spdk_bdev_io_completion_cb cb, void *cb_arg);
        
        The zcopy start call makes a request to the backing block device to provide a region of memory suitable for performing a zero copy update to the region described by the offset_blocks and num_blocks parameters. The populate flag determines whether the "live" data is present in the given buffer. When this request completes, it returns a bdev_io in the completion callback. That bdev_io will have a scatter gather list filled out describing the memory region.
        
        When the user has finished transferring data into or out of the data buffer, the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The commit flag indicates whether the data in the buffer needs to be made "live".
        
        The NVMe-oF target can take advantage of these new APIs by altering the state machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers after parsing the command, then performing RDMA transfers into or out of the provided region, and finally completing the command with spdk_bdev_zcopy_end.
        The work to do those modifications has not been done yet, and I expect them to be a significant amount of effort.
        
        As many bdev modules as possible should attempt to support zero copy semantics, but for those bdev modules that aren't able to, one of the patches in my series will automatically emulate the zcopy commands using regular reads and writes.
        This allows the higher level users of the bdev API to always make the zcopy calls for simplicity. The emulation path should be no worse, in terms of performance, than doing a regular read or write.
        
        I believe I've addressed all known use cases with this design, and while there will certainly be some future patches to fix problems that come up, I'm confident that we're on the right track here. If there is any feedback in this area at all, now is the time to provide it.
        
        I think this also leads into some really awesome future work that I'll suggest in the hope that someone in the community is inspired. One idea is that the bdev_nvme module could support zero copy semantics by returning a pointer to a portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for data. There aren't many devices for sale today that can do this, but it's in the spec and I imagine they're coming. This would allow the NVMe-oF target, for example, to perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the the persistent memory could be returned. The NVMe-oF target could then perform RDMA transfers into or out of the persistent memory directly, and the zcopy end call with commit set to true would perform the required flush instructions.
        
        Thanks,
        Ben
        
        
        On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
        > I would like to jot down what requirements we are looking at. Wherever 
        > there is "====wait" for something, there is a potential context switch.
        > 
        > For a write request, the following is the workflow for zero copy:
        > 1) Host submits Write request
        > 2) Protocol validation of request
        > 4) Get iovs
        > 5) R2T/RDMA_READ
        > ====wait for host====
        > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
        > 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module ====wait for 
        > bdev module to respond====
        > 7) on bdev_module response, send rsp/status to host
        > 8) free req resources
        > 
        > Read path:
        > 1) Host submits Read request
        > 2) Protocol validatation of request
        > 3) Submit read to bdev_module
        > ====wait for data blocks from bdev_module====
        > 4) respond iov (data_in/RDMA_WRITE)
        > ====wait for host====
        > 5) host ack/RDMA_WRITE_COMPLETE
        > 6) free req resources
        > 
        > In future we will have Persistent reservations where there can be 
        > read/modify/update cases depending on how the bdev module handles it. 
        > So the zero copy apis needs to address future requirements as well.
        > 
        > The above workflows suggest we require a state to track at what step 
        > the request is. A state will help to cleanup resources properly when 
        > handling aborts, disconnects, namespace destroy workflows etc.
        > So the requirement is to figure out a way to clean up resources 
        > according to the state of the bdev_io in its life cycle and stitch in 
        > the zero copy workflows into existing spdk code which assumes buffers are pre-allocated.
        > 
        > I think Srikanth's proposed changes are made in target layer to 
        > populate and commit to bdev module. However his suggestion is changing 
        > the bdev_io life cycle which I understand from John is not agreeable to others.
        > 
        > With the new zcopy apis also there is no clarity on the bdev_io life 
        > cycle. A bdev_io is acquired during zcopy_start and it's not clear how 
        > to deal with the bdev_io allocated in read/write paths.
        > 
        > The orchestration of zero copy either be done at transport layer, 
        > target layer or bdev module layer.
        > 
        > We don't like the orchestration done from transport layer because:
        > 1) No clear way to deal with bdev_io (acquired from zcopy) and bdev_io 
        > (acquired from read/write codepaths).
        > 2) If I understand correctly, by the time a read/write request is 
        > submitted to target, the data_out would have happened. So the protocol 
        > validation happens after data_out phase. For example an LBA out of 
        > range request or invalid SGL length is returned after a data_out is 
        > performed. We want to run through the protocol validation before data_out phase.
        > 
        > I feel the bdev module is the ideal place to deal with zero copy 
        > buffers. The bdev module knows how to get buffers and when to claim 
        > them. All it needs is a proper cleanup and by the time request hits 
        > bdev module, bdev_io is acquired so no need to change bdev_io life cycle or introduce a new io type.
        > 
        > I propose the following:
        > 1) Host submits Write request
        > 2) Protocol validation of request
        >    spdk_nvmf_request_exec
        > 4) submit request to bdev module
        >    spdk_bdev_writev_blocks(). bdev_io is acquired
        > 5) bdev module determines the buffers needs to be allocated.
        >    Will perform something like what is prposed in the patch:
        > spdk_bdev_zcopy_start(populate=false)
        > 6) New spdk_nvmf_request_exec_status
        > (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
        > 7) Transport does R2T/RDMA_READ
        > ====wait for host====
        > 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
        > 9) Perform Write again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks) 
        > to bdev_module.
        > ====wait for bdev module to respond====
        > 10) on bdev_module response, send rsp/status to host
        > 11) free req resources
        >     free bdev_io spdk_bdev_free_io.
        >     Here we need a hook to free resources allocated by bdev module
        >     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
        > 
        > Read path:
        > 1) Host submits Read request
        > 2) Protocol validatation of request
        > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module ====wait 
        > for data blocks from bdev_module====
        > 4) respond iov (data_in/RDMA_WRITE)
        > ====wait for host====
        > 5) host ack/RDMA_WRITE_COMPLETE
        > 6) free req resources
        >    free bdev_io spdk_bdev_free_io.
        >    Here we need a hook to free resources allocated by bdev module
        >    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
        > 
        > If it makes sense we can work on a patch.
        > 
        > Regards,
        > ~Sriram
        > 
        > -----Original Message-----
        > From: Walker, Benjamin <benjamin.walker(a)intel.com>
        > Sent: Monday, July 9, 2018 11:22 PM
        > To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram 
        > <Sriram.Popuri @netapp.com>; Kaligotla, Srikanth 
        > <Srikanth.Kaligotla(a)netapp.com>
        > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin 
        > <Ed.Rodriguez(a)netapp.com> ; spdk(a)lists.01.org; Pai, Madhu 
        > <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard- broadcom.com 
        > <john.barnard(a)broadcom.com>
        > Subject: Re: BDEV-IO Lifecycle - Need your input.
        > 
        > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
        > > Hi Jim.
        > > 
        > > This is the patch series I believe Ben is proposing to replace
        > > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
        > > 
        > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick
        > > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
        > > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf:
        > > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
        > > support when necessary
        > > 
        > > Is this correct?
        > 
        > 
        > My patch series doesn't entirely replace Srikanth's set of patches. 
        > There are two separate things necessary to implement zero copy. First 
        > is the infrastructure within the bdev layer to make requests of a bdev 
        > module to obtain  or release a suitable buffer for a zero copy 
        > operation. Second is actually making the NVMe-oF target use that 
        > mechanism, which includes extending the bdev_io lifetime. My patch series only addresses the first part.
        > 
        > Thanks,
        > Ben
        > 
        > 
        > 
        > > 
        > > /John
        > > 
        > > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
        > > 
        > >     Hi Sriram:
        > >     
        > >     Ben has some later patches in this series with some examples:
        > >     
        > >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements 
        > > this zero copy API in the simple ‘malloc’ bdev module
        > >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
        > > zero copy API in the SPDK bdevperf utility
        > >     
        > >     -Jim
        > >     
        > >     
        > >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
        > >     
        > >         Sorry was not focusing on this change. Just give me a day or 
        > > two to get back.
        > >         From a quick glance I didn't understand how zcopy start/end 
        > > fits into a read/write life cycle. Is there an example on how zcopy 
        > > start/end is consumed or someone can do give me a quick dump on how 
        > > its envisioned to be used.
        > >         
        > >         Regards,
        > >         ~Sriram
        > >         
        > >         -----Original Message-----
        > >         From: Meneghini, John 
        > >         Sent: Friday, July 6, 2018 10:16 PM
        > >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
        > > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
        > > <james.r.harris(a)intel.com>; Kaligotla, Srikanth 
        > > <Srikanth.Kaligotla(a)netapp.c
        > > om>
        > >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
        > > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
        > > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com 
        > > <john.barnard(a)br oadcom.com>; Popuri, Sriram <Sriram.
        > > Popuri(a)netapp.com>
        > >         Subject: Re: BDEV-IO Lifecycle - Need your input.
        > >         
        > >         I'm adding Sriram to this thread.  Sriram is another NetApp 
        > > engineer who is working on this stuff internally.
        > >         
        > >         /John
        > >         
        > >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
        > > <benjamin.walker(a)intel.com>
        > > wrote:
        > >         
        > >             Hi Srikanth,
        > >             
        > >             I wanted to check in to see if you had any additional 
        > > feedback on the zero copy
        > >             operations in the bdev layer here:
        > >             
        > >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
        > >             
        > >             This does not address extending the lifetime of the 
        > > bdev_io in the NVMe-oF
        > >             target, but I think this is looking like the right 
        > > mechanism for the bdev layer.
        > >             
        > >             Thanks,
        > >             Ben
        > >             
        > >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
        > >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
        > > sizing along with
        > >             > spdk_bdev_queue_io_wait() meets your needs then 
        > > regarding the spdk_bdev_io
        > >             > memory.
        > >             >  
        > >             > Regarding zcopy_start/zcopy_end – it looks like you’ve 
        > > already added a bunch
        > >             > of comments to Ben’s patch on GerritHub.  For now I’d 
        > > say let’s continue our
        > >             > discussion there.  I’ve responded to a couple of 
        > > similar questions there and
        > >             > I’m sure Ben will have more replies tomorrow.
        > >             >  
        > >             > -Jim
        > >             >  
        > >             >  
        > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
        > >             > Date: Thursday, June 21, 2018 at 5:01 PM
        > >             > To: James Harris <james.r.harris(a)intel.com>, "Walker, 
        > > Benjamin"
        > > <benjamin.walk
        > >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
        > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
        > > <raju.gottumukkala(a)broadcom .com>,
        > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
        > > "Rodriguez, Edwin" <Ed.Rodrigue
        > >             > z(a)netapp.com>, "Pai, Madhu" 
        > > <Madhusudan.Pai(a)netapp.com>,
        > > "NGC-
        > > john.barnard-
        > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
        > > "spdk(a)lists.01.org" < spdk(a)lists.01.
        > >             > org>
        > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
        > >             >  
        > >             > Hi Jim,
        > >             >  
        > >             > I wish I joined the community meeting, I also missed the
        > >             > spdk_bdev_queue_io_wait().
        > >             >  
        > >             > So, there are 2 issues I am intending to solve,
        > >             >  
        > >             > 1.       We want to ensure that an instance of bdev-io is
        > > acquired prior to or
        > >             > along with I/O data buffer. This allows for better 
        > > error handling when bdev_io
        > >             > pool is exhausted. Yes, we can solve it via sizing the 
        > > pool right. The changes
        > >             > queue-io-wait also addresses the problem.
        > >             > 2.       Most importantly, by extending the life of bdev-io
        > > (Same life span as
        > >             > nvmf_request structure), abort use case and other use 
        > > cases that involve
        > >             > cleaning up of IO data buffer are accomplished 
        > > effectively. Let me elaborate;
        > >             > bdev-io is the fabric that connects the nvme command 
        > > and operation with the
        > >             > backend. The driver context and data buffer context is 
        > > stored in bdev-io. The
        > >             > freeing up of bdev_io resource is pushed to the end so 
        > > the I/O cleanup can
        > >             > happen after controller has transmitted the data to 
        > > the host. In absence of
        > >             > bdev_io, we will have end up adding more and more void 
        > > context in request
        > >             > structure. Hence the push to extend the life of bdev_io.
        > >             >  
        > >             > I see Ben’s recent patch “zcopy_start” and 
        > > “zcopy_end”; We can make that work
        > >             > as long as the bdev-io allocated/acquired stays till the end.
        > > One of the
        > >             > challenges I see with that patch is defining a 
        > > callback for the I/O
        > >             > submission. For instance, zopy_start will allocate 
        > > bdev_io and submits the I/O
        > >             > to the bdev device. Underlying implementation can be 
        > > synchronous or can be
        > >             > asynchronous. The callback for this submission should 
        > > check to see if it is a
        > >             > BUFFER-READY or PENDING and accordingly relay it back 
        > > to transport. Next phase
        > >             > is actual I/O submission. Let’s say, it reuses the 
        > > bdev-io obtained in ZCOPY-
        > >             > START, now the callback should determine if it is a 
        > > success or failure.
        > >             > Finally when ZCOPY-END is invoked the supplied bdev_io 
        > > will have all necessary
        > >             > data to release the WRITE buffer or unlock the read 
        > > buffer based on the
        > >             > operation performed.
        > >             >  
        > >             > I hope I’m making sense. I guess, my effort is to 
        > > extend the lifespan of bdev-
        > >             > io and let it continue to host the driver context and 
        > > buffer context populated
        > >             > during the BEGIN phase.
        > >             >  
        > >             > Regards,
        > >             > Srikanth
        > >             >  
        > >             > From: Harris, James R <james.r.harris(a)intel.com> 
        > >             > Sent: Thursday, June 21, 2018 2:21 PM
        > >             > To: Kaligotla, Srikanth 
        > > <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
        > >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
        > > <daniel.verkamp(a)intel.c
        > > om>
        > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
        > > <John.Menegh ini(a)netapp.com
        > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
        > > Madhu <Madhu sudan.Pai(a)neta
        > >             > pp.com>; NGC-john.barnard-broadcom.com 
        > > <john.barnard(a)broadcom.co
        > > m>; spdk(a)lists
        > >             > .01.org
        > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
        > >             >  
        > >             > Hi Srikanth,
        > >             >  
        > >             > Following up on this thread and the discussion in 
        > > yesterday’s community
        > >             > meeting.
        > >             >  
        > >             > The recent spdk_bdev_queue_io_wait() changes allow an 
        > > SPDK application to work
        > >             > better in general when the spdk_bdev_io buffer pool is 
        > > exhausted.  We still
        > >             > need to make changes to the NVMe-oF target to use this 
        > > new API but that should
        > >             > be done soon-ish.
        > >             >  
        > >             > With that in place, the spdk_bdev_io buffer pool 
        > > itself can be configured up
        > >             > or down when the application starts.  Currently the 
        > > default is 64K
        > >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) ==
        > > 216 plus the per-IO
        > >             > context size allocated for the bdev module.  This can 
        > > be up to
        > > 192 bytes
        > >             > (virtio bdev module) but is likely much smaller for 
        > > you depending on the
        > >             > context size for your ontap bdev module.
        > >             >  
        > >             > Let’s assume your per-IO context size is 64 bytes.  
        > > 64K x (192 +
        > > 64) = 16MB.
        > >             >  
        > >             > I’m not sure how many spdk_bdev_io you need in flight 
        > > at any given time.  64K
        > >             > seems like a lot but I’d be curious to hear your 
        > > thoughts on this.  If this is
        > >             > the right number, then worst case, there would be 
        > > about 16MB of DRAM that
        > >             > would sit unused if the NVMe-oF target in your system 
        > > was not active.  Is that
        > >             > too burdensome for your application?
        > >             >  
        > >             > Thanks,
        > >             >  
        > >             > -Jim
        > >             >  
        > >             >  
        > >             >  
        > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
        > >             > Date: Wednesday, June 20, 2018 at 12:47 PM
        > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
        > > James Harris <james.r.harr
        > >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
        > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
        > > <raju.gottumukkala(a)broadcom .com>,
        > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
        > > "Rodriguez, Edwin" <Ed.Rodrigue
        > >             > z(a)netapp.com>, "Pai, Madhu" 
        > > <Madhusudan.Pai(a)netapp.com>,
        > > "NGC-
        > > john.barnard-
        > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
        > > "spdk(a)lists.01.org" < spdk(a)lists.01.
        > >             > org>
        > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
        > >             >  
        > >             > Hello,
        > >             >  
        > >             > First revision of changes to extend the lifecycle of 
        > > bdev_io are available for
        > >             > review. I would like to solicit your input on the 
        > > proposed API/Code flow
        > >             > changes.
        > >             >  
        > >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
        > >             >  
        > >             > Thanks,
        > >             > Srikanth
        > >             >  
        > >             >  
        > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
        > >             > Date: Friday, May 11, 2018 at 2:27 PM
        > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
        > > "Harris, James R" <james.r
        > >             > .harris(a)intel.com>
        > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
        > > <raju.gottumukkala(a)broadcom .com>,
        > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
        > > "Rodriguez, Edwin" <Ed.Rodrigue
        > >             > z(a)netapp.com>, "Pai, Madhu" 
        > > <Madhusudan.Pai(a)netapp.com>,
        > > "NGC-
        > > john.barnard-
        > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
        > > "spdk(a)lists.01.org" < spdk(a)lists.01.
        > >             > org>
        > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
        > >             >  
        > >             > CC: List
        > >             >  
        > >             > Hi Ben,
        > >             >  
        > >             > Your proposal to interface with the backend to acquire 
        > > and release buffers is
        > >             > good. You have accurately stated that the challenge is 
        > > in developing intuitive
        > >             > semantics. And that has been my struggle. To me, there 
        > > are two problem
        > >             > statements;
        > >             >  
        > >             > 1.       It is expected that bdev_io pool is sized correctly
        > > so
        > > the call to
        > >             > get bdev_io succeeds. Failure to acquire bdev_io will 
        > > result in DEVICE-ERROR.
        > >             > The transport is already capable of handling temporary 
        > > memory failures by
        > >             > moving the request to PENDING-Queue. Hence the 
        > > proposal to change the bdev_io
        > >             > lifecycle and perhaps connect it with the 
        > > spdk_nvmf_request object. Thus all
        > >             > buffer needs are addressed at the beginning of I/O request.
        > >             > 2.       I/O Data buffers are sourced and managed by the
        > > backend. One of the
        > >             > challenges I see with your proposed interface is the 
        > > lack of details like if
        > >             > resource is being acquired for a READ operation or 
        > > WRITE operation. The
        > >             > handling is quite different in each case. Since 
        > > bdev_io->type is overloaded
        > >             > the type of I/O operation is lost. I suppose one can 
        > > cast the cb_arg
        > >             > (nvmf_request) and then proceed. Also, the bdev_io 
        > > should be present to
        > >             > RELEASE the buffer. Zero copy semantics warrants that 
        > > data buffer stays until
        > >             > controller-to-host has occurred. In other words, 
        > > bdev_io lives till REQUEST
        > >             > come to COMPLETE state.
        > >             >  
        > >             > What are your thoughts in introducing spdk_bdev_init() 
        > > and
        > > spdk_bdev_fini() as
        > >             > an alternative approach to extend the lifecyle of 
        > > bdev_io and allow data
        > >             > buffer management via bdev fn_table ?
        > >             >  
        > >             > I hope I’m making sense…
        > >             >  
        > >             > Thanks,
        > >             > Srikanth
        > >             >  
        > >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
        > >             > Sent: Friday, May 11, 2018 12:28 PM
        > >             > To: Harris, James R <james.r.harris(a)intel.com>; 
        > > Kaligotla, Srikanth <Srikanth.
        > >             > Kaligotla(a)netapp.com>
        > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
        > > <John.Menegh ini(a)netapp.com
        > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
        > > Madhu <Madhu sudan.Pai(a)neta
        > >             > pp.com>; NGC-john.barnard-broadcom.com 
        > > <john.barnard(a)broadcom.co
        > > m>
        > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
        > >             >  
        > >             > Hi Srikanth,
        > >             >  
        > >             > Yes - we'll need to introduce some way to acquire and 
        > > release buffers from the
        > >             > bdev layer earlier on in the state machine that 
        > > processes an NVMe-oF request.
        > >             > I've had this patch out for review for several months 
        > > as a proposal for this
        > >             > scenario:
        > >             >  
        > >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
        > >             >  
        > >             > It doesn't pass the tests - it's just a proposal for 
        > > the interface. 90% of the
        > >             > challenge here is in developing intuitive semantics.
        > >             >  
        > >             > Thanks,
        > >             > Ben
        > >             >  
        > >             > P.S. This is the kind of discussion that would fit 
        > > perfectly on the mailing
        > >             > list.
        > >             >  
        > >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
        > >             > > Hi Ben, Hi James,
        > >             > >  
        > >             > > I would like to solicit opinions on the lifecycle 
        > > for bdev-io resource
        > >             > > object. Attached is the image of RDMA state machine 
        > > in its current
        > >             > > implementation. When the REQUEST enters the 
        > > NEED-BUFFER state, the buffers
        > >             > > necessary for carrying out I/O operation are 
        > > allocated/acquired from the
        > >             > > memory pool. An instance of BDEV-IO comes into 
        > > existence after REQUEST
        > >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is 
        > > teared down as soon the
        > >             > > backend returns. From a BDEV perspective, BDEV-IO is 
        > > simply a translation
        > >             > > unit that facilitates I/O buffers from the backend. 
        > > The driver context
        > >             > > embedded within bdev_io holds great deal of 
        > > information pertaining to the
        > >             > > I/O under execution. It assists in error handling, 
        > > dereferencing the buffers
        > >             > > upon I/O completion and in abort handling. In 
        > > summary, the bdev_io stays
        > >             > > alive until request has come to COMPLETE state. I’d 
        > > like to hear peoples
        > >             > > thoughts in introducing the plumbing to acquire 
        > > BDEV-IO resource in REQUEST-
        > >             > > NEED-BUFFER state and release it in REQUEST-COMPLETE 
        > > state.  I will shortly
        > >             > > have patch available for review that introduces 
        > > spdk_bdev_init and
        > >             > > spdk_bdev_fini which in turn invokes the 
        > > corresponding bdev fn_table to
        > >             > >  initialize/cleanup.  
        > >             > >  
        > >             > > I wanted to use this email to communicate our intent 
        > > and solicit your
        > >             > > feedback. We have a working implementation of the 
        > > above proposal and prior
        > >             > > to pushing it upstream for review would like to hear 
        > > your thoughts. These
        > >             > > proposed changes to upstream are a result of FC 
        > > transport work in
        > >             > > collaboration with the Broadcom team who are also 
        > > copied to this mail.
        > >             > > Myself and John will be at SPDK Dev conference and 
        > > if required we can
        > >             > > elaborate further on this proposal.
        > >             > >  
        > >             > > Thanks,
        > >             > > Srikanth
        > >             > 
        > >             >
        > >             
        > >         
        > >         
        > >     
        > >     
        > > 
        _______________________________________________
        SPDK mailing list
        SPDK(a)lists.01.org
        https://lists.01.org/mailman/listinfo/spdk
        
    
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-12 21:16 Walker, Benjamin
  0 siblings, 0 replies; 7+ messages in thread
From: Walker, Benjamin @ 2018-07-12 21:16 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 38625 bytes --]

On Thu, 2018-07-12 at 21:01 +0000, Luse, Paul E wrote:
> Awesome Ben, thanks. This is quite a bit more sophisticated and flexible than
> I think what generally comes to mind with "zero copy" in terms of naming this
> thing but I guess I can't think of a better one - "data direct" or something,
> I dunno. Killer feature though no matter what its called... 
> 
> I was thinking some more about an earlier suggestion from John about using
> bytes instead of blocks in the API, have you thought much about that?

Sure - the length of the buffer could be specified in either bytes or blocks, as
long as when using bytes the length is a block multiple. We have a similar
scenario with the regular bdev I/O calls (spdk_bdev_read is in bytes,
spdk_bdev_read_blocks is in blocks). We could do the same thing for the zcopy
API where there is a byte and a block variant.

If we're talking about performing I/O in non-block sizes, then no - that breaks
about a million different things. This is a block device abstraction layer, with
emphasis on the word block. Certainly, for some classes of devices some of the
device may be accessible at a byte granularity, and we could discuss ways to
expose that to the user in a generic way at some point down the road. Maybe a
strategy like what NVMe devices do with CMB/PMR would work. However, I think
it's much more likely that accessing byte-addressable memory should just use a
different access method (like libpmem), rather than a block stack. I consider
this discussion very separate from the zcopy API (because it would equally
affect regular read and write), so I've left it out here.

> 
> Thx
> Paul
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Thursday, July 12, 2018 1:44 PM
> To: John.Meneghini(a)netapp.com; spdk(a)lists.01.org; Sriram.Popuri(a)netapp.com; Sr
> ikanth.Kaligotla(a)netapp.com
> Cc: raju.gottumukkala(a)broadcom.com
> Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
> 
> On Thu, 2018-07-12 at 01:16 +0000, Luse, Paul E wrote:
> > This is a great explanation Ben, thanks.  I'll start looking through 
> > the patches as well.  A few quick questions for additional context and 
> > to make sure I understand your terminology:
> > 
> > * Can you expand on " The populate flag determines whether the "live" 
> > data is present in the given buffer" maybe with an example or 
> > something (what do you mean by "live", you use it a few times)
> 
> I struggled to find the words to explain this. When you do a zcopy_start, you
> are returned some region(s) of memory. If populate is false, that buffer
> doesn't contain any valid data. If populate is true, the buffer contains the
> data on the disk for the described region of blocks. Similarly with commit -
> if you memcpy into a buffer provided from zcopy_start, that doesn't
> necessarily become the data on a block device (i.e. visible from a separate
> read operation) until the data is committed. Obviously, block device modules
> may have wildly varying internal implementations. For example, the malloc bdev
> module just allocates a single region of memory representing the block device
> and returns pointers to it, so the populate and commit flags don't do anything
> at all. But the malloc bdev is a special case - the norm is that some portion
> of the disk will be cached in DRAM and the real data is stored on some
> persistent device. Obtaining a zero copy buffer may or may not have to page
> data into DRAM, and commit may or may not have to page data out.
> 
> > * You mention " I believe I've addressed all known use cases with this
> > design"
> > but in this email chain, at least, only cover the NVMe-oF use case.  
> > Can you list the known cases you mention and maybe just say a few words
> > about them?
> > 
> 
> I'm glad you ask because it needs to be explained, but I'm also not glad you
> asked because now I have to type out the giant explanation!
> 
> There's a matrix of different things that need to be supported. On the bdev
> module side, we want to support a number of different implementations.
> 
> 1) All data exists in addressable/DMA-able memory (equivalent to the malloc
> bdev
> module)
> 2) Some data exists in addressable/DMA-able memory (a caching bdev, or the
> nvme bdev module with a controller memory buffer).
> 
> Further, within those two choices, there are two other options:
> 
> 1) When updating the data in the buffer returned by zcopy_start, concurrent
> reads will immediately see the updates. This is what I was trying to say by
> "live" elsewhere. This is how the malloc bdev module works, and probably isn't
> a great bdev module design because it can't make any atomicity guarantees.
> 2) When updating data in the buffer returned by zcopy_start, future reads for
> the given blocks won't see the data until the buffer has been committed. We
> don't have any bdev modules that actually work like this today, but it's how I
> imagine a more robust caching bdev module would do it.
> 
> On the network side we also have to support four types of network interfaces.
> 
> 1) TPC/IP via sockets (iSCSI, NVMe-oF). For a read (which sends data over the
> network on the target side), we want to be able to ask the bdev module to
> provide a buffer that holds the data, then we use that to do a send() on the
> socket. That's the minimum number of copies. For a write (which recvs data
> over the network) we don't use zero copy semantics because the data has to be
> gathered into a buffer in user space anyway.
> 2) RDMA via ibverbs (NVMe-oF). For all paths, we want to ask the bdev to
> provide us a buffer describing the region on the disk and then we do an RDMA
> transfer into or out of it.
> 3) FC. For both paths, we want to ask the bdev to provide a buffer describing
> the region on the disk and then we do a DMA to/from the FC HBA.
> 4) virtio (vhost). The data is already in DMA-safe memory, so we don't benefit
> from zero copy at all.
> 
> So if you take all of these constraints and you work out an API, you basically
> end up with what I proposed.
> 
> > Thanks
> > Paul
> > 
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, 
> > Benjamin
> > Sent: Wednesday, July 11, 2018 2:53 PM
> > To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; 
> > Srikanth.Kaligotla(a)ne tapp.com
> > Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
> > Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your 
> > input.)
> > 
> > I've done a bunch of work here to formalize the zero copy API in the 
> > bdev layer.
> > The patch series begins here:
> > 
> > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > 
> > There has been some good discussion on the patches themselves, but I 
> > wanted to circle back to the mailing list for a bit. The basic idea is 
> > to add two new functions to the bdev layer:
> > 
> > int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
> > 			  struct spdk_io_channel *ch,
> > 			  uint64_t offset_blocks, uint64_t num_blocks,
> > 			  bool populate,
> > 			  spdk_bdev_io_completion_cb cb, void *cb_arg); int 
> > spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
> > 			spdk_bdev_io_completion_cb cb, void *cb_arg);
> > 
> > The zcopy start call makes a request to the backing block device to 
> > provide a region of memory suitable for performing a zero copy update 
> > to the region described by the offset_blocks and num_blocks 
> > parameters. The populate flag determines whether the "live" data is 
> > present in the given buffer. When this request completes, it returns a 
> > bdev_io in the completion callback. That bdev_io will have a scatter 
> > gather list filled out describing the memory region.
> > 
> > When the user has finished transferring data into or out of the data 
> > buffer, the user calls spdk_bdev_zcopy_end with the originally 
> > provided bdev_io. The commit flag indicates whether the data in the buffer
> > needs to be made "live".
> > 
> > The NVMe-oF target can take advantage of these new APIs by altering 
> > the state machine in the RDMA layer to call spdk_bdev_zcopy_start to 
> > obtain data buffers after parsing the command, then performing RDMA 
> > transfers into or out of the provided region, and finally completing the
> > command with spdk_bdev_zcopy_end.
> > The work to do those modifications has not been done yet, and I expect 
> > them to be a significant amount of effort.
> > 
> > As many bdev modules as possible should attempt to support zero copy 
> > semantics, but for those bdev modules that aren't able to, one of the 
> > patches in my series will automatically emulate the zcopy commands 
> > using regular reads and writes.
> > This allows the higher level users of the bdev API to always make the 
> > zcopy calls for simplicity. The emulation path should be no worse, in 
> > terms of performance, than doing a regular read or write.
> > 
> > I believe I've addressed all known use cases with this design, and 
> > while there will certainly be some future patches to fix problems that 
> > come up, I'm confident that we're on the right track here. If there is 
> > any feedback in this area at all, now is the time to provide it.
> > 
> > I think this also leads into some really awesome future work that I'll 
> > suggest in the hope that someone in the community is inspired. One 
> > idea is that the bdev_nvme module could support zero copy semantics by 
> > returning a pointer to a portion of a Controller Memory Buffer or 
> > Persistent Memory Buffer suitable for data. There aren't many devices 
> > for sale today that can do this, but it's in the spec and I imagine 
> > they're coming. This would allow the NVMe-oF target, for example, to 
> > perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. 
> > Similarly, for the bdev_pmem module, direct pointers into the the 
> > persistent memory could be returned. The NVMe-oF target could then 
> > perform RDMA transfers into or out of the persistent memory directly, 
> > and the zcopy end call with commit set to true would perform the required
> > flush instructions.
> > 
> > Thanks,
> > Ben
> > 
> > 
> > On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> > > I would like to jot down what requirements we are looking at. 
> > > Wherever there is "====wait" for something, there is a potential context
> > > switch.
> > > 
> > > For a write request, the following is the workflow for zero copy:
> > > 1) Host submits Write request
> > > 2) Protocol validation of request
> > > 4) Get iovs
> > > 5) R2T/RDMA_READ
> > > ====wait for host====
> > > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > > 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module ====wait 
> > > for bdev module to respond====
> > > 7) on bdev_module response, send rsp/status to host
> > > 8) free req resources
> > > 
> > > Read path:
> > > 1) Host submits Read request
> > > 2) Protocol validatation of request
> > > 3) Submit read to bdev_module
> > > ====wait for data blocks from bdev_module====
> > > 4) respond iov (data_in/RDMA_WRITE)
> > > ====wait for host====
> > > 5) host ack/RDMA_WRITE_COMPLETE
> > > 6) free req resources
> > > 
> > > In future we will have Persistent reservations where there can be 
> > > read/modify/update cases depending on how the bdev module handles it.
> > > So the zero copy apis needs to address future requirements as well.
> > > 
> > > The above workflows suggest we require a state to track at what step 
> > > the request is. A state will help to cleanup resources properly when 
> > > handling aborts, disconnects, namespace destroy workflows etc.
> > > So the requirement is to figure out a way to clean up resources 
> > > according to the state of the bdev_io in its life cycle and stitch 
> > > in the zero copy workflows into existing spdk code which assumes 
> > > buffers are pre-allocated.
> > > 
> > > I think Srikanth's proposed changes are made in target layer to 
> > > populate and commit to bdev module. However his suggestion is 
> > > changing the bdev_io life cycle which I understand from John is not 
> > > agreeable to others.
> > > 
> > > With the new zcopy apis also there is no clarity on the bdev_io life 
> > > cycle. A bdev_io is acquired during zcopy_start and it's not clear 
> > > how to deal with the bdev_io allocated in read/write paths.
> > > 
> > > The orchestration of zero copy either be done at transport layer, 
> > > target layer or bdev module layer.
> > > 
> > > We don't like the orchestration done from transport layer because:
> > > 1) No clear way to deal with bdev_io (acquired from zcopy) and 
> > > bdev_io (acquired from read/write codepaths).
> > > 2) If I understand correctly, by the time a read/write request is 
> > > submitted to target, the data_out would have happened. So the 
> > > protocol validation happens after data_out phase. For example an LBA 
> > > out of range request or invalid SGL length is returned after a 
> > > data_out is performed. We want to run through the protocol 
> > > validation before data_out phase.
> > > 
> > > I feel the bdev module is the ideal place to deal with zero copy 
> > > buffers. The bdev module knows how to get buffers and when to claim 
> > > them. All it needs is a proper cleanup and by the time request hits 
> > > bdev module, bdev_io is acquired so no need to change bdev_io life 
> > > cycle or introduce a new io type.
> > > 
> > > I propose the following:
> > > 1) Host submits Write request
> > > 2) Protocol validation of request
> > >    spdk_nvmf_request_exec
> > > 4) submit request to bdev module
> > >    spdk_bdev_writev_blocks(). bdev_io is acquired
> > > 5) bdev module determines the buffers needs to be allocated.
> > >    Will perform something like what is prposed in the patch:
> > > spdk_bdev_zcopy_start(populate=false)
> > > 6) New spdk_nvmf_request_exec_status
> > > (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> > > 7) Transport does R2T/RDMA_READ
> > > ====wait for host====
> > > 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > > 9) Perform Write 
> > > again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks)
> > > to bdev_module.
> > > ====wait for bdev module to respond====
> > > 10) on bdev_module response, send rsp/status to host
> > > 11) free req resources
> > >     free bdev_io spdk_bdev_free_io.
> > >     Here we need a hook to free resources allocated by bdev module
> > >     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > > 
> > > Read path:
> > > 1) Host submits Read request
> > > 2) Protocol validatation of request
> > > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module 
> > > ====wait for data blocks from bdev_module====
> > > 4) respond iov (data_in/RDMA_WRITE)
> > > ====wait for host====
> > > 5) host ack/RDMA_WRITE_COMPLETE
> > > 6) free req resources
> > >    free bdev_io spdk_bdev_free_io.
> > >    Here we need a hook to free resources allocated by bdev module
> > >    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > > 
> > > If it makes sense we can work on a patch.
> > > 
> > > Regards,
> > > ~Sriram
> > > 
> > > -----Original Message-----
> > > From: Walker, Benjamin <benjamin.walker(a)intel.com>
> > > Sent: Monday, July 9, 2018 11:22 PM
> > > To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram 
> > > <Sriram.Popuri @netapp.com>; Kaligotla, Srikanth 
> > > <Srikanth.Kaligotla(a)netapp.com>
> > > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin 
> > > <Ed.Rodriguez(a)netapp.com> ; spdk(a)lists.01.org; Pai, Madhu 
> > > <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard- broadcom.com 
> > > <john.barnard(a)broadcom.com>
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > 
> > > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > > > Hi Jim.
> > > > 
> > > > This is the patch series I believe Ben is proposing to replace
> > > > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > > > 
> > > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick
> > > > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
> > > > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf:
> > > > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
> > > > support when necessary
> > > > 
> > > > Is this correct?
> > > 
> > > 
> > > My patch series doesn't entirely replace Srikanth's set of patches. 
> > > There are two separate things necessary to implement zero copy. 
> > > First is the infrastructure within the bdev layer to make requests 
> > > of a bdev module to obtain  or release a suitable buffer for a zero 
> > > copy operation. Second is actually making the NVMe-oF target use 
> > > that mechanism, which includes extending the bdev_io lifetime. My 
> > > patch series only addresses the first part.
> > > 
> > > Thanks,
> > > Ben
> > > 
> > > 
> > > 
> > > > 
> > > > /John
> > > > 
> > > > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com>
> > > > wrote:
> > > > 
> > > >     Hi Sriram:
> > > >     
> > > >     Ben has some later patches in this series with some examples:
> > > >     
> > > >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - 
> > > > implements this zero copy API in the simple ‘malloc’ bdev module
> > > >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
> > > > zero copy API in the SPDK bdevperf utility
> > > >     
> > > >     -Jim
> > > >     
> > > >     
> > > >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com>
> > > > wrote:
> > > >     
> > > >         Sorry was not focusing on this change. Just give me a day 
> > > > or two to get back.
> > > >         From a quick glance I didn't understand how zcopy 
> > > > start/end fits into a read/write life cycle. Is there an example 
> > > > on how zcopy start/end is consumed or someone can do give me a 
> > > > quick dump on how its envisioned to be used.
> > > >         
> > > >         Regards,
> > > >         ~Sriram
> > > >         
> > > >         -----Original Message-----
> > > >         From: Meneghini, John 
> > > >         Sent: Friday, July 6, 2018 10:16 PM
> > > >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
> > > > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
> > > > <james.r.harris(a)intel.com>; Kaligotla, Srikanth 
> > > > <Srikanth.Kaligotla(a)netapp.c
> > > > om>
> > > >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
> > > > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
> > > > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com 
> > > > <john.barnard(a)br oadcom.com>; Popuri, Sriram <Sriram.
> > > > Popuri(a)netapp.com>
> > > >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >         
> > > >         I'm adding Sriram to this thread.  Sriram is another 
> > > > NetApp engineer who is working on this stuff internally.
> > > >         
> > > >         /John
> > > >         
> > > >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
> > > > <benjamin.walker(a)intel.com>
> > > > wrote:
> > > >         
> > > >             Hi Srikanth,
> > > >             
> > > >             I wanted to check in to see if you had any additional 
> > > > feedback on the zero copy
> > > >             operations in the bdev layer here:
> > > >             
> > > >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > > >             
> > > >             This does not address extending the lifetime of the 
> > > > bdev_io in the NVMe-oF
> > > >             target, but I think this is looking like the right 
> > > > mechanism for the bdev layer.
> > > >             
> > > >             Thanks,
> > > >             Ben
> > > >             
> > > >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> > > >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
> > > > sizing along with
> > > >             > spdk_bdev_queue_io_wait() meets your needs then 
> > > > regarding the spdk_bdev_io
> > > >             > memory.
> > > >             >  
> > > >             > Regarding zcopy_start/zcopy_end – it looks like 
> > > > you’ve already added a bunch
> > > >             > of comments to Ben’s patch on GerritHub.  For now 
> > > > I’d say let’s continue our
> > > >             > discussion there.  I’ve responded to a couple of 
> > > > similar questions there and
> > > >             > I’m sure Ben will have more replies tomorrow.
> > > >             >  
> > > >             > -Jim
> > > >             >  
> > > >             >  
> > > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com
> > > > >
> > > >             > Date: Thursday, June 21, 2018 at 5:01 PM
> > > >             > To: James Harris <james.r.harris(a)intel.com>, 
> > > > "Walker, Benjamin"
> > > > <benjamin.walk
> > > >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> > > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > > <raju.gottumukkala(a)broadcom .com>,
> > > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > > "Rodriguez, Edwin" <Ed.Rodrigue
> > > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > > <Madhusudan.Pai(a)netapp.com>,
> > > > "NGC-
> > > > john.barnard-
> > > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > > >             > org>
> > > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >             >  
> > > >             > Hi Jim,
> > > >             >  
> > > >             > I wish I joined the community meeting, I also missed the
> > > >             > spdk_bdev_queue_io_wait().
> > > >             >  
> > > >             > So, there are 2 issues I am intending to solve,
> > > >             >  
> > > >             > 1.       We want to ensure that an instance of bdev-io is
> > > > acquired prior to or
> > > >             > along with I/O data buffer. This allows for better 
> > > > error handling when bdev_io
> > > >             > pool is exhausted. Yes, we can solve it via sizing 
> > > > the pool right. The changes
> > > >             > queue-io-wait also addresses the problem.
> > > >             > 2.       Most importantly, by extending the life of bdev-
> > > > io
> > > > (Same life span as
> > > >             > nvmf_request structure), abort use case and other 
> > > > use cases that involve
> > > >             > cleaning up of IO data buffer are accomplished 
> > > > effectively. Let me elaborate;
> > > >             > bdev-io is the fabric that connects the nvme command 
> > > > and operation with the
> > > >             > backend. The driver context and data buffer context 
> > > > is stored in bdev-io. The
> > > >             > freeing up of bdev_io resource is pushed to the end 
> > > > so the I/O cleanup can
> > > >             > happen after controller has transmitted the data to 
> > > > the host. In absence of
> > > >             > bdev_io, we will have end up adding more and more 
> > > > void context in request
> > > >             > structure. Hence the push to extend the life of bdev_io.
> > > >             >  
> > > >             > I see Ben’s recent patch “zcopy_start” and 
> > > > “zcopy_end”; We can make that work
> > > >             > as long as the bdev-io allocated/acquired stays till 
> > > > the end.
> > > > One of the
> > > >             > challenges I see with that patch is defining a 
> > > > callback for the I/O
> > > >             > submission. For instance, zopy_start will allocate 
> > > > bdev_io and submits the I/O
> > > >             > to the bdev device. Underlying implementation can be 
> > > > synchronous or can be
> > > >             > asynchronous. The callback for this submission 
> > > > should check to see if it is a
> > > >             > BUFFER-READY or PENDING and accordingly relay it 
> > > > back to transport. Next phase
> > > >             > is actual I/O submission. Let’s say, it reuses the 
> > > > bdev-io obtained in ZCOPY-
> > > >             > START, now the callback should determine if it is a 
> > > > success or failure.
> > > >             > Finally when ZCOPY-END is invoked the supplied 
> > > > bdev_io will have all necessary
> > > >             > data to release the WRITE buffer or unlock the read 
> > > > buffer based on the
> > > >             > operation performed.
> > > >             >  
> > > >             > I hope I’m making sense. I guess, my effort is to 
> > > > extend the lifespan of bdev-
> > > >             > io and let it continue to host the driver context 
> > > > and buffer context populated
> > > >             > during the BEGIN phase.
> > > >             >  
> > > >             > Regards,
> > > >             > Srikanth
> > > >             >  
> > > >             > From: Harris, James R <james.r.harris(a)intel.com> 
> > > >             > Sent: Thursday, June 21, 2018 2:21 PM
> > > >             > To: Kaligotla, Srikanth 
> > > > <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
> > > >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
> > > > <daniel.verkamp(a)intel.c
> > > > om>
> > > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > > > <John.Menegh ini(a)netapp.com
> > > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > > > Madhu <Madhu sudan.Pai(a)neta
> > > >             > pp.com>; NGC-john.barnard-broadcom.com 
> > > > <john.barnard(a)broadcom.co
> > > > m>; spdk(a)lists
> > > >             > .01.org
> > > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >             >  
> > > >             > Hi Srikanth,
> > > >             >  
> > > >             > Following up on this thread and the discussion in 
> > > > yesterday’s community
> > > >             > meeting.
> > > >             >  
> > > >             > The recent spdk_bdev_queue_io_wait() changes allow 
> > > > an SPDK application to work
> > > >             > better in general when the spdk_bdev_io buffer pool 
> > > > is exhausted.  We still
> > > >             > need to make changes to the NVMe-oF target to use 
> > > > this new API but that should
> > > >             > be done soon-ish.
> > > >             >  
> > > >             > With that in place, the spdk_bdev_io buffer pool 
> > > > itself can be configured up
> > > >             > or down when the application starts.  Currently the 
> > > > default is 64K
> > > >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) 
> > > > ==
> > > > 216 plus the per-IO
> > > >             > context size allocated for the bdev module.  This 
> > > > can be up to
> > > > 192 bytes
> > > >             > (virtio bdev module) but is likely much smaller for 
> > > > you depending on the
> > > >             > context size for your ontap bdev module.
> > > >             >  
> > > >             > Let’s assume your per-IO context size is 64 bytes.  
> > > > 64K x (192 +
> > > > 64) = 16MB.
> > > >             >  
> > > >             > I’m not sure how many spdk_bdev_io you need in 
> > > > flight at any given time.  64K
> > > >             > seems like a lot but I’d be curious to hear your 
> > > > thoughts on this.  If this is
> > > >             > the right number, then worst case, there would be 
> > > > about 16MB of DRAM that
> > > >             > would sit unused if the NVMe-oF target in your 
> > > > system was not active.  Is that
> > > >             > too burdensome for your application?
> > > >             >  
> > > >             > Thanks,
> > > >             >  
> > > >             > -Jim
> > > >             >  
> > > >             >  
> > > >             >  
> > > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com
> > > > >
> > > >             > Date: Wednesday, June 20, 2018 at 12:47 PM
> > > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > > > James Harris <james.r.harr
> > > >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> > > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > > <raju.gottumukkala(a)broadcom .com>,
> > > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > > "Rodriguez, Edwin" <Ed.Rodrigue
> > > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > > <Madhusudan.Pai(a)netapp.com>,
> > > > "NGC-
> > > > john.barnard-
> > > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > > >             > org>
> > > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >             >  
> > > >             > Hello,
> > > >             >  
> > > >             > First revision of changes to extend the lifecycle of 
> > > > bdev_io are available for
> > > >             > review. I would like to solicit your input on the 
> > > > proposed API/Code flow
> > > >             > changes.
> > > >             >  
> > > >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > > >             >  
> > > >             > Thanks,
> > > >             > Srikanth
> > > >             >  
> > > >             >  
> > > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com
> > > > >
> > > >             > Date: Friday, May 11, 2018 at 2:27 PM
> > > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > > > "Harris, James R" <james.r
> > > >             > .harris(a)intel.com>
> > > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > > <raju.gottumukkala(a)broadcom .com>,
> > > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > > "Rodriguez, Edwin" <Ed.Rodrigue
> > > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > > <Madhusudan.Pai(a)netapp.com>,
> > > > "NGC-
> > > > john.barnard-
> > > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > > >             > org>
> > > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >             >  
> > > >             > CC: List
> > > >             >  
> > > >             > Hi Ben,
> > > >             >  
> > > >             > Your proposal to interface with the backend to 
> > > > acquire and release buffers is
> > > >             > good. You have accurately stated that the challenge 
> > > > is in developing intuitive
> > > >             > semantics. And that has been my struggle. To me, 
> > > > there are two problem
> > > >             > statements;
> > > >             >  
> > > >             > 1.       It is expected that bdev_io pool is sized
> > > > correctly
> > > > so
> > > > the call to
> > > >             > get bdev_io succeeds. Failure to acquire bdev_io 
> > > > will result in DEVICE-ERROR.
> > > >             > The transport is already capable of handling 
> > > > temporary memory failures by
> > > >             > moving the request to PENDING-Queue. Hence the 
> > > > proposal to change the bdev_io
> > > >             > lifecycle and perhaps connect it with the 
> > > > spdk_nvmf_request object. Thus all
> > > >             > buffer needs are addressed at the beginning of I/O
> > > > request.
> > > >             > 2.       I/O Data buffers are sourced and managed by the
> > > > backend. One of the
> > > >             > challenges I see with your proposed interface is the 
> > > > lack of details like if
> > > >             > resource is being acquired for a READ operation or 
> > > > WRITE operation. The
> > > >             > handling is quite different in each case. Since 
> > > > bdev_io->type is overloaded
> > > >             > the type of I/O operation is lost. I suppose one can 
> > > > cast the cb_arg
> > > >             > (nvmf_request) and then proceed. Also, the bdev_io 
> > > > should be present to
> > > >             > RELEASE the buffer. Zero copy semantics warrants 
> > > > that data buffer stays until
> > > >             > controller-to-host has occurred. In other words, 
> > > > bdev_io lives till REQUEST
> > > >             > come to COMPLETE state.
> > > >             >  
> > > >             > What are your thoughts in introducing 
> > > > spdk_bdev_init() and
> > > > spdk_bdev_fini() as
> > > >             > an alternative approach to extend the lifecyle of 
> > > > bdev_io and allow data
> > > >             > buffer management via bdev fn_table ?
> > > >             >  
> > > >             > I hope I’m making sense…
> > > >             >  
> > > >             > Thanks,
> > > >             > Srikanth
> > > >             >  
> > > >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> > > >             > Sent: Friday, May 11, 2018 12:28 PM
> > > >             > To: Harris, James R <james.r.harris(a)intel.com>; 
> > > > Kaligotla, Srikanth <Srikanth.
> > > >             > Kaligotla(a)netapp.com>
> > > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > > > <John.Menegh ini(a)netapp.com
> > > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > > > Madhu <Madhu sudan.Pai(a)neta
> > > >             > pp.com>; NGC-john.barnard-broadcom.com 
> > > > <john.barnard(a)broadcom.co
> > > > m>
> > > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >             >  
> > > >             > Hi Srikanth,
> > > >             >  
> > > >             > Yes - we'll need to introduce some way to acquire 
> > > > and release buffers from the
> > > >             > bdev layer earlier on in the state machine that 
> > > > processes an NVMe-oF request.
> > > >             > I've had this patch out for review for several 
> > > > months as a proposal for this
> > > >             > scenario:
> > > >             >  
> > > >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > > >             >  
> > > >             > It doesn't pass the tests - it's just a proposal for 
> > > > the interface. 90% of the
> > > >             > challenge here is in developing intuitive semantics.
> > > >             >  
> > > >             > Thanks,
> > > >             > Ben
> > > >             >  
> > > >             > P.S. This is the kind of discussion that would fit 
> > > > perfectly on the mailing
> > > >             > list.
> > > >             >  
> > > >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, 
> > > > Srikanth
> > > > wrote:
> > > >             > > Hi Ben, Hi James,
> > > >             > >  
> > > >             > > I would like to solicit opinions on the lifecycle 
> > > > for bdev-io resource
> > > >             > > object. Attached is the image of RDMA state 
> > > > machine in its current
> > > >             > > implementation. When the REQUEST enters the 
> > > > NEED-BUFFER state, the buffers
> > > >             > > necessary for carrying out I/O operation are 
> > > > allocated/acquired from the
> > > >             > > memory pool. An instance of BDEV-IO comes into 
> > > > existence after REQUEST
> > > >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is 
> > > > teared down as soon the
> > > >             > > backend returns. From a BDEV perspective, BDEV-IO 
> > > > is simply a translation
> > > >             > > unit that facilitates I/O buffers from the backend. 
> > > > The driver context
> > > >             > > embedded within bdev_io holds great deal of 
> > > > information pertaining to the
> > > >             > > I/O under execution. It assists in error handling, 
> > > > dereferencing the buffers
> > > >             > > upon I/O completion and in abort handling. In 
> > > > summary, the bdev_io stays
> > > >             > > alive until request has come to COMPLETE state. 
> > > > I’d like to hear peoples
> > > >             > > thoughts in introducing the plumbing to acquire 
> > > > BDEV-IO resource in REQUEST-
> > > >             > > NEED-BUFFER state and release it in 
> > > > REQUEST-COMPLETE state.  I will shortly
> > > >             > > have patch available for review that introduces 
> > > > spdk_bdev_init and
> > > >             > > spdk_bdev_fini which in turn invokes the 
> > > > corresponding bdev fn_table to
> > > >             > >  initialize/cleanup.  
> > > >             > >  
> > > >             > > I wanted to use this email to communicate our 
> > > > intent and solicit your
> > > >             > > feedback. We have a working implementation of the 
> > > > above proposal and prior
> > > >             > > to pushing it upstream for review would like to 
> > > > hear your thoughts. These
> > > >             > > proposed changes to upstream are a result of FC 
> > > > transport work in
> > > >             > > collaboration with the Broadcom team who are also 
> > > > copied to this mail.
> > > >             > > Myself and John will be at SPDK Dev conference and 
> > > > if required we can
> > > >             > > elaborate further on this proposal.
> > > >             > >  
> > > >             > > Thanks,
> > > >             > > Srikanth
> > > >             > 
> > > >             >
> > > >             
> > > >         
> > > >         
> > > >     
> > > >     
> > > > 
> > 
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-12 21:01 Luse, Paul E
  0 siblings, 0 replies; 7+ messages in thread
From: Luse, Paul E @ 2018-07-12 21:01 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 35599 bytes --]

Awesome Ben, thanks. This is quite a bit more sophisticated and flexible than I think what generally comes to mind with "zero copy" in terms of naming this thing but I guess I can't think of a better one - "data direct" or something, I dunno. Killer feature though no matter what its called... 

I was thinking some more about an earlier suggestion from John about using bytes instead of blocks in the API, have you thought much about that?

Thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Thursday, July 12, 2018 1:44 PM
To: John.Meneghini(a)netapp.com; spdk(a)lists.01.org; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
Cc: raju.gottumukkala(a)broadcom.com
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)

On Thu, 2018-07-12 at 01:16 +0000, Luse, Paul E wrote:
> This is a great explanation Ben, thanks.  I'll start looking through 
> the patches as well.  A few quick questions for additional context and 
> to make sure I understand your terminology:
> 
> * Can you expand on " The populate flag determines whether the "live" 
> data is present in the given buffer" maybe with an example or 
> something (what do you mean by "live", you use it a few times)

I struggled to find the words to explain this. When you do a zcopy_start, you are returned some region(s) of memory. If populate is false, that buffer doesn't contain any valid data. If populate is true, the buffer contains the data on the disk for the described region of blocks. Similarly with commit - if you memcpy into a buffer provided from zcopy_start, that doesn't necessarily become the data on a block device (i.e. visible from a separate read operation) until the data is committed. Obviously, block device modules may have wildly varying internal implementations. For example, the malloc bdev module just allocates a single region of memory representing the block device and returns pointers to it, so the populate and commit flags don't do anything at all. But the malloc bdev is a special case - the norm is that some portion of the disk will be cached in DRAM and the real data is stored on some persistent device. Obtaining a zero copy buffer may or may not have to page data into DRAM, and commit may or may not have to page data out.

> * You mention " I believe I've addressed all known use cases with this design"
> but in this email chain, at least, only cover the NVMe-oF use case.  
> Can you list the known cases you mention and maybe just say a few words about them?
> 

I'm glad you ask because it needs to be explained, but I'm also not glad you asked because now I have to type out the giant explanation!

There's a matrix of different things that need to be supported. On the bdev module side, we want to support a number of different implementations.

1) All data exists in addressable/DMA-able memory (equivalent to the malloc bdev
module)
2) Some data exists in addressable/DMA-able memory (a caching bdev, or the nvme bdev module with a controller memory buffer).

Further, within those two choices, there are two other options:

1) When updating the data in the buffer returned by zcopy_start, concurrent reads will immediately see the updates. This is what I was trying to say by "live" elsewhere. This is how the malloc bdev module works, and probably isn't a great bdev module design because it can't make any atomicity guarantees.
2) When updating data in the buffer returned by zcopy_start, future reads for the given blocks won't see the data until the buffer has been committed. We don't have any bdev modules that actually work like this today, but it's how I imagine a more robust caching bdev module would do it.

On the network side we also have to support four types of network interfaces.

1) TPC/IP via sockets (iSCSI, NVMe-oF). For a read (which sends data over the network on the target side), we want to be able to ask the bdev module to provide a buffer that holds the data, then we use that to do a send() on the socket. That's the minimum number of copies. For a write (which recvs data over the network) we don't use zero copy semantics because the data has to be gathered into a buffer in user space anyway.
2) RDMA via ibverbs (NVMe-oF). For all paths, we want to ask the bdev to provide us a buffer describing the region on the disk and then we do an RDMA transfer into or out of it.
3) FC. For both paths, we want to ask the bdev to provide a buffer describing the region on the disk and then we do a DMA to/from the FC HBA.
4) virtio (vhost). The data is already in DMA-safe memory, so we don't benefit from zero copy at all.

So if you take all of these constraints and you work out an API, you basically end up with what I proposed.

> Thanks
> Paul
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, 
> Benjamin
> Sent: Wednesday, July 11, 2018 2:53 PM
> To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; 
> Srikanth.Kaligotla(a)ne tapp.com
> Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
> Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your 
> input.)
> 
> I've done a bunch of work here to formalize the zero copy API in the 
> bdev layer.
> The patch series begins here:
> 
> https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> 
> There has been some good discussion on the patches themselves, but I 
> wanted to circle back to the mailing list for a bit. The basic idea is 
> to add two new functions to the bdev layer:
> 
> int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
> 			  struct spdk_io_channel *ch,
> 			  uint64_t offset_blocks, uint64_t num_blocks,
> 			  bool populate,
> 			  spdk_bdev_io_completion_cb cb, void *cb_arg); int 
> spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
> 			spdk_bdev_io_completion_cb cb, void *cb_arg);
> 
> The zcopy start call makes a request to the backing block device to 
> provide a region of memory suitable for performing a zero copy update 
> to the region described by the offset_blocks and num_blocks 
> parameters. The populate flag determines whether the "live" data is 
> present in the given buffer. When this request completes, it returns a 
> bdev_io in the completion callback. That bdev_io will have a scatter 
> gather list filled out describing the memory region.
> 
> When the user has finished transferring data into or out of the data 
> buffer, the user calls spdk_bdev_zcopy_end with the originally 
> provided bdev_io. The commit flag indicates whether the data in the buffer needs to be made "live".
> 
> The NVMe-oF target can take advantage of these new APIs by altering 
> the state machine in the RDMA layer to call spdk_bdev_zcopy_start to 
> obtain data buffers after parsing the command, then performing RDMA 
> transfers into or out of the provided region, and finally completing the command with spdk_bdev_zcopy_end.
> The work to do those modifications has not been done yet, and I expect 
> them to be a significant amount of effort.
> 
> As many bdev modules as possible should attempt to support zero copy 
> semantics, but for those bdev modules that aren't able to, one of the 
> patches in my series will automatically emulate the zcopy commands 
> using regular reads and writes.
> This allows the higher level users of the bdev API to always make the 
> zcopy calls for simplicity. The emulation path should be no worse, in 
> terms of performance, than doing a regular read or write.
> 
> I believe I've addressed all known use cases with this design, and 
> while there will certainly be some future patches to fix problems that 
> come up, I'm confident that we're on the right track here. If there is 
> any feedback in this area at all, now is the time to provide it.
> 
> I think this also leads into some really awesome future work that I'll 
> suggest in the hope that someone in the community is inspired. One 
> idea is that the bdev_nvme module could support zero copy semantics by 
> returning a pointer to a portion of a Controller Memory Buffer or 
> Persistent Memory Buffer suitable for data. There aren't many devices 
> for sale today that can do this, but it's in the spec and I imagine 
> they're coming. This would allow the NVMe-oF target, for example, to 
> perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. 
> Similarly, for the bdev_pmem module, direct pointers into the the 
> persistent memory could be returned. The NVMe-oF target could then 
> perform RDMA transfers into or out of the persistent memory directly, 
> and the zcopy end call with commit set to true would perform the required flush instructions.
> 
> Thanks,
> Ben
> 
> 
> On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> > I would like to jot down what requirements we are looking at. 
> > Wherever there is "====wait" for something, there is a potential context switch.
> > 
> > For a write request, the following is the workflow for zero copy:
> > 1) Host submits Write request
> > 2) Protocol validation of request
> > 4) Get iovs
> > 5) R2T/RDMA_READ
> > ====wait for host====
> > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module ====wait 
> > for bdev module to respond====
> > 7) on bdev_module response, send rsp/status to host
> > 8) free req resources
> > 
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validatation of request
> > 3) Submit read to bdev_module
> > ====wait for data blocks from bdev_module====
> > 4) respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) host ack/RDMA_WRITE_COMPLETE
> > 6) free req resources
> > 
> > In future we will have Persistent reservations where there can be 
> > read/modify/update cases depending on how the bdev module handles it.
> > So the zero copy apis needs to address future requirements as well.
> > 
> > The above workflows suggest we require a state to track at what step 
> > the request is. A state will help to cleanup resources properly when 
> > handling aborts, disconnects, namespace destroy workflows etc.
> > So the requirement is to figure out a way to clean up resources 
> > according to the state of the bdev_io in its life cycle and stitch 
> > in the zero copy workflows into existing spdk code which assumes 
> > buffers are pre-allocated.
> > 
> > I think Srikanth's proposed changes are made in target layer to 
> > populate and commit to bdev module. However his suggestion is 
> > changing the bdev_io life cycle which I understand from John is not 
> > agreeable to others.
> > 
> > With the new zcopy apis also there is no clarity on the bdev_io life 
> > cycle. A bdev_io is acquired during zcopy_start and it's not clear 
> > how to deal with the bdev_io allocated in read/write paths.
> > 
> > The orchestration of zero copy either be done at transport layer, 
> > target layer or bdev module layer.
> > 
> > We don't like the orchestration done from transport layer because:
> > 1) No clear way to deal with bdev_io (acquired from zcopy) and 
> > bdev_io (acquired from read/write codepaths).
> > 2) If I understand correctly, by the time a read/write request is 
> > submitted to target, the data_out would have happened. So the 
> > protocol validation happens after data_out phase. For example an LBA 
> > out of range request or invalid SGL length is returned after a 
> > data_out is performed. We want to run through the protocol 
> > validation before data_out phase.
> > 
> > I feel the bdev module is the ideal place to deal with zero copy 
> > buffers. The bdev module knows how to get buffers and when to claim 
> > them. All it needs is a proper cleanup and by the time request hits 
> > bdev module, bdev_io is acquired so no need to change bdev_io life 
> > cycle or introduce a new io type.
> > 
> > I propose the following:
> > 1) Host submits Write request
> > 2) Protocol validation of request
> >    spdk_nvmf_request_exec
> > 4) submit request to bdev module
> >    spdk_bdev_writev_blocks(). bdev_io is acquired
> > 5) bdev module determines the buffers needs to be allocated.
> >    Will perform something like what is prposed in the patch:
> > spdk_bdev_zcopy_start(populate=false)
> > 6) New spdk_nvmf_request_exec_status
> > (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> > 7) Transport does R2T/RDMA_READ
> > ====wait for host====
> > 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 9) Perform Write 
> > again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks)
> > to bdev_module.
> > ====wait for bdev module to respond====
> > 10) on bdev_module response, send rsp/status to host
> > 11) free req resources
> >     free bdev_io spdk_bdev_free_io.
> >     Here we need a hook to free resources allocated by bdev module
> >     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > 
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validatation of request
> > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module 
> > ====wait for data blocks from bdev_module====
> > 4) respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) host ack/RDMA_WRITE_COMPLETE
> > 6) free req resources
> >    free bdev_io spdk_bdev_free_io.
> >    Here we need a hook to free resources allocated by bdev module
> >    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > 
> > If it makes sense we can work on a patch.
> > 
> > Regards,
> > ~Sriram
> > 
> > -----Original Message-----
> > From: Walker, Benjamin <benjamin.walker(a)intel.com>
> > Sent: Monday, July 9, 2018 11:22 PM
> > To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram 
> > <Sriram.Popuri @netapp.com>; Kaligotla, Srikanth 
> > <Srikanth.Kaligotla(a)netapp.com>
> > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin 
> > <Ed.Rodriguez(a)netapp.com> ; spdk(a)lists.01.org; Pai, Madhu 
> > <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard- broadcom.com 
> > <john.barnard(a)broadcom.com>
> > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > 
> > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > > Hi Jim.
> > > 
> > > This is the patch series I believe Ben is proposing to replace
> > > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > > 
> > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick
> > > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
> > > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf:
> > > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
> > > support when necessary
> > > 
> > > Is this correct?
> > 
> > 
> > My patch series doesn't entirely replace Srikanth's set of patches. 
> > There are two separate things necessary to implement zero copy. 
> > First is the infrastructure within the bdev layer to make requests 
> > of a bdev module to obtain  or release a suitable buffer for a zero 
> > copy operation. Second is actually making the NVMe-oF target use 
> > that mechanism, which includes extending the bdev_io lifetime. My 
> > patch series only addresses the first part.
> > 
> > Thanks,
> > Ben
> > 
> > 
> > 
> > > 
> > > /John
> > > 
> > > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > > 
> > >     Hi Sriram:
> > >     
> > >     Ben has some later patches in this series with some examples:
> > >     
> > >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - 
> > > implements this zero copy API in the simple ‘malloc’ bdev module
> > >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
> > > zero copy API in the SPDK bdevperf utility
> > >     
> > >     -Jim
> > >     
> > >     
> > >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
> > >     
> > >         Sorry was not focusing on this change. Just give me a day 
> > > or two to get back.
> > >         From a quick glance I didn't understand how zcopy 
> > > start/end fits into a read/write life cycle. Is there an example 
> > > on how zcopy start/end is consumed or someone can do give me a 
> > > quick dump on how its envisioned to be used.
> > >         
> > >         Regards,
> > >         ~Sriram
> > >         
> > >         -----Original Message-----
> > >         From: Meneghini, John 
> > >         Sent: Friday, July 6, 2018 10:16 PM
> > >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
> > > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
> > > <james.r.harris(a)intel.com>; Kaligotla, Srikanth 
> > > <Srikanth.Kaligotla(a)netapp.c
> > > om>
> > >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
> > > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
> > > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com 
> > > <john.barnard(a)br oadcom.com>; Popuri, Sriram <Sriram.
> > > Popuri(a)netapp.com>
> > >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >         
> > >         I'm adding Sriram to this thread.  Sriram is another 
> > > NetApp engineer who is working on this stuff internally.
> > >         
> > >         /John
> > >         
> > >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
> > > <benjamin.walker(a)intel.com>
> > > wrote:
> > >         
> > >             Hi Srikanth,
> > >             
> > >             I wanted to check in to see if you had any additional 
> > > feedback on the zero copy
> > >             operations in the bdev layer here:
> > >             
> > >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >             
> > >             This does not address extending the lifetime of the 
> > > bdev_io in the NVMe-oF
> > >             target, but I think this is looking like the right 
> > > mechanism for the bdev layer.
> > >             
> > >             Thanks,
> > >             Ben
> > >             
> > >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> > >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
> > > sizing along with
> > >             > spdk_bdev_queue_io_wait() meets your needs then 
> > > regarding the spdk_bdev_io
> > >             > memory.
> > >             >  
> > >             > Regarding zcopy_start/zcopy_end – it looks like 
> > > you’ve already added a bunch
> > >             > of comments to Ben’s patch on GerritHub.  For now 
> > > I’d say let’s continue our
> > >             > discussion there.  I’ve responded to a couple of 
> > > similar questions there and
> > >             > I’m sure Ben will have more replies tomorrow.
> > >             >  
> > >             > -Jim
> > >             >  
> > >             >  
> > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> > >             > Date: Thursday, June 21, 2018 at 5:01 PM
> > >             > To: James Harris <james.r.harris(a)intel.com>, 
> > > "Walker, Benjamin"
> > > <benjamin.walk
> > >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > <raju.gottumukkala(a)broadcom .com>,
> > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > "Rodriguez, Edwin" <Ed.Rodrigue
> > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > <Madhusudan.Pai(a)netapp.com>,
> > > "NGC-
> > > john.barnard-
> > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > >             > org>
> > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hi Jim,
> > >             >  
> > >             > I wish I joined the community meeting, I also missed the
> > >             > spdk_bdev_queue_io_wait().
> > >             >  
> > >             > So, there are 2 issues I am intending to solve,
> > >             >  
> > >             > 1.       We want to ensure that an instance of bdev-io is
> > > acquired prior to or
> > >             > along with I/O data buffer. This allows for better 
> > > error handling when bdev_io
> > >             > pool is exhausted. Yes, we can solve it via sizing 
> > > the pool right. The changes
> > >             > queue-io-wait also addresses the problem.
> > >             > 2.       Most importantly, by extending the life of bdev-io
> > > (Same life span as
> > >             > nvmf_request structure), abort use case and other 
> > > use cases that involve
> > >             > cleaning up of IO data buffer are accomplished 
> > > effectively. Let me elaborate;
> > >             > bdev-io is the fabric that connects the nvme command 
> > > and operation with the
> > >             > backend. The driver context and data buffer context 
> > > is stored in bdev-io. The
> > >             > freeing up of bdev_io resource is pushed to the end 
> > > so the I/O cleanup can
> > >             > happen after controller has transmitted the data to 
> > > the host. In absence of
> > >             > bdev_io, we will have end up adding more and more 
> > > void context in request
> > >             > structure. Hence the push to extend the life of bdev_io.
> > >             >  
> > >             > I see Ben’s recent patch “zcopy_start” and 
> > > “zcopy_end”; We can make that work
> > >             > as long as the bdev-io allocated/acquired stays till 
> > > the end.
> > > One of the
> > >             > challenges I see with that patch is defining a 
> > > callback for the I/O
> > >             > submission. For instance, zopy_start will allocate 
> > > bdev_io and submits the I/O
> > >             > to the bdev device. Underlying implementation can be 
> > > synchronous or can be
> > >             > asynchronous. The callback for this submission 
> > > should check to see if it is a
> > >             > BUFFER-READY or PENDING and accordingly relay it 
> > > back to transport. Next phase
> > >             > is actual I/O submission. Let’s say, it reuses the 
> > > bdev-io obtained in ZCOPY-
> > >             > START, now the callback should determine if it is a 
> > > success or failure.
> > >             > Finally when ZCOPY-END is invoked the supplied 
> > > bdev_io will have all necessary
> > >             > data to release the WRITE buffer or unlock the read 
> > > buffer based on the
> > >             > operation performed.
> > >             >  
> > >             > I hope I’m making sense. I guess, my effort is to 
> > > extend the lifespan of bdev-
> > >             > io and let it continue to host the driver context 
> > > and buffer context populated
> > >             > during the BEGIN phase.
> > >             >  
> > >             > Regards,
> > >             > Srikanth
> > >             >  
> > >             > From: Harris, James R <james.r.harris(a)intel.com> 
> > >             > Sent: Thursday, June 21, 2018 2:21 PM
> > >             > To: Kaligotla, Srikanth 
> > > <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
> > >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
> > > <daniel.verkamp(a)intel.c
> > > om>
> > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > > <John.Menegh ini(a)netapp.com
> > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > > Madhu <Madhu sudan.Pai(a)neta
> > >             > pp.com>; NGC-john.barnard-broadcom.com 
> > > <john.barnard(a)broadcom.co
> > > m>; spdk(a)lists
> > >             > .01.org
> > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hi Srikanth,
> > >             >  
> > >             > Following up on this thread and the discussion in 
> > > yesterday’s community
> > >             > meeting.
> > >             >  
> > >             > The recent spdk_bdev_queue_io_wait() changes allow 
> > > an SPDK application to work
> > >             > better in general when the spdk_bdev_io buffer pool 
> > > is exhausted.  We still
> > >             > need to make changes to the NVMe-oF target to use 
> > > this new API but that should
> > >             > be done soon-ish.
> > >             >  
> > >             > With that in place, the spdk_bdev_io buffer pool 
> > > itself can be configured up
> > >             > or down when the application starts.  Currently the 
> > > default is 64K
> > >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) 
> > > ==
> > > 216 plus the per-IO
> > >             > context size allocated for the bdev module.  This 
> > > can be up to
> > > 192 bytes
> > >             > (virtio bdev module) but is likely much smaller for 
> > > you depending on the
> > >             > context size for your ontap bdev module.
> > >             >  
> > >             > Let’s assume your per-IO context size is 64 bytes.  
> > > 64K x (192 +
> > > 64) = 16MB.
> > >             >  
> > >             > I’m not sure how many spdk_bdev_io you need in 
> > > flight at any given time.  64K
> > >             > seems like a lot but I’d be curious to hear your 
> > > thoughts on this.  If this is
> > >             > the right number, then worst case, there would be 
> > > about 16MB of DRAM that
> > >             > would sit unused if the NVMe-oF target in your 
> > > system was not active.  Is that
> > >             > too burdensome for your application?
> > >             >  
> > >             > Thanks,
> > >             >  
> > >             > -Jim
> > >             >  
> > >             >  
> > >             >  
> > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> > >             > Date: Wednesday, June 20, 2018 at 12:47 PM
> > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > > James Harris <james.r.harr
> > >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > <raju.gottumukkala(a)broadcom .com>,
> > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > "Rodriguez, Edwin" <Ed.Rodrigue
> > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > <Madhusudan.Pai(a)netapp.com>,
> > > "NGC-
> > > john.barnard-
> > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > >             > org>
> > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hello,
> > >             >  
> > >             > First revision of changes to extend the lifecycle of 
> > > bdev_io are available for
> > >             > review. I would like to solicit your input on the 
> > > proposed API/Code flow
> > >             > changes.
> > >             >  
> > >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > >             >  
> > >             > Thanks,
> > >             > Srikanth
> > >             >  
> > >             >  
> > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> > >             > Date: Friday, May 11, 2018 at 2:27 PM
> > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > > "Harris, James R" <james.r
> > >             > .harris(a)intel.com>
> > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > <raju.gottumukkala(a)broadcom .com>,
> > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > "Rodriguez, Edwin" <Ed.Rodrigue
> > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > <Madhusudan.Pai(a)netapp.com>,
> > > "NGC-
> > > john.barnard-
> > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > >             > org>
> > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > CC: List
> > >             >  
> > >             > Hi Ben,
> > >             >  
> > >             > Your proposal to interface with the backend to 
> > > acquire and release buffers is
> > >             > good. You have accurately stated that the challenge 
> > > is in developing intuitive
> > >             > semantics. And that has been my struggle. To me, 
> > > there are two problem
> > >             > statements;
> > >             >  
> > >             > 1.       It is expected that bdev_io pool is sized correctly
> > > so
> > > the call to
> > >             > get bdev_io succeeds. Failure to acquire bdev_io 
> > > will result in DEVICE-ERROR.
> > >             > The transport is already capable of handling 
> > > temporary memory failures by
> > >             > moving the request to PENDING-Queue. Hence the 
> > > proposal to change the bdev_io
> > >             > lifecycle and perhaps connect it with the 
> > > spdk_nvmf_request object. Thus all
> > >             > buffer needs are addressed at the beginning of I/O request.
> > >             > 2.       I/O Data buffers are sourced and managed by the
> > > backend. One of the
> > >             > challenges I see with your proposed interface is the 
> > > lack of details like if
> > >             > resource is being acquired for a READ operation or 
> > > WRITE operation. The
> > >             > handling is quite different in each case. Since 
> > > bdev_io->type is overloaded
> > >             > the type of I/O operation is lost. I suppose one can 
> > > cast the cb_arg
> > >             > (nvmf_request) and then proceed. Also, the bdev_io 
> > > should be present to
> > >             > RELEASE the buffer. Zero copy semantics warrants 
> > > that data buffer stays until
> > >             > controller-to-host has occurred. In other words, 
> > > bdev_io lives till REQUEST
> > >             > come to COMPLETE state.
> > >             >  
> > >             > What are your thoughts in introducing 
> > > spdk_bdev_init() and
> > > spdk_bdev_fini() as
> > >             > an alternative approach to extend the lifecyle of 
> > > bdev_io and allow data
> > >             > buffer management via bdev fn_table ?
> > >             >  
> > >             > I hope I’m making sense…
> > >             >  
> > >             > Thanks,
> > >             > Srikanth
> > >             >  
> > >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> > >             > Sent: Friday, May 11, 2018 12:28 PM
> > >             > To: Harris, James R <james.r.harris(a)intel.com>; 
> > > Kaligotla, Srikanth <Srikanth.
> > >             > Kaligotla(a)netapp.com>
> > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > > <John.Menegh ini(a)netapp.com
> > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > > Madhu <Madhu sudan.Pai(a)neta
> > >             > pp.com>; NGC-john.barnard-broadcom.com 
> > > <john.barnard(a)broadcom.co
> > > m>
> > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hi Srikanth,
> > >             >  
> > >             > Yes - we'll need to introduce some way to acquire 
> > > and release buffers from the
> > >             > bdev layer earlier on in the state machine that 
> > > processes an NVMe-oF request.
> > >             > I've had this patch out for review for several 
> > > months as a proposal for this
> > >             > scenario:
> > >             >  
> > >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >             >  
> > >             > It doesn't pass the tests - it's just a proposal for 
> > > the interface. 90% of the
> > >             > challenge here is in developing intuitive semantics.
> > >             >  
> > >             > Thanks,
> > >             > Ben
> > >             >  
> > >             > P.S. This is the kind of discussion that would fit 
> > > perfectly on the mailing
> > >             > list.
> > >             >  
> > >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, 
> > > Srikanth
> > > wrote:
> > >             > > Hi Ben, Hi James,
> > >             > >  
> > >             > > I would like to solicit opinions on the lifecycle 
> > > for bdev-io resource
> > >             > > object. Attached is the image of RDMA state 
> > > machine in its current
> > >             > > implementation. When the REQUEST enters the 
> > > NEED-BUFFER state, the buffers
> > >             > > necessary for carrying out I/O operation are 
> > > allocated/acquired from the
> > >             > > memory pool. An instance of BDEV-IO comes into 
> > > existence after REQUEST
> > >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is 
> > > teared down as soon the
> > >             > > backend returns. From a BDEV perspective, BDEV-IO 
> > > is simply a translation
> > >             > > unit that facilitates I/O buffers from the backend. 
> > > The driver context
> > >             > > embedded within bdev_io holds great deal of 
> > > information pertaining to the
> > >             > > I/O under execution. It assists in error handling, 
> > > dereferencing the buffers
> > >             > > upon I/O completion and in abort handling. In 
> > > summary, the bdev_io stays
> > >             > > alive until request has come to COMPLETE state. 
> > > I’d like to hear peoples
> > >             > > thoughts in introducing the plumbing to acquire 
> > > BDEV-IO resource in REQUEST-
> > >             > > NEED-BUFFER state and release it in 
> > > REQUEST-COMPLETE state.  I will shortly
> > >             > > have patch available for review that introduces 
> > > spdk_bdev_init and
> > >             > > spdk_bdev_fini which in turn invokes the 
> > > corresponding bdev fn_table to
> > >             > >  initialize/cleanup.  
> > >             > >  
> > >             > > I wanted to use this email to communicate our 
> > > intent and solicit your
> > >             > > feedback. We have a working implementation of the 
> > > above proposal and prior
> > >             > > to pushing it upstream for review would like to 
> > > hear your thoughts. These
> > >             > > proposed changes to upstream are a result of FC 
> > > transport work in
> > >             > > collaboration with the Broadcom team who are also 
> > > copied to this mail.
> > >             > > Myself and John will be at SPDK Dev conference and 
> > > if required we can
> > >             > > elaborate further on this proposal.
> > >             > >  
> > >             > > Thanks,
> > >             > > Srikanth
> > >             > 
> > >             >
> > >             
> > >         
> > >         
> > >     
> > >     
> > > 
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-12 20:44 Walker, Benjamin
  0 siblings, 0 replies; 7+ messages in thread
From: Walker, Benjamin @ 2018-07-12 20:44 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 34589 bytes --]

On Thu, 2018-07-12 at 01:16 +0000, Luse, Paul E wrote:
> This is a great explanation Ben, thanks.  I'll start looking through the
> patches as well.  A few quick questions for additional context and to make
> sure I understand your terminology:
> 
> * Can you expand on " The populate flag determines whether the "live" data is
> present in the given buffer" maybe with an example or something (what do you
> mean by "live", you use it a few times)

I struggled to find the words to explain this. When you do a zcopy_start, you
are returned some region(s) of memory. If populate is false, that buffer doesn't
contain any valid data. If populate is true, the buffer contains the data on the
disk for the described region of blocks. Similarly with commit - if you memcpy
into a buffer provided from zcopy_start, that doesn't necessarily become the
data on a block device (i.e. visible from a separate read operation) until the
data is committed. Obviously, block device modules may have wildly varying
internal implementations. For example, the malloc bdev module just allocates a
single region of memory representing the block device and returns pointers to
it, so the populate and commit flags don't do anything at all. But the malloc
bdev is a special case - the norm is that some portion of the disk will be
cached in DRAM and the real data is stored on some persistent device. Obtaining
a zero copy buffer may or may not have to page data into DRAM, and commit may or
may not have to page data out.

> * You mention " I believe I've addressed all known use cases with this design"
> but in this email chain, at least, only cover the NVMe-oF use case.  Can you
> list the known cases you mention and maybe just say a few words about them?
> 

I'm glad you ask because it needs to be explained, but I'm also not glad you
asked because now I have to type out the giant explanation!

There's a matrix of different things that need to be supported. On the bdev
module side, we want to support a number of different implementations.

1) All data exists in addressable/DMA-able memory (equivalent to the malloc bdev
module)
2) Some data exists in addressable/DMA-able memory (a caching bdev, or the nvme
bdev module with a controller memory buffer).

Further, within those two choices, there are two other options:

1) When updating the data in the buffer returned by zcopy_start, concurrent
reads will immediately see the updates. This is what I was trying to say by
"live" elsewhere. This is how the malloc bdev module works, and probably isn't a
great bdev module design because it can't make any atomicity guarantees.
2) When updating data in the buffer returned by zcopy_start, future reads for
the given blocks won't see the data until the buffer has been committed. We
don't have any bdev modules that actually work like this today, but it's how I
imagine a more robust caching bdev module would do it.

On the network side we also have to support four types of network interfaces.

1) TPC/IP via sockets (iSCSI, NVMe-oF). For a read (which sends data over the
network on the target side), we want to be able to ask the bdev module to
provide a buffer that holds the data, then we use that to do a send() on the
socket. That's the minimum number of copies. For a write (which recvs data over
the network) we don't use zero copy semantics because the data has to be
gathered into a buffer in user space anyway.
2) RDMA via ibverbs (NVMe-oF). For all paths, we want to ask the bdev to provide
us a buffer describing the region on the disk and then we do an RDMA transfer
into or out of it.
3) FC. For both paths, we want to ask the bdev to provide a buffer describing
the region on the disk and then we do a DMA to/from the FC HBA.
4) virtio (vhost). The data is already in DMA-safe memory, so we don't benefit
from zero copy at all.

So if you take all of these constraints and you work out an API, you basically
end up with what I proposed.

> Thanks
> Paul
> 
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Wednesday, July 11, 2018 2:53 PM
> To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)ne
> tapp.com
> Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
> Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
> 
> I've done a bunch of work here to formalize the zero copy API in the bdev
> layer.
> The patch series begins here:
> 
> https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> 
> There has been some good discussion on the patches themselves, but I wanted to
> circle back to the mailing list for a bit. The basic idea is to add two new
> functions to the bdev layer:
> 
> int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
> 			  struct spdk_io_channel *ch,
> 			  uint64_t offset_blocks, uint64_t num_blocks,
> 			  bool populate,
> 			  spdk_bdev_io_completion_cb cb, void *cb_arg); int
> spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
> 			spdk_bdev_io_completion_cb cb, void *cb_arg);
> 
> The zcopy start call makes a request to the backing block device to provide a
> region of memory suitable for performing a zero copy update to the region
> described by the offset_blocks and num_blocks parameters. The populate flag
> determines whether the "live" data is present in the given buffer. When this
> request completes, it returns a bdev_io in the completion callback. That
> bdev_io will have a scatter gather list filled out describing the memory
> region.
> 
> When the user has finished transferring data into or out of the data buffer,
> the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The
> commit flag indicates whether the data in the buffer needs to be made "live".
> 
> The NVMe-oF target can take advantage of these new APIs by altering the state
> machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers
> after parsing the command, then performing RDMA transfers into or out of the
> provided region, and finally completing the command with spdk_bdev_zcopy_end.
> The work to do those modifications has not been done yet, and I expect them to
> be a significant amount of effort.
> 
> As many bdev modules as possible should attempt to support zero copy
> semantics, but for those bdev modules that aren't able to, one of the patches
> in my series will automatically emulate the zcopy commands using regular reads
> and writes.
> This allows the higher level users of the bdev API to always make the zcopy
> calls for simplicity. The emulation path should be no worse, in terms of
> performance, than doing a regular read or write.
> 
> I believe I've addressed all known use cases with this design, and while there
> will certainly be some future patches to fix problems that come up, I'm
> confident that we're on the right track here. If there is any feedback in this
> area at all, now is the time to provide it.
> 
> I think this also leads into some really awesome future work that I'll suggest
> in the hope that someone in the community is inspired. One idea is that the
> bdev_nvme module could support zero copy semantics by returning a pointer to a
> portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for
> data. There aren't many devices for sale today that can do this, but it's in
> the spec and I imagine they're coming. This would allow the NVMe-oF target,
> for example, to perform RDMA transfers peer-to-peer between the NIC and the
> NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the the
> persistent memory could be returned. The NVMe-oF target could then perform
> RDMA transfers into or out of the persistent memory directly, and the zcopy
> end call with commit set to true would perform the required flush
> instructions.
> 
> Thanks,
> Ben
> 
> 
> On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> > I would like to jot down what requirements we are looking at. Wherever 
> > there is "====wait" for something, there is a potential context switch.
> > 
> > For a write request, the following is the workflow for zero copy:
> > 1) Host submits Write request
> > 2) Protocol validation of request
> > 4) Get iovs
> > 5) R2T/RDMA_READ
> > ====wait for host====
> > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module ====wait for 
> > bdev module to respond====
> > 7) on bdev_module response, send rsp/status to host
> > 8) free req resources
> > 
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validatation of request
> > 3) Submit read to bdev_module
> > ====wait for data blocks from bdev_module====
> > 4) respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) host ack/RDMA_WRITE_COMPLETE
> > 6) free req resources
> > 
> > In future we will have Persistent reservations where there can be 
> > read/modify/update cases depending on how the bdev module handles it. 
> > So the zero copy apis needs to address future requirements as well.
> > 
> > The above workflows suggest we require a state to track at what step 
> > the request is. A state will help to cleanup resources properly when 
> > handling aborts, disconnects, namespace destroy workflows etc.
> > So the requirement is to figure out a way to clean up resources 
> > according to the state of the bdev_io in its life cycle and stitch in 
> > the zero copy workflows into existing spdk code which assumes buffers are
> > pre-allocated.
> > 
> > I think Srikanth's proposed changes are made in target layer to 
> > populate and commit to bdev module. However his suggestion is changing 
> > the bdev_io life cycle which I understand from John is not agreeable to
> > others.
> > 
> > With the new zcopy apis also there is no clarity on the bdev_io life 
> > cycle. A bdev_io is acquired during zcopy_start and it's not clear how 
> > to deal with the bdev_io allocated in read/write paths.
> > 
> > The orchestration of zero copy either be done at transport layer, 
> > target layer or bdev module layer.
> > 
> > We don't like the orchestration done from transport layer because:
> > 1) No clear way to deal with bdev_io (acquired from zcopy) and bdev_io 
> > (acquired from read/write codepaths).
> > 2) If I understand correctly, by the time a read/write request is 
> > submitted to target, the data_out would have happened. So the protocol 
> > validation happens after data_out phase. For example an LBA out of 
> > range request or invalid SGL length is returned after a data_out is 
> > performed. We want to run through the protocol validation before data_out
> > phase.
> > 
> > I feel the bdev module is the ideal place to deal with zero copy 
> > buffers. The bdev module knows how to get buffers and when to claim 
> > them. All it needs is a proper cleanup and by the time request hits 
> > bdev module, bdev_io is acquired so no need to change bdev_io life cycle or
> > introduce a new io type.
> > 
> > I propose the following:
> > 1) Host submits Write request
> > 2) Protocol validation of request
> >    spdk_nvmf_request_exec
> > 4) submit request to bdev module
> >    spdk_bdev_writev_blocks(). bdev_io is acquired
> > 5) bdev module determines the buffers needs to be allocated.
> >    Will perform something like what is prposed in the patch:
> > spdk_bdev_zcopy_start(populate=false)
> > 6) New spdk_nvmf_request_exec_status
> > (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> > 7) Transport does R2T/RDMA_READ
> > ====wait for host====
> > 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 9) Perform Write again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks) 
> > to bdev_module.
> > ====wait for bdev module to respond====
> > 10) on bdev_module response, send rsp/status to host
> > 11) free req resources
> >     free bdev_io spdk_bdev_free_io.
> >     Here we need a hook to free resources allocated by bdev module
> >     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > 
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validatation of request
> > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module ====wait 
> > for data blocks from bdev_module====
> > 4) respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) host ack/RDMA_WRITE_COMPLETE
> > 6) free req resources
> >    free bdev_io spdk_bdev_free_io.
> >    Here we need a hook to free resources allocated by bdev module
> >    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > 
> > If it makes sense we can work on a patch.
> > 
> > Regards,
> > ~Sriram
> > 
> > -----Original Message-----
> > From: Walker, Benjamin <benjamin.walker(a)intel.com>
> > Sent: Monday, July 9, 2018 11:22 PM
> > To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram 
> > <Sriram.Popuri @netapp.com>; Kaligotla, Srikanth 
> > <Srikanth.Kaligotla(a)netapp.com>
> > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin 
> > <Ed.Rodriguez(a)netapp.com> ; spdk(a)lists.01.org; Pai, Madhu 
> > <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard- broadcom.com 
> > <john.barnard(a)broadcom.com>
> > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > 
> > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > > Hi Jim.
> > > 
> > > This is the patch series I believe Ben is proposing to replace
> > > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > > 
> > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick
> > > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
> > > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf:
> > > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
> > > support when necessary
> > > 
> > > Is this correct?
> > 
> > 
> > My patch series doesn't entirely replace Srikanth's set of patches. 
> > There are two separate things necessary to implement zero copy. First 
> > is the infrastructure within the bdev layer to make requests of a bdev 
> > module to obtain  or release a suitable buffer for a zero copy 
> > operation. Second is actually making the NVMe-oF target use that 
> > mechanism, which includes extending the bdev_io lifetime. My patch series
> > only addresses the first part.
> > 
> > Thanks,
> > Ben
> > 
> > 
> > 
> > > 
> > > /John
> > > 
> > > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > > 
> > >     Hi Sriram:
> > >     
> > >     Ben has some later patches in this series with some examples:
> > >     
> > >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements 
> > > this zero copy API in the simple ‘malloc’ bdev module
> > >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
> > > zero copy API in the SPDK bdevperf utility
> > >     
> > >     -Jim
> > >     
> > >     
> > >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
> > >     
> > >         Sorry was not focusing on this change. Just give me a day or 
> > > two to get back.
> > >         From a quick glance I didn't understand how zcopy start/end 
> > > fits into a read/write life cycle. Is there an example on how zcopy 
> > > start/end is consumed or someone can do give me a quick dump on how 
> > > its envisioned to be used.
> > >         
> > >         Regards,
> > >         ~Sriram
> > >         
> > >         -----Original Message-----
> > >         From: Meneghini, John 
> > >         Sent: Friday, July 6, 2018 10:16 PM
> > >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
> > > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
> > > <james.r.harris(a)intel.com>; Kaligotla, Srikanth 
> > > <Srikanth.Kaligotla(a)netapp.c
> > > om>
> > >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
> > > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
> > > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com 
> > > <john.barnard(a)br oadcom.com>; Popuri, Sriram <Sriram.
> > > Popuri(a)netapp.com>
> > >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >         
> > >         I'm adding Sriram to this thread.  Sriram is another NetApp 
> > > engineer who is working on this stuff internally.
> > >         
> > >         /John
> > >         
> > >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
> > > <benjamin.walker(a)intel.com>
> > > wrote:
> > >         
> > >             Hi Srikanth,
> > >             
> > >             I wanted to check in to see if you had any additional 
> > > feedback on the zero copy
> > >             operations in the bdev layer here:
> > >             
> > >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >             
> > >             This does not address extending the lifetime of the 
> > > bdev_io in the NVMe-oF
> > >             target, but I think this is looking like the right 
> > > mechanism for the bdev layer.
> > >             
> > >             Thanks,
> > >             Ben
> > >             
> > >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> > >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
> > > sizing along with
> > >             > spdk_bdev_queue_io_wait() meets your needs then 
> > > regarding the spdk_bdev_io
> > >             > memory.
> > >             >  
> > >             > Regarding zcopy_start/zcopy_end – it looks like you’ve 
> > > already added a bunch
> > >             > of comments to Ben’s patch on GerritHub.  For now I’d 
> > > say let’s continue our
> > >             > discussion there.  I’ve responded to a couple of 
> > > similar questions there and
> > >             > I’m sure Ben will have more replies tomorrow.
> > >             >  
> > >             > -Jim
> > >             >  
> > >             >  
> > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> > >             > Date: Thursday, June 21, 2018 at 5:01 PM
> > >             > To: James Harris <james.r.harris(a)intel.com>, "Walker, 
> > > Benjamin"
> > > <benjamin.walk
> > >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > <raju.gottumukkala(a)broadcom .com>,
> > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > "Rodriguez, Edwin" <Ed.Rodrigue
> > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > <Madhusudan.Pai(a)netapp.com>,
> > > "NGC-
> > > john.barnard-
> > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > >             > org>
> > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hi Jim,
> > >             >  
> > >             > I wish I joined the community meeting, I also missed the
> > >             > spdk_bdev_queue_io_wait().
> > >             >  
> > >             > So, there are 2 issues I am intending to solve,
> > >             >  
> > >             > 1.       We want to ensure that an instance of bdev-io is
> > > acquired prior to or
> > >             > along with I/O data buffer. This allows for better 
> > > error handling when bdev_io
> > >             > pool is exhausted. Yes, we can solve it via sizing the 
> > > pool right. The changes
> > >             > queue-io-wait also addresses the problem.
> > >             > 2.       Most importantly, by extending the life of bdev-io
> > > (Same life span as
> > >             > nvmf_request structure), abort use case and other use 
> > > cases that involve
> > >             > cleaning up of IO data buffer are accomplished 
> > > effectively. Let me elaborate;
> > >             > bdev-io is the fabric that connects the nvme command 
> > > and operation with the
> > >             > backend. The driver context and data buffer context is 
> > > stored in bdev-io. The
> > >             > freeing up of bdev_io resource is pushed to the end so 
> > > the I/O cleanup can
> > >             > happen after controller has transmitted the data to 
> > > the host. In absence of
> > >             > bdev_io, we will have end up adding more and more void 
> > > context in request
> > >             > structure. Hence the push to extend the life of bdev_io.
> > >             >  
> > >             > I see Ben’s recent patch “zcopy_start” and 
> > > “zcopy_end”; We can make that work
> > >             > as long as the bdev-io allocated/acquired stays till the
> > > end.
> > > One of the
> > >             > challenges I see with that patch is defining a 
> > > callback for the I/O
> > >             > submission. For instance, zopy_start will allocate 
> > > bdev_io and submits the I/O
> > >             > to the bdev device. Underlying implementation can be 
> > > synchronous or can be
> > >             > asynchronous. The callback for this submission should 
> > > check to see if it is a
> > >             > BUFFER-READY or PENDING and accordingly relay it back 
> > > to transport. Next phase
> > >             > is actual I/O submission. Let’s say, it reuses the 
> > > bdev-io obtained in ZCOPY-
> > >             > START, now the callback should determine if it is a 
> > > success or failure.
> > >             > Finally when ZCOPY-END is invoked the supplied bdev_io 
> > > will have all necessary
> > >             > data to release the WRITE buffer or unlock the read 
> > > buffer based on the
> > >             > operation performed.
> > >             >  
> > >             > I hope I’m making sense. I guess, my effort is to 
> > > extend the lifespan of bdev-
> > >             > io and let it continue to host the driver context and 
> > > buffer context populated
> > >             > during the BEGIN phase.
> > >             >  
> > >             > Regards,
> > >             > Srikanth
> > >             >  
> > >             > From: Harris, James R <james.r.harris(a)intel.com> 
> > >             > Sent: Thursday, June 21, 2018 2:21 PM
> > >             > To: Kaligotla, Srikanth 
> > > <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
> > >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
> > > <daniel.verkamp(a)intel.c
> > > om>
> > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > > <John.Menegh ini(a)netapp.com
> > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > > Madhu <Madhu sudan.Pai(a)neta
> > >             > pp.com>; NGC-john.barnard-broadcom.com 
> > > <john.barnard(a)broadcom.co
> > > m>; spdk(a)lists
> > >             > .01.org
> > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hi Srikanth,
> > >             >  
> > >             > Following up on this thread and the discussion in 
> > > yesterday’s community
> > >             > meeting.
> > >             >  
> > >             > The recent spdk_bdev_queue_io_wait() changes allow an 
> > > SPDK application to work
> > >             > better in general when the spdk_bdev_io buffer pool is 
> > > exhausted.  We still
> > >             > need to make changes to the NVMe-oF target to use this 
> > > new API but that should
> > >             > be done soon-ish.
> > >             >  
> > >             > With that in place, the spdk_bdev_io buffer pool 
> > > itself can be configured up
> > >             > or down when the application starts.  Currently the 
> > > default is 64K
> > >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) ==
> > > 216 plus the per-IO
> > >             > context size allocated for the bdev module.  This can 
> > > be up to
> > > 192 bytes
> > >             > (virtio bdev module) but is likely much smaller for 
> > > you depending on the
> > >             > context size for your ontap bdev module.
> > >             >  
> > >             > Let’s assume your per-IO context size is 64 bytes.  
> > > 64K x (192 +
> > > 64) = 16MB.
> > >             >  
> > >             > I’m not sure how many spdk_bdev_io you need in flight 
> > > at any given time.  64K
> > >             > seems like a lot but I’d be curious to hear your 
> > > thoughts on this.  If this is
> > >             > the right number, then worst case, there would be 
> > > about 16MB of DRAM that
> > >             > would sit unused if the NVMe-oF target in your system 
> > > was not active.  Is that
> > >             > too burdensome for your application?
> > >             >  
> > >             > Thanks,
> > >             >  
> > >             > -Jim
> > >             >  
> > >             >  
> > >             >  
> > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> > >             > Date: Wednesday, June 20, 2018 at 12:47 PM
> > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > > James Harris <james.r.harr
> > >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > <raju.gottumukkala(a)broadcom .com>,
> > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > "Rodriguez, Edwin" <Ed.Rodrigue
> > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > <Madhusudan.Pai(a)netapp.com>,
> > > "NGC-
> > > john.barnard-
> > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > >             > org>
> > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hello,
> > >             >  
> > >             > First revision of changes to extend the lifecycle of 
> > > bdev_io are available for
> > >             > review. I would like to solicit your input on the 
> > > proposed API/Code flow
> > >             > changes.
> > >             >  
> > >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > >             >  
> > >             > Thanks,
> > >             > Srikanth
> > >             >  
> > >             >  
> > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> > >             > Date: Friday, May 11, 2018 at 2:27 PM
> > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > > "Harris, James R" <james.r
> > >             > .harris(a)intel.com>
> > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > > <raju.gottumukkala(a)broadcom .com>,
> > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > > "Rodriguez, Edwin" <Ed.Rodrigue
> > >             > z(a)netapp.com>, "Pai, Madhu" 
> > > <Madhusudan.Pai(a)netapp.com>,
> > > "NGC-
> > > john.barnard-
> > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > > "spdk(a)lists.01.org" < spdk(a)lists.01.
> > >             > org>
> > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > CC: List
> > >             >  
> > >             > Hi Ben,
> > >             >  
> > >             > Your proposal to interface with the backend to acquire 
> > > and release buffers is
> > >             > good. You have accurately stated that the challenge is 
> > > in developing intuitive
> > >             > semantics. And that has been my struggle. To me, there 
> > > are two problem
> > >             > statements;
> > >             >  
> > >             > 1.       It is expected that bdev_io pool is sized correctly
> > > so
> > > the call to
> > >             > get bdev_io succeeds. Failure to acquire bdev_io will 
> > > result in DEVICE-ERROR.
> > >             > The transport is already capable of handling temporary 
> > > memory failures by
> > >             > moving the request to PENDING-Queue. Hence the 
> > > proposal to change the bdev_io
> > >             > lifecycle and perhaps connect it with the 
> > > spdk_nvmf_request object. Thus all
> > >             > buffer needs are addressed at the beginning of I/O request.
> > >             > 2.       I/O Data buffers are sourced and managed by the
> > > backend. One of the
> > >             > challenges I see with your proposed interface is the 
> > > lack of details like if
> > >             > resource is being acquired for a READ operation or 
> > > WRITE operation. The
> > >             > handling is quite different in each case. Since 
> > > bdev_io->type is overloaded
> > >             > the type of I/O operation is lost. I suppose one can 
> > > cast the cb_arg
> > >             > (nvmf_request) and then proceed. Also, the bdev_io 
> > > should be present to
> > >             > RELEASE the buffer. Zero copy semantics warrants that 
> > > data buffer stays until
> > >             > controller-to-host has occurred. In other words, 
> > > bdev_io lives till REQUEST
> > >             > come to COMPLETE state.
> > >             >  
> > >             > What are your thoughts in introducing spdk_bdev_init() 
> > > and
> > > spdk_bdev_fini() as
> > >             > an alternative approach to extend the lifecyle of 
> > > bdev_io and allow data
> > >             > buffer management via bdev fn_table ?
> > >             >  
> > >             > I hope I’m making sense…
> > >             >  
> > >             > Thanks,
> > >             > Srikanth
> > >             >  
> > >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> > >             > Sent: Friday, May 11, 2018 12:28 PM
> > >             > To: Harris, James R <james.r.harris(a)intel.com>; 
> > > Kaligotla, Srikanth <Srikanth.
> > >             > Kaligotla(a)netapp.com>
> > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > > <John.Menegh ini(a)netapp.com
> > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > > Madhu <Madhu sudan.Pai(a)neta
> > >             > pp.com>; NGC-john.barnard-broadcom.com 
> > > <john.barnard(a)broadcom.co
> > > m>
> > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >             >  
> > >             > Hi Srikanth,
> > >             >  
> > >             > Yes - we'll need to introduce some way to acquire and 
> > > release buffers from the
> > >             > bdev layer earlier on in the state machine that 
> > > processes an NVMe-oF request.
> > >             > I've had this patch out for review for several months 
> > > as a proposal for this
> > >             > scenario:
> > >             >  
> > >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >             >  
> > >             > It doesn't pass the tests - it's just a proposal for 
> > > the interface. 90% of the
> > >             > challenge here is in developing intuitive semantics.
> > >             >  
> > >             > Thanks,
> > >             > Ben
> > >             >  
> > >             > P.S. This is the kind of discussion that would fit 
> > > perfectly on the mailing
> > >             > list.
> > >             >  
> > >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth
> > > wrote:
> > >             > > Hi Ben, Hi James,
> > >             > >  
> > >             > > I would like to solicit opinions on the lifecycle 
> > > for bdev-io resource
> > >             > > object. Attached is the image of RDMA state machine 
> > > in its current
> > >             > > implementation. When the REQUEST enters the 
> > > NEED-BUFFER state, the buffers
> > >             > > necessary for carrying out I/O operation are 
> > > allocated/acquired from the
> > >             > > memory pool. An instance of BDEV-IO comes into 
> > > existence after REQUEST
> > >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is 
> > > teared down as soon the
> > >             > > backend returns. From a BDEV perspective, BDEV-IO is 
> > > simply a translation
> > >             > > unit that facilitates I/O buffers from the backend. 
> > > The driver context
> > >             > > embedded within bdev_io holds great deal of 
> > > information pertaining to the
> > >             > > I/O under execution. It assists in error handling, 
> > > dereferencing the buffers
> > >             > > upon I/O completion and in abort handling. In 
> > > summary, the bdev_io stays
> > >             > > alive until request has come to COMPLETE state. I’d 
> > > like to hear peoples
> > >             > > thoughts in introducing the plumbing to acquire 
> > > BDEV-IO resource in REQUEST-
> > >             > > NEED-BUFFER state and release it in REQUEST-COMPLETE 
> > > state.  I will shortly
> > >             > > have patch available for review that introduces 
> > > spdk_bdev_init and
> > >             > > spdk_bdev_fini which in turn invokes the 
> > > corresponding bdev fn_table to
> > >             > >  initialize/cleanup.  
> > >             > >  
> > >             > > I wanted to use this email to communicate our intent 
> > > and solicit your
> > >             > > feedback. We have a working implementation of the 
> > > above proposal and prior
> > >             > > to pushing it upstream for review would like to hear 
> > > your thoughts. These
> > >             > > proposed changes to upstream are a result of FC 
> > > transport work in
> > >             > > collaboration with the Broadcom team who are also 
> > > copied to this mail.
> > >             > > Myself and John will be at SPDK Dev conference and 
> > > if required we can
> > >             > > elaborate further on this proposal.
> > >             > >  
> > >             > > Thanks,
> > >             > > Srikanth
> > >             > 
> > >             >
> > >             
> > >         
> > >         
> > >     
> > >     
> > > 
> 
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-12  4:30 Meneghini, John
  0 siblings, 0 replies; 7+ messages in thread
From: Meneghini, John @ 2018-07-12  4:30 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 33852 bytes --]

I'm sorry Ben, but I think the spdk_bdev_zcopy_start() and spdk_bdev_zcopy_end() APIs suck, and I don't want to use them in our BDEV.

Here is are some APIs that I think could possibly work: 

struct spdk_bdev_io *spdk_bdev_dma_start(struct spdk_bdev_desc *desc,
    			                                      struct spdk_io_channel *ch,
    			                                      uint64_t offset,                     /* Make this bytes instead of blocks because we may need to DMA something other than blocks */
                                                                                 uint64_t length,                     /* length in bytes of whole transfer */
                                                                                 uint32_t block_size,              /* block size in bytes. When set to 1 we are transferring bytes.*/
    			                                      enum spdk_bdev_io_type.  /* Everything else I need to know... LIVE, COMMIT, POPULATE, etc. is known by the io_type);

void spdk_bdev_dma_end(struct spdk_bdev_io *bdev_io);                         /* Rundown the bdev_io and clean up any iovec or dma resources */

int spdk_bdev_io_submit(struct spdk_bdev_io *bdev_io);

These APIs belong in the bdev->fn_table, and the flow through the transport would be:

spdk_nvmf_ctrlr_process_io_cmd
      nvmf_bdev_ctrlr_write_cmd
             bdev_io = spdk_bdev_dma_start()
             spdk_bdev_io_submit(bdev_io)

Etc.

This is the direction Srikanth was heading and I'd like to understand why that direction does not work.

/John


On 7/11/18, 9:17 PM, "Luse, Paul E" <paul.e.luse(a)intel.com> wrote:

    This is a great explanation Ben, thanks.  I'll start looking through the patches as well.  A few quick questions for additional context and to make sure I understand your terminology:
    
    * Can you expand on " The populate flag determines whether the "live" data is present in the given buffer" maybe with an example or something (what do you mean by "live", you use it a few times)
    * You mention " I believe I've addressed all known use cases with this design" but in this email chain, at least, only cover the NVMe-oF use case.  Can you list the known cases you mention and maybe just say a few words about them?
    
    Thanks
    Paul
    
    -----Original Message-----
    From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
    Sent: Wednesday, July 11, 2018 2:53 PM
    To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
    Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
    Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
    
    I've done a bunch of work here to formalize the zero copy API in the bdev layer.
    The patch series begins here:
    
    https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    
    There has been some good discussion on the patches themselves, but I wanted to circle back to the mailing list for a bit. The basic idea is to add two new functions to the bdev layer:
    
    int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
    			  struct spdk_io_channel *ch,
    			  uint64_t offset_blocks, uint64_t num_blocks,
    			  bool populate,
    			  spdk_bdev_io_completion_cb cb, void *cb_arg); int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
    			spdk_bdev_io_completion_cb cb, void *cb_arg);
    
    The zcopy start call makes a request to the backing block device to provide a region of memory suitable for performing a zero copy update to the region described by the offset_blocks and num_blocks parameters. The populate flag determines whether the "live" data is present in the given buffer. When this request completes, it returns a bdev_io in the completion callback. That bdev_io will have a scatter gather list filled out describing the memory region.
    
    When the user has finished transferring data into or out of the data buffer, the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The commit flag indicates whether the data in the buffer needs to be made "live".
    
    The NVMe-oF target can take advantage of these new APIs by altering the state machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers after parsing the command, then performing RDMA transfers into or out of the provided region, and finally completing the command with spdk_bdev_zcopy_end.
    The work to do those modifications has not been done yet, and I expect them to be a significant amount of effort.
    
    As many bdev modules as possible should attempt to support zero copy semantics, but for those bdev modules that aren't able to, one of the patches in my series will automatically emulate the zcopy commands using regular reads and writes.
    This allows the higher level users of the bdev API to always make the zcopy calls for simplicity. The emulation path should be no worse, in terms of performance, than doing a regular read or write.
    
    I believe I've addressed all known use cases with this design, and while there will certainly be some future patches to fix problems that come up, I'm confident that we're on the right track here. If there is any feedback in this area at all, now is the time to provide it.
    
    I think this also leads into some really awesome future work that I'll suggest in the hope that someone in the community is inspired. One idea is that the bdev_nvme module could support zero copy semantics by returning a pointer to a portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for data. There aren't many devices for sale today that can do this, but it's in the spec and I imagine they're coming. This would allow the NVMe-oF target, for example, to perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the the persistent memory could be returned. The NVMe-oF target could then perform RDMA transfers into or out of the persistent memory directly, and the zcopy end call with commit set to true would perform the required flush instructions.
    
    Thanks,
    Ben
    
    
    On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
    > I would like to jot down what requirements we are looking at. Wherever 
    > there is "====wait" for something, there is a potential context switch.
    > 
    > For a write request, the following is the workflow for zero copy:
    > 1) Host submits Write request
    > 2) Protocol validation of request
    > 4) Get iovs
    > 5) R2T/RDMA_READ
    > ====wait for host====
    > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
    > 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module ====wait for 
    > bdev module to respond====
    > 7) on bdev_module response, send rsp/status to host
    > 8) free req resources
    > 
    > Read path:
    > 1) Host submits Read request
    > 2) Protocol validatation of request
    > 3) Submit read to bdev_module
    > ====wait for data blocks from bdev_module====
    > 4) respond iov (data_in/RDMA_WRITE)
    > ====wait for host====
    > 5) host ack/RDMA_WRITE_COMPLETE
    > 6) free req resources
    > 
    > In future we will have Persistent reservations where there can be 
    > read/modify/update cases depending on how the bdev module handles it. 
    > So the zero copy apis needs to address future requirements as well.
    > 
    > The above workflows suggest we require a state to track at what step 
    > the request is. A state will help to cleanup resources properly when 
    > handling aborts, disconnects, namespace destroy workflows etc.
    > So the requirement is to figure out a way to clean up resources 
    > according to the state of the bdev_io in its life cycle and stitch in 
    > the zero copy workflows into existing spdk code which assumes buffers are pre-allocated.
    > 
    > I think Srikanth's proposed changes are made in target layer to 
    > populate and commit to bdev module. However his suggestion is changing 
    > the bdev_io life cycle which I understand from John is not agreeable to others.
    > 
    > With the new zcopy apis also there is no clarity on the bdev_io life 
    > cycle. A bdev_io is acquired during zcopy_start and it's not clear how 
    > to deal with the bdev_io allocated in read/write paths.
    > 
    > The orchestration of zero copy either be done at transport layer, 
    > target layer or bdev module layer.
    > 
    > We don't like the orchestration done from transport layer because:
    > 1) No clear way to deal with bdev_io (acquired from zcopy) and bdev_io 
    > (acquired from read/write codepaths).
    > 2) If I understand correctly, by the time a read/write request is 
    > submitted to target, the data_out would have happened. So the protocol 
    > validation happens after data_out phase. For example an LBA out of 
    > range request or invalid SGL length is returned after a data_out is 
    > performed. We want to run through the protocol validation before data_out phase.
    > 
    > I feel the bdev module is the ideal place to deal with zero copy 
    > buffers. The bdev module knows how to get buffers and when to claim 
    > them. All it needs is a proper cleanup and by the time request hits 
    > bdev module, bdev_io is acquired so no need to change bdev_io life cycle or introduce a new io type.
    > 
    > I propose the following:
    > 1) Host submits Write request
    > 2) Protocol validation of request
    >    spdk_nvmf_request_exec
    > 4) submit request to bdev module
    >    spdk_bdev_writev_blocks(). bdev_io is acquired
    > 5) bdev module determines the buffers needs to be allocated.
    >    Will perform something like what is prposed in the patch:
    > spdk_bdev_zcopy_start(populate=false)
    > 6) New spdk_nvmf_request_exec_status
    > (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
    > 7) Transport does R2T/RDMA_READ
    > ====wait for host====
    > 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
    > 9) Perform Write again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks) 
    > to bdev_module.
    > ====wait for bdev module to respond====
    > 10) on bdev_module response, send rsp/status to host
    > 11) free req resources
    >     free bdev_io spdk_bdev_free_io.
    >     Here we need a hook to free resources allocated by bdev module
    >     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
    > 
    > Read path:
    > 1) Host submits Read request
    > 2) Protocol validatation of request
    > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module ====wait 
    > for data blocks from bdev_module====
    > 4) respond iov (data_in/RDMA_WRITE)
    > ====wait for host====
    > 5) host ack/RDMA_WRITE_COMPLETE
    > 6) free req resources
    >    free bdev_io spdk_bdev_free_io.
    >    Here we need a hook to free resources allocated by bdev module
    >    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
    > 
    > If it makes sense we can work on a patch.
    > 
    > Regards,
    > ~Sriram
    > 
    > -----Original Message-----
    > From: Walker, Benjamin <benjamin.walker(a)intel.com>
    > Sent: Monday, July 9, 2018 11:22 PM
    > To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram 
    > <Sriram.Popuri @netapp.com>; Kaligotla, Srikanth 
    > <Srikanth.Kaligotla(a)netapp.com>
    > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin 
    > <Ed.Rodriguez(a)netapp.com> ; spdk(a)lists.01.org; Pai, Madhu 
    > <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard- broadcom.com 
    > <john.barnard(a)broadcom.com>
    > Subject: Re: BDEV-IO Lifecycle - Need your input.
    > 
    > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
    > > Hi Jim.
    > > 
    > > This is the patch series I believe Ben is proposing to replace
    > > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
    > > 
    > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick
    > > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
    > > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf:
    > > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
    > > support when necessary
    > > 
    > > Is this correct?
    > 
    > 
    > My patch series doesn't entirely replace Srikanth's set of patches. 
    > There are two separate things necessary to implement zero copy. First 
    > is the infrastructure within the bdev layer to make requests of a bdev 
    > module to obtain  or release a suitable buffer for a zero copy 
    > operation. Second is actually making the NVMe-oF target use that 
    > mechanism, which includes extending the bdev_io lifetime. My patch series only addresses the first part.
    > 
    > Thanks,
    > Ben
    > 
    > 
    > 
    > > 
    > > /John
    > > 
    > > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
    > > 
    > >     Hi Sriram:
    > >     
    > >     Ben has some later patches in this series with some examples:
    > >     
    > >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements 
    > > this zero copy API in the simple ‘malloc’ bdev module
    > >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
    > > zero copy API in the SPDK bdevperf utility
    > >     
    > >     -Jim
    > >     
    > >     
    > >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
    > >     
    > >         Sorry was not focusing on this change. Just give me a day or 
    > > two to get back.
    > >         From a quick glance I didn't understand how zcopy start/end 
    > > fits into a read/write life cycle. Is there an example on how zcopy 
    > > start/end is consumed or someone can do give me a quick dump on how 
    > > its envisioned to be used.
    > >         
    > >         Regards,
    > >         ~Sriram
    > >         
    > >         -----Original Message-----
    > >         From: Meneghini, John 
    > >         Sent: Friday, July 6, 2018 10:16 PM
    > >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
    > > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
    > > <james.r.harris(a)intel.com>; Kaligotla, Srikanth 
    > > <Srikanth.Kaligotla(a)netapp.c
    > > om>
    > >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
    > > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
    > > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com 
    > > <john.barnard(a)br oadcom.com>; Popuri, Sriram <Sriram.
    > > Popuri(a)netapp.com>
    > >         Subject: Re: BDEV-IO Lifecycle - Need your input.
    > >         
    > >         I'm adding Sriram to this thread.  Sriram is another NetApp 
    > > engineer who is working on this stuff internally.
    > >         
    > >         /John
    > >         
    > >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
    > > <benjamin.walker(a)intel.com>
    > > wrote:
    > >         
    > >             Hi Srikanth,
    > >             
    > >             I wanted to check in to see if you had any additional 
    > > feedback on the zero copy
    > >             operations in the bdev layer here:
    > >             
    > >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    > >             
    > >             This does not address extending the lifetime of the 
    > > bdev_io in the NVMe-oF
    > >             target, but I think this is looking like the right 
    > > mechanism for the bdev layer.
    > >             
    > >             Thanks,
    > >             Ben
    > >             
    > >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
    > >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
    > > sizing along with
    > >             > spdk_bdev_queue_io_wait() meets your needs then 
    > > regarding the spdk_bdev_io
    > >             > memory.
    > >             >  
    > >             > Regarding zcopy_start/zcopy_end – it looks like you’ve 
    > > already added a bunch
    > >             > of comments to Ben’s patch on GerritHub.  For now I’d 
    > > say let’s continue our
    > >             > discussion there.  I’ve responded to a couple of 
    > > similar questions there and
    > >             > I’m sure Ben will have more replies tomorrow.
    > >             >  
    > >             > -Jim
    > >             >  
    > >             >  
    > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > >             > Date: Thursday, June 21, 2018 at 5:01 PM
    > >             > To: James Harris <james.r.harris(a)intel.com>, "Walker, 
    > > Benjamin"
    > > <benjamin.walk
    > >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
    > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
    > > <raju.gottumukkala(a)broadcom .com>,
    > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
    > > "Rodriguez, Edwin" <Ed.Rodrigue
    > >             > z(a)netapp.com>, "Pai, Madhu" 
    > > <Madhusudan.Pai(a)netapp.com>,
    > > "NGC-
    > > john.barnard-
    > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
    > > "spdk(a)lists.01.org" < spdk(a)lists.01.
    > >             > org>
    > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
    > >             >  
    > >             > Hi Jim,
    > >             >  
    > >             > I wish I joined the community meeting, I also missed the
    > >             > spdk_bdev_queue_io_wait().
    > >             >  
    > >             > So, there are 2 issues I am intending to solve,
    > >             >  
    > >             > 1.       We want to ensure that an instance of bdev-io is
    > > acquired prior to or
    > >             > along with I/O data buffer. This allows for better 
    > > error handling when bdev_io
    > >             > pool is exhausted. Yes, we can solve it via sizing the 
    > > pool right. The changes
    > >             > queue-io-wait also addresses the problem.
    > >             > 2.       Most importantly, by extending the life of bdev-io
    > > (Same life span as
    > >             > nvmf_request structure), abort use case and other use 
    > > cases that involve
    > >             > cleaning up of IO data buffer are accomplished 
    > > effectively. Let me elaborate;
    > >             > bdev-io is the fabric that connects the nvme command 
    > > and operation with the
    > >             > backend. The driver context and data buffer context is 
    > > stored in bdev-io. The
    > >             > freeing up of bdev_io resource is pushed to the end so 
    > > the I/O cleanup can
    > >             > happen after controller has transmitted the data to 
    > > the host. In absence of
    > >             > bdev_io, we will have end up adding more and more void 
    > > context in request
    > >             > structure. Hence the push to extend the life of bdev_io.
    > >             >  
    > >             > I see Ben’s recent patch “zcopy_start” and 
    > > “zcopy_end”; We can make that work
    > >             > as long as the bdev-io allocated/acquired stays till the end.
    > > One of the
    > >             > challenges I see with that patch is defining a 
    > > callback for the I/O
    > >             > submission. For instance, zopy_start will allocate 
    > > bdev_io and submits the I/O
    > >             > to the bdev device. Underlying implementation can be 
    > > synchronous or can be
    > >             > asynchronous. The callback for this submission should 
    > > check to see if it is a
    > >             > BUFFER-READY or PENDING and accordingly relay it back 
    > > to transport. Next phase
    > >             > is actual I/O submission. Let’s say, it reuses the 
    > > bdev-io obtained in ZCOPY-
    > >             > START, now the callback should determine if it is a 
    > > success or failure.
    > >             > Finally when ZCOPY-END is invoked the supplied bdev_io 
    > > will have all necessary
    > >             > data to release the WRITE buffer or unlock the read 
    > > buffer based on the
    > >             > operation performed.
    > >             >  
    > >             > I hope I’m making sense. I guess, my effort is to 
    > > extend the lifespan of bdev-
    > >             > io and let it continue to host the driver context and 
    > > buffer context populated
    > >             > during the BEGIN phase.
    > >             >  
    > >             > Regards,
    > >             > Srikanth
    > >             >  
    > >             > From: Harris, James R <james.r.harris(a)intel.com> 
    > >             > Sent: Thursday, June 21, 2018 2:21 PM
    > >             > To: Kaligotla, Srikanth 
    > > <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
    > >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
    > > <daniel.verkamp(a)intel.c
    > > om>
    > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
    > > <John.Menegh ini(a)netapp.com
    > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
    > > Madhu <Madhu sudan.Pai(a)neta
    > >             > pp.com>; NGC-john.barnard-broadcom.com 
    > > <john.barnard(a)broadcom.co
    > > m>; spdk(a)lists
    > >             > .01.org
    > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
    > >             >  
    > >             > Hi Srikanth,
    > >             >  
    > >             > Following up on this thread and the discussion in 
    > > yesterday’s community
    > >             > meeting.
    > >             >  
    > >             > The recent spdk_bdev_queue_io_wait() changes allow an 
    > > SPDK application to work
    > >             > better in general when the spdk_bdev_io buffer pool is 
    > > exhausted.  We still
    > >             > need to make changes to the NVMe-oF target to use this 
    > > new API but that should
    > >             > be done soon-ish.
    > >             >  
    > >             > With that in place, the spdk_bdev_io buffer pool 
    > > itself can be configured up
    > >             > or down when the application starts.  Currently the 
    > > default is 64K
    > >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) ==
    > > 216 plus the per-IO
    > >             > context size allocated for the bdev module.  This can 
    > > be up to
    > > 192 bytes
    > >             > (virtio bdev module) but is likely much smaller for 
    > > you depending on the
    > >             > context size for your ontap bdev module.
    > >             >  
    > >             > Let’s assume your per-IO context size is 64 bytes.  
    > > 64K x (192 +
    > > 64) = 16MB.
    > >             >  
    > >             > I’m not sure how many spdk_bdev_io you need in flight 
    > > at any given time.  64K
    > >             > seems like a lot but I’d be curious to hear your 
    > > thoughts on this.  If this is
    > >             > the right number, then worst case, there would be 
    > > about 16MB of DRAM that
    > >             > would sit unused if the NVMe-oF target in your system 
    > > was not active.  Is that
    > >             > too burdensome for your application?
    > >             >  
    > >             > Thanks,
    > >             >  
    > >             > -Jim
    > >             >  
    > >             >  
    > >             >  
    > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > >             > Date: Wednesday, June 20, 2018 at 12:47 PM
    > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
    > > James Harris <james.r.harr
    > >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
    > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
    > > <raju.gottumukkala(a)broadcom .com>,
    > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
    > > "Rodriguez, Edwin" <Ed.Rodrigue
    > >             > z(a)netapp.com>, "Pai, Madhu" 
    > > <Madhusudan.Pai(a)netapp.com>,
    > > "NGC-
    > > john.barnard-
    > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
    > > "spdk(a)lists.01.org" < spdk(a)lists.01.
    > >             > org>
    > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
    > >             >  
    > >             > Hello,
    > >             >  
    > >             > First revision of changes to extend the lifecycle of 
    > > bdev_io are available for
    > >             > review. I would like to solicit your input on the 
    > > proposed API/Code flow
    > >             > changes.
    > >             >  
    > >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
    > >             >  
    > >             > Thanks,
    > >             > Srikanth
    > >             >  
    > >             >  
    > >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > >             > Date: Friday, May 11, 2018 at 2:27 PM
    > >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
    > > "Harris, James R" <james.r
    > >             > .harris(a)intel.com>
    > >             > Cc: "raju.gottumukkala(a)broadcom.com" 
    > > <raju.gottumukkala(a)broadcom .com>,
    > >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
    > > "Rodriguez, Edwin" <Ed.Rodrigue
    > >             > z(a)netapp.com>, "Pai, Madhu" 
    > > <Madhusudan.Pai(a)netapp.com>,
    > > "NGC-
    > > john.barnard-
    > >             > broadcom.com" <john.barnard(a)broadcom.com>, 
    > > "spdk(a)lists.01.org" < spdk(a)lists.01.
    > >             > org>
    > >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
    > >             >  
    > >             > CC: List
    > >             >  
    > >             > Hi Ben,
    > >             >  
    > >             > Your proposal to interface with the backend to acquire 
    > > and release buffers is
    > >             > good. You have accurately stated that the challenge is 
    > > in developing intuitive
    > >             > semantics. And that has been my struggle. To me, there 
    > > are two problem
    > >             > statements;
    > >             >  
    > >             > 1.       It is expected that bdev_io pool is sized correctly
    > > so
    > > the call to
    > >             > get bdev_io succeeds. Failure to acquire bdev_io will 
    > > result in DEVICE-ERROR.
    > >             > The transport is already capable of handling temporary 
    > > memory failures by
    > >             > moving the request to PENDING-Queue. Hence the 
    > > proposal to change the bdev_io
    > >             > lifecycle and perhaps connect it with the 
    > > spdk_nvmf_request object. Thus all
    > >             > buffer needs are addressed at the beginning of I/O request.
    > >             > 2.       I/O Data buffers are sourced and managed by the
    > > backend. One of the
    > >             > challenges I see with your proposed interface is the 
    > > lack of details like if
    > >             > resource is being acquired for a READ operation or 
    > > WRITE operation. The
    > >             > handling is quite different in each case. Since 
    > > bdev_io->type is overloaded
    > >             > the type of I/O operation is lost. I suppose one can 
    > > cast the cb_arg
    > >             > (nvmf_request) and then proceed. Also, the bdev_io 
    > > should be present to
    > >             > RELEASE the buffer. Zero copy semantics warrants that 
    > > data buffer stays until
    > >             > controller-to-host has occurred. In other words, 
    > > bdev_io lives till REQUEST
    > >             > come to COMPLETE state.
    > >             >  
    > >             > What are your thoughts in introducing spdk_bdev_init() 
    > > and
    > > spdk_bdev_fini() as
    > >             > an alternative approach to extend the lifecyle of 
    > > bdev_io and allow data
    > >             > buffer management via bdev fn_table ?
    > >             >  
    > >             > I hope I’m making sense…
    > >             >  
    > >             > Thanks,
    > >             > Srikanth
    > >             >  
    > >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
    > >             > Sent: Friday, May 11, 2018 12:28 PM
    > >             > To: Harris, James R <james.r.harris(a)intel.com>; 
    > > Kaligotla, Srikanth <Srikanth.
    > >             > Kaligotla(a)netapp.com>
    > >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
    > > <John.Menegh ini(a)netapp.com
    > >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
    > > Madhu <Madhu sudan.Pai(a)neta
    > >             > pp.com>; NGC-john.barnard-broadcom.com 
    > > <john.barnard(a)broadcom.co
    > > m>
    > >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
    > >             >  
    > >             > Hi Srikanth,
    > >             >  
    > >             > Yes - we'll need to introduce some way to acquire and 
    > > release buffers from the
    > >             > bdev layer earlier on in the state machine that 
    > > processes an NVMe-oF request.
    > >             > I've had this patch out for review for several months 
    > > as a proposal for this
    > >             > scenario:
    > >             >  
    > >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    > >             >  
    > >             > It doesn't pass the tests - it's just a proposal for 
    > > the interface. 90% of the
    > >             > challenge here is in developing intuitive semantics.
    > >             >  
    > >             > Thanks,
    > >             > Ben
    > >             >  
    > >             > P.S. This is the kind of discussion that would fit 
    > > perfectly on the mailing
    > >             > list.
    > >             >  
    > >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
    > >             > > Hi Ben, Hi James,
    > >             > >  
    > >             > > I would like to solicit opinions on the lifecycle 
    > > for bdev-io resource
    > >             > > object. Attached is the image of RDMA state machine 
    > > in its current
    > >             > > implementation. When the REQUEST enters the 
    > > NEED-BUFFER state, the buffers
    > >             > > necessary for carrying out I/O operation are 
    > > allocated/acquired from the
    > >             > > memory pool. An instance of BDEV-IO comes into 
    > > existence after REQUEST
    > >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is 
    > > teared down as soon the
    > >             > > backend returns. From a BDEV perspective, BDEV-IO is 
    > > simply a translation
    > >             > > unit that facilitates I/O buffers from the backend. 
    > > The driver context
    > >             > > embedded within bdev_io holds great deal of 
    > > information pertaining to the
    > >             > > I/O under execution. It assists in error handling, 
    > > dereferencing the buffers
    > >             > > upon I/O completion and in abort handling. In 
    > > summary, the bdev_io stays
    > >             > > alive until request has come to COMPLETE state. I’d 
    > > like to hear peoples
    > >             > > thoughts in introducing the plumbing to acquire 
    > > BDEV-IO resource in REQUEST-
    > >             > > NEED-BUFFER state and release it in REQUEST-COMPLETE 
    > > state.  I will shortly
    > >             > > have patch available for review that introduces 
    > > spdk_bdev_init and
    > >             > > spdk_bdev_fini which in turn invokes the 
    > > corresponding bdev fn_table to
    > >             > >  initialize/cleanup.  
    > >             > >  
    > >             > > I wanted to use this email to communicate our intent 
    > > and solicit your
    > >             > > feedback. We have a working implementation of the 
    > > above proposal and prior
    > >             > > to pushing it upstream for review would like to hear 
    > > your thoughts. These
    > >             > > proposed changes to upstream are a result of FC 
    > > transport work in
    > >             > > collaboration with the Broadcom team who are also 
    > > copied to this mail.
    > >             > > Myself and John will be at SPDK Dev conference and 
    > > if required we can
    > >             > > elaborate further on this proposal.
    > >             > >  
    > >             > > Thanks,
    > >             > > Srikanth
    > >             > 
    > >             >
    > >             
    > >         
    > >         
    > >     
    > >     
    > > 
    _______________________________________________
    SPDK mailing list
    SPDK(a)lists.01.org
    https://lists.01.org/mailman/listinfo/spdk
    


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-12  1:16 Luse, Paul E
  0 siblings, 0 replies; 7+ messages in thread
From: Luse, Paul E @ 2018-07-12  1:16 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 29649 bytes --]

This is a great explanation Ben, thanks.  I'll start looking through the patches as well.  A few quick questions for additional context and to make sure I understand your terminology:

* Can you expand on " The populate flag determines whether the "live" data is present in the given buffer" maybe with an example or something (what do you mean by "live", you use it a few times)
* You mention " I believe I've addressed all known use cases with this design" but in this email chain, at least, only cover the NVMe-oF use case.  Can you list the known cases you mention and maybe just say a few words about them?

Thanks
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Wednesday, July 11, 2018 2:53 PM
To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)

I've done a bunch of work here to formalize the zero copy API in the bdev layer.
The patch series begins here:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

There has been some good discussion on the patches themselves, but I wanted to circle back to the mailing list for a bit. The basic idea is to add two new functions to the bdev layer:

int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
			  struct spdk_io_channel *ch,
			  uint64_t offset_blocks, uint64_t num_blocks,
			  bool populate,
			  spdk_bdev_io_completion_cb cb, void *cb_arg); int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
			spdk_bdev_io_completion_cb cb, void *cb_arg);

The zcopy start call makes a request to the backing block device to provide a region of memory suitable for performing a zero copy update to the region described by the offset_blocks and num_blocks parameters. The populate flag determines whether the "live" data is present in the given buffer. When this request completes, it returns a bdev_io in the completion callback. That bdev_io will have a scatter gather list filled out describing the memory region.

When the user has finished transferring data into or out of the data buffer, the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The commit flag indicates whether the data in the buffer needs to be made "live".

The NVMe-oF target can take advantage of these new APIs by altering the state machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers after parsing the command, then performing RDMA transfers into or out of the provided region, and finally completing the command with spdk_bdev_zcopy_end.
The work to do those modifications has not been done yet, and I expect them to be a significant amount of effort.

As many bdev modules as possible should attempt to support zero copy semantics, but for those bdev modules that aren't able to, one of the patches in my series will automatically emulate the zcopy commands using regular reads and writes.
This allows the higher level users of the bdev API to always make the zcopy calls for simplicity. The emulation path should be no worse, in terms of performance, than doing a regular read or write.

I believe I've addressed all known use cases with this design, and while there will certainly be some future patches to fix problems that come up, I'm confident that we're on the right track here. If there is any feedback in this area at all, now is the time to provide it.

I think this also leads into some really awesome future work that I'll suggest in the hope that someone in the community is inspired. One idea is that the bdev_nvme module could support zero copy semantics by returning a pointer to a portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for data. There aren't many devices for sale today that can do this, but it's in the spec and I imagine they're coming. This would allow the NVMe-oF target, for example, to perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the the persistent memory could be returned. The NVMe-oF target could then perform RDMA transfers into or out of the persistent memory directly, and the zcopy end call with commit set to true would perform the required flush instructions.

Thanks,
Ben


On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> I would like to jot down what requirements we are looking at. Wherever 
> there is "====wait" for something, there is a potential context switch.
> 
> For a write request, the following is the workflow for zero copy:
> 1) Host submits Write request
> 2) Protocol validation of request
> 4) Get iovs
> 5) R2T/RDMA_READ
> ====wait for host====
> 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module ====wait for 
> bdev module to respond====
> 7) on bdev_module response, send rsp/status to host
> 8) free req resources
> 
> Read path:
> 1) Host submits Read request
> 2) Protocol validatation of request
> 3) Submit read to bdev_module
> ====wait for data blocks from bdev_module====
> 4) respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) host ack/RDMA_WRITE_COMPLETE
> 6) free req resources
> 
> In future we will have Persistent reservations where there can be 
> read/modify/update cases depending on how the bdev module handles it. 
> So the zero copy apis needs to address future requirements as well.
> 
> The above workflows suggest we require a state to track at what step 
> the request is. A state will help to cleanup resources properly when 
> handling aborts, disconnects, namespace destroy workflows etc.
> So the requirement is to figure out a way to clean up resources 
> according to the state of the bdev_io in its life cycle and stitch in 
> the zero copy workflows into existing spdk code which assumes buffers are pre-allocated.
> 
> I think Srikanth's proposed changes are made in target layer to 
> populate and commit to bdev module. However his suggestion is changing 
> the bdev_io life cycle which I understand from John is not agreeable to others.
> 
> With the new zcopy apis also there is no clarity on the bdev_io life 
> cycle. A bdev_io is acquired during zcopy_start and it's not clear how 
> to deal with the bdev_io allocated in read/write paths.
> 
> The orchestration of zero copy either be done at transport layer, 
> target layer or bdev module layer.
> 
> We don't like the orchestration done from transport layer because:
> 1) No clear way to deal with bdev_io (acquired from zcopy) and bdev_io 
> (acquired from read/write codepaths).
> 2) If I understand correctly, by the time a read/write request is 
> submitted to target, the data_out would have happened. So the protocol 
> validation happens after data_out phase. For example an LBA out of 
> range request or invalid SGL length is returned after a data_out is 
> performed. We want to run through the protocol validation before data_out phase.
> 
> I feel the bdev module is the ideal place to deal with zero copy 
> buffers. The bdev module knows how to get buffers and when to claim 
> them. All it needs is a proper cleanup and by the time request hits 
> bdev module, bdev_io is acquired so no need to change bdev_io life cycle or introduce a new io type.
> 
> I propose the following:
> 1) Host submits Write request
> 2) Protocol validation of request
>    spdk_nvmf_request_exec
> 4) submit request to bdev module
>    spdk_bdev_writev_blocks(). bdev_io is acquired
> 5) bdev module determines the buffers needs to be allocated.
>    Will perform something like what is prposed in the patch:
> spdk_bdev_zcopy_start(populate=false)
> 6) New spdk_nvmf_request_exec_status
> (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> 7) Transport does R2T/RDMA_READ
> ====wait for host====
> 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 9) Perform Write again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks) 
> to bdev_module.
> ====wait for bdev module to respond====
> 10) on bdev_module response, send rsp/status to host
> 11) free req resources
>     free bdev_io spdk_bdev_free_io.
>     Here we need a hook to free resources allocated by bdev module
>     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> 
> Read path:
> 1) Host submits Read request
> 2) Protocol validatation of request
> 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module ====wait 
> for data blocks from bdev_module====
> 4) respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) host ack/RDMA_WRITE_COMPLETE
> 6) free req resources
>    free bdev_io spdk_bdev_free_io.
>    Here we need a hook to free resources allocated by bdev module
>    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> 
> If it makes sense we can work on a patch.
> 
> Regards,
> ~Sriram
> 
> -----Original Message-----
> From: Walker, Benjamin <benjamin.walker(a)intel.com>
> Sent: Monday, July 9, 2018 11:22 PM
> To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram 
> <Sriram.Popuri @netapp.com>; Kaligotla, Srikanth 
> <Srikanth.Kaligotla(a)netapp.com>
> Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin 
> <Ed.Rodriguez(a)netapp.com> ; spdk(a)lists.01.org; Pai, Madhu 
> <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard- broadcom.com 
> <john.barnard(a)broadcom.com>
> Subject: Re: BDEV-IO Lifecycle - Need your input.
> 
> On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > Hi Jim.
> > 
> > This is the patch series I believe Ben is proposing to replace
> > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > 
> > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick
> > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
> > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf:
> > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
> > support when necessary
> > 
> > Is this correct?
> 
> 
> My patch series doesn't entirely replace Srikanth's set of patches. 
> There are two separate things necessary to implement zero copy. First 
> is the infrastructure within the bdev layer to make requests of a bdev 
> module to obtain  or release a suitable buffer for a zero copy 
> operation. Second is actually making the NVMe-oF target use that 
> mechanism, which includes extending the bdev_io lifetime. My patch series only addresses the first part.
> 
> Thanks,
> Ben
> 
> 
> 
> > 
> > /John
> > 
> > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > 
> >     Hi Sriram:
> >     
> >     Ben has some later patches in this series with some examples:
> >     
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements 
> > this zero copy API in the simple ‘malloc’ bdev module
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
> > zero copy API in the SPDK bdevperf utility
> >     
> >     -Jim
> >     
> >     
> >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
> >     
> >         Sorry was not focusing on this change. Just give me a day or 
> > two to get back.
> >         From a quick glance I didn't understand how zcopy start/end 
> > fits into a read/write life cycle. Is there an example on how zcopy 
> > start/end is consumed or someone can do give me a quick dump on how 
> > its envisioned to be used.
> >         
> >         Regards,
> >         ~Sriram
> >         
> >         -----Original Message-----
> >         From: Meneghini, John 
> >         Sent: Friday, July 6, 2018 10:16 PM
> >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
> > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
> > <james.r.harris(a)intel.com>; Kaligotla, Srikanth 
> > <Srikanth.Kaligotla(a)netapp.c
> > om>
> >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
> > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
> > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com 
> > <john.barnard(a)br oadcom.com>; Popuri, Sriram <Sriram.
> > Popuri(a)netapp.com>
> >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> >         
> >         I'm adding Sriram to this thread.  Sriram is another NetApp 
> > engineer who is working on this stuff internally.
> >         
> >         /John
> >         
> >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
> > <benjamin.walker(a)intel.com>
> > wrote:
> >         
> >             Hi Srikanth,
> >             
> >             I wanted to check in to see if you had any additional 
> > feedback on the zero copy
> >             operations in the bdev layer here:
> >             
> >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> >             
> >             This does not address extending the lifetime of the 
> > bdev_io in the NVMe-oF
> >             target, but I think this is looking like the right 
> > mechanism for the bdev layer.
> >             
> >             Thanks,
> >             Ben
> >             
> >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
> > sizing along with
> >             > spdk_bdev_queue_io_wait() meets your needs then 
> > regarding the spdk_bdev_io
> >             > memory.
> >             >  
> >             > Regarding zcopy_start/zcopy_end – it looks like you’ve 
> > already added a bunch
> >             > of comments to Ben’s patch on GerritHub.  For now I’d 
> > say let’s continue our
> >             > discussion there.  I’ve responded to a couple of 
> > similar questions there and
> >             > I’m sure Ben will have more replies tomorrow.
> >             >  
> >             > -Jim
> >             >  
> >             >  
> >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> >             > Date: Thursday, June 21, 2018 at 5:01 PM
> >             > To: James Harris <james.r.harris(a)intel.com>, "Walker, 
> > Benjamin"
> > <benjamin.walk
> >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > <raju.gottumukkala(a)broadcom .com>,
> >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > "Rodriguez, Edwin" <Ed.Rodrigue
> >             > z(a)netapp.com>, "Pai, Madhu" 
> > <Madhusudan.Pai(a)netapp.com>,
> > "NGC-
> > john.barnard-
> >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > "spdk(a)lists.01.org" < spdk(a)lists.01.
> >             > org>
> >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hi Jim,
> >             >  
> >             > I wish I joined the community meeting, I also missed the
> >             > spdk_bdev_queue_io_wait().
> >             >  
> >             > So, there are 2 issues I am intending to solve,
> >             >  
> >             > 1.       We want to ensure that an instance of bdev-io is
> > acquired prior to or
> >             > along with I/O data buffer. This allows for better 
> > error handling when bdev_io
> >             > pool is exhausted. Yes, we can solve it via sizing the 
> > pool right. The changes
> >             > queue-io-wait also addresses the problem.
> >             > 2.       Most importantly, by extending the life of bdev-io
> > (Same life span as
> >             > nvmf_request structure), abort use case and other use 
> > cases that involve
> >             > cleaning up of IO data buffer are accomplished 
> > effectively. Let me elaborate;
> >             > bdev-io is the fabric that connects the nvme command 
> > and operation with the
> >             > backend. The driver context and data buffer context is 
> > stored in bdev-io. The
> >             > freeing up of bdev_io resource is pushed to the end so 
> > the I/O cleanup can
> >             > happen after controller has transmitted the data to 
> > the host. In absence of
> >             > bdev_io, we will have end up adding more and more void 
> > context in request
> >             > structure. Hence the push to extend the life of bdev_io.
> >             >  
> >             > I see Ben’s recent patch “zcopy_start” and 
> > “zcopy_end”; We can make that work
> >             > as long as the bdev-io allocated/acquired stays till the end.
> > One of the
> >             > challenges I see with that patch is defining a 
> > callback for the I/O
> >             > submission. For instance, zopy_start will allocate 
> > bdev_io and submits the I/O
> >             > to the bdev device. Underlying implementation can be 
> > synchronous or can be
> >             > asynchronous. The callback for this submission should 
> > check to see if it is a
> >             > BUFFER-READY or PENDING and accordingly relay it back 
> > to transport. Next phase
> >             > is actual I/O submission. Let’s say, it reuses the 
> > bdev-io obtained in ZCOPY-
> >             > START, now the callback should determine if it is a 
> > success or failure.
> >             > Finally when ZCOPY-END is invoked the supplied bdev_io 
> > will have all necessary
> >             > data to release the WRITE buffer or unlock the read 
> > buffer based on the
> >             > operation performed.
> >             >  
> >             > I hope I’m making sense. I guess, my effort is to 
> > extend the lifespan of bdev-
> >             > io and let it continue to host the driver context and 
> > buffer context populated
> >             > during the BEGIN phase.
> >             >  
> >             > Regards,
> >             > Srikanth
> >             >  
> >             > From: Harris, James R <james.r.harris(a)intel.com> 
> >             > Sent: Thursday, June 21, 2018 2:21 PM
> >             > To: Kaligotla, Srikanth 
> > <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
> >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
> > <daniel.verkamp(a)intel.c
> > om>
> >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > <John.Menegh ini(a)netapp.com
> >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > Madhu <Madhu sudan.Pai(a)neta
> >             > pp.com>; NGC-john.barnard-broadcom.com 
> > <john.barnard(a)broadcom.co
> > m>; spdk(a)lists
> >             > .01.org
> >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hi Srikanth,
> >             >  
> >             > Following up on this thread and the discussion in 
> > yesterday’s community
> >             > meeting.
> >             >  
> >             > The recent spdk_bdev_queue_io_wait() changes allow an 
> > SPDK application to work
> >             > better in general when the spdk_bdev_io buffer pool is 
> > exhausted.  We still
> >             > need to make changes to the NVMe-oF target to use this 
> > new API but that should
> >             > be done soon-ish.
> >             >  
> >             > With that in place, the spdk_bdev_io buffer pool 
> > itself can be configured up
> >             > or down when the application starts.  Currently the 
> > default is 64K
> >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) ==
> > 216 plus the per-IO
> >             > context size allocated for the bdev module.  This can 
> > be up to
> > 192 bytes
> >             > (virtio bdev module) but is likely much smaller for 
> > you depending on the
> >             > context size for your ontap bdev module.
> >             >  
> >             > Let’s assume your per-IO context size is 64 bytes.  
> > 64K x (192 +
> > 64) = 16MB.
> >             >  
> >             > I’m not sure how many spdk_bdev_io you need in flight 
> > at any given time.  64K
> >             > seems like a lot but I’d be curious to hear your 
> > thoughts on this.  If this is
> >             > the right number, then worst case, there would be 
> > about 16MB of DRAM that
> >             > would sit unused if the NVMe-oF target in your system 
> > was not active.  Is that
> >             > too burdensome for your application?
> >             >  
> >             > Thanks,
> >             >  
> >             > -Jim
> >             >  
> >             >  
> >             >  
> >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> >             > Date: Wednesday, June 20, 2018 at 12:47 PM
> >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > James Harris <james.r.harr
> >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > <raju.gottumukkala(a)broadcom .com>,
> >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > "Rodriguez, Edwin" <Ed.Rodrigue
> >             > z(a)netapp.com>, "Pai, Madhu" 
> > <Madhusudan.Pai(a)netapp.com>,
> > "NGC-
> > john.barnard-
> >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > "spdk(a)lists.01.org" < spdk(a)lists.01.
> >             > org>
> >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hello,
> >             >  
> >             > First revision of changes to extend the lifecycle of 
> > bdev_io are available for
> >             > review. I would like to solicit your input on the 
> > proposed API/Code flow
> >             > changes.
> >             >  
> >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
> >             >  
> >             > Thanks,
> >             > Srikanth
> >             >  
> >             >  
> >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> >             > Date: Friday, May 11, 2018 at 2:27 PM
> >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > "Harris, James R" <james.r
> >             > .harris(a)intel.com>
> >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > <raju.gottumukkala(a)broadcom .com>,
> >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > "Rodriguez, Edwin" <Ed.Rodrigue
> >             > z(a)netapp.com>, "Pai, Madhu" 
> > <Madhusudan.Pai(a)netapp.com>,
> > "NGC-
> > john.barnard-
> >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > "spdk(a)lists.01.org" < spdk(a)lists.01.
> >             > org>
> >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > CC: List
> >             >  
> >             > Hi Ben,
> >             >  
> >             > Your proposal to interface with the backend to acquire 
> > and release buffers is
> >             > good. You have accurately stated that the challenge is 
> > in developing intuitive
> >             > semantics. And that has been my struggle. To me, there 
> > are two problem
> >             > statements;
> >             >  
> >             > 1.       It is expected that bdev_io pool is sized correctly
> > so
> > the call to
> >             > get bdev_io succeeds. Failure to acquire bdev_io will 
> > result in DEVICE-ERROR.
> >             > The transport is already capable of handling temporary 
> > memory failures by
> >             > moving the request to PENDING-Queue. Hence the 
> > proposal to change the bdev_io
> >             > lifecycle and perhaps connect it with the 
> > spdk_nvmf_request object. Thus all
> >             > buffer needs are addressed at the beginning of I/O request.
> >             > 2.       I/O Data buffers are sourced and managed by the
> > backend. One of the
> >             > challenges I see with your proposed interface is the 
> > lack of details like if
> >             > resource is being acquired for a READ operation or 
> > WRITE operation. The
> >             > handling is quite different in each case. Since 
> > bdev_io->type is overloaded
> >             > the type of I/O operation is lost. I suppose one can 
> > cast the cb_arg
> >             > (nvmf_request) and then proceed. Also, the bdev_io 
> > should be present to
> >             > RELEASE the buffer. Zero copy semantics warrants that 
> > data buffer stays until
> >             > controller-to-host has occurred. In other words, 
> > bdev_io lives till REQUEST
> >             > come to COMPLETE state.
> >             >  
> >             > What are your thoughts in introducing spdk_bdev_init() 
> > and
> > spdk_bdev_fini() as
> >             > an alternative approach to extend the lifecyle of 
> > bdev_io and allow data
> >             > buffer management via bdev fn_table ?
> >             >  
> >             > I hope I’m making sense…
> >             >  
> >             > Thanks,
> >             > Srikanth
> >             >  
> >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> >             > Sent: Friday, May 11, 2018 12:28 PM
> >             > To: Harris, James R <james.r.harris(a)intel.com>; 
> > Kaligotla, Srikanth <Srikanth.
> >             > Kaligotla(a)netapp.com>
> >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > <John.Menegh ini(a)netapp.com
> >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > Madhu <Madhu sudan.Pai(a)neta
> >             > pp.com>; NGC-john.barnard-broadcom.com 
> > <john.barnard(a)broadcom.co
> > m>
> >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hi Srikanth,
> >             >  
> >             > Yes - we'll need to introduce some way to acquire and 
> > release buffers from the
> >             > bdev layer earlier on in the state machine that 
> > processes an NVMe-oF request.
> >             > I've had this patch out for review for several months 
> > as a proposal for this
> >             > scenario:
> >             >  
> >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> >             >  
> >             > It doesn't pass the tests - it's just a proposal for 
> > the interface. 90% of the
> >             > challenge here is in developing intuitive semantics.
> >             >  
> >             > Thanks,
> >             > Ben
> >             >  
> >             > P.S. This is the kind of discussion that would fit 
> > perfectly on the mailing
> >             > list.
> >             >  
> >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> >             > > Hi Ben, Hi James,
> >             > >  
> >             > > I would like to solicit opinions on the lifecycle 
> > for bdev-io resource
> >             > > object. Attached is the image of RDMA state machine 
> > in its current
> >             > > implementation. When the REQUEST enters the 
> > NEED-BUFFER state, the buffers
> >             > > necessary for carrying out I/O operation are 
> > allocated/acquired from the
> >             > > memory pool. An instance of BDEV-IO comes into 
> > existence after REQUEST
> >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is 
> > teared down as soon the
> >             > > backend returns. From a BDEV perspective, BDEV-IO is 
> > simply a translation
> >             > > unit that facilitates I/O buffers from the backend. 
> > The driver context
> >             > > embedded within bdev_io holds great deal of 
> > information pertaining to the
> >             > > I/O under execution. It assists in error handling, 
> > dereferencing the buffers
> >             > > upon I/O completion and in abort handling. In 
> > summary, the bdev_io stays
> >             > > alive until request has come to COMPLETE state. I’d 
> > like to hear peoples
> >             > > thoughts in introducing the plumbing to acquire 
> > BDEV-IO resource in REQUEST-
> >             > > NEED-BUFFER state and release it in REQUEST-COMPLETE 
> > state.  I will shortly
> >             > > have patch available for review that introduces 
> > spdk_bdev_init and
> >             > > spdk_bdev_fini which in turn invokes the 
> > corresponding bdev fn_table to
> >             > >  initialize/cleanup.  
> >             > >  
> >             > > I wanted to use this email to communicate our intent 
> > and solicit your
> >             > > feedback. We have a working implementation of the 
> > above proposal and prior
> >             > > to pushing it upstream for review would like to hear 
> > your thoughts. These
> >             > > proposed changes to upstream are a result of FC 
> > transport work in
> >             > > collaboration with the Broadcom team who are also 
> > copied to this mail.
> >             > > Myself and John will be at SPDK Dev conference and 
> > if required we can
> >             > > elaborate further on this proposal.
> >             > >  
> >             > > Thanks,
> >             > > Srikanth
> >             > 
> >             >
> >             
> >         
> >         
> >     
> >     
> > 
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
@ 2018-07-11 21:52 Walker, Benjamin
  0 siblings, 0 replies; 7+ messages in thread
From: Walker, Benjamin @ 2018-07-11 21:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 28480 bytes --]

I've done a bunch of work here to formalize the zero copy API in the bdev layer.
The patch series begins here:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

There has been some good discussion on the patches themselves, but I wanted to
circle back to the mailing list for a bit. The basic idea is to add two new
functions to the bdev layer:

int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
			  struct spdk_io_channel *ch,
			  uint64_t offset_blocks, uint64_t num_blocks,
			  bool populate,
			  spdk_bdev_io_completion_cb cb, void *cb_arg);
int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
			spdk_bdev_io_completion_cb cb, void *cb_arg);

The zcopy start call makes a request to the backing block device to provide a
region of memory suitable for performing a zero copy update to the region
described by the offset_blocks and num_blocks parameters. The populate flag
determines whether the "live" data is present in the given buffer. When this
request completes, it returns a bdev_io in the completion callback. That bdev_io
will have a scatter gather list filled out describing the memory region.

When the user has finished transferring data into or out of the data buffer, the
user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The commit
flag indicates whether the data in the buffer needs to be made "live".

The NVMe-oF target can take advantage of these new APIs by altering the state
machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers
after parsing the command, then performing RDMA transfers into or out of the
provided region, and finally completing the command with spdk_bdev_zcopy_end.
The work to do those modifications has not been done yet, and I expect them to
be a significant amount of effort.

As many bdev modules as possible should attempt to support zero copy semantics,
but for those bdev modules that aren't able to, one of the patches in my series
will automatically emulate the zcopy commands using regular reads and writes.
This allows the higher level users of the bdev API to always make the zcopy
calls for simplicity. The emulation path should be no worse, in terms of
performance, than doing a regular read or write.

I believe I've addressed all known use cases with this design, and while there
will certainly be some future patches to fix problems that come up, I'm
confident that we're on the right track here. If there is any feedback in this
area at all, now is the time to provide it.

I think this also leads into some really awesome future work that I'll suggest
in the hope that someone in the community is inspired. One idea is that the
bdev_nvme module could support zero copy semantics by returning a pointer to a
portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for
data. There aren't many devices for sale today that can do this, but it's in the
spec and I imagine they're coming. This would allow the NVMe-oF target, for
example, to perform RDMA transfers peer-to-peer between the NIC and the NVMe
SSD. Similarly, for the bdev_pmem module, direct pointers into the the
persistent memory could be returned. The NVMe-oF target could then perform RDMA
transfers into or out of the persistent memory directly, and the zcopy end call
with commit set to true would perform the required flush instructions.

Thanks,
Ben


On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> I would like to jot down what requirements we are looking at. Wherever there
> is "====wait" for something, there is a potential context switch.
> 
> For a write request, the following is the workflow for zero copy:
> 1) Host submits Write request
> 2) Protocol validation of request
> 4) Get iovs
> 5) R2T/RDMA_READ
> ====wait for host====
> 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 6) Perform Write(spdk_bdev_writev_blocks) to bdev_module
> ====wait for bdev module to respond====
> 7) on bdev_module response, send rsp/status to host
> 8) free req resources
> 
> Read path:
> 1) Host submits Read request
> 2) Protocol validatation of request
> 3) Submit read to bdev_module
> ====wait for data blocks from bdev_module====
> 4) respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) host ack/RDMA_WRITE_COMPLETE
> 6) free req resources
> 
> In future we will have Persistent reservations where there can be
> read/modify/update cases depending on how the bdev module handles it. So the
> zero copy apis needs to address future requirements as well.
> 
> The above workflows suggest we require a state to track at what step the
> request is. A state will help to cleanup resources properly when handling
> aborts, disconnects, namespace destroy workflows etc.
> So the requirement is to figure out a way to clean up resources according to
> the state of the bdev_io in its life cycle and stitch in the zero copy
> workflows into existing spdk code which assumes buffers are pre-allocated.
> 
> I think Srikanth's proposed changes are made in target layer to populate and
> commit to bdev module. However his suggestion is changing the bdev_io life
> cycle which I understand from John is not agreeable to others.
> 
> With the new zcopy apis also there is no clarity on the bdev_io life cycle. A
> bdev_io is acquired during zcopy_start and it's not clear how to deal with the
> bdev_io allocated in read/write paths.
> 
> The orchestration of zero copy either be done at transport layer, target layer
> or bdev module layer.
> 
> We don't like the orchestration done from transport layer because:
> 1) No clear way to deal with bdev_io (acquired from zcopy) and bdev_io
> (acquired from read/write codepaths).
> 2) If I understand correctly, by the time a read/write request is submitted to
> target, the data_out would have happened. So the protocol validation happens
> after data_out phase. For example an LBA out of range request or invalid SGL
> length is returned after a data_out is performed. We want to run through the
> protocol validation before data_out phase.
> 
> I feel the bdev module is the ideal place to deal with zero copy buffers. The
> bdev module knows how to get buffers and when to claim them. All it needs is a
> proper cleanup and by the time request hits bdev module, bdev_io is acquired
> so no need to change bdev_io life cycle or introduce a new io type.
> 
> I propose the following:
> 1) Host submits Write request
> 2) Protocol validation of request
>    spdk_nvmf_request_exec
> 4) submit request to bdev module
>    spdk_bdev_writev_blocks(). bdev_io is acquired
> 5) bdev module determines the buffers needs to be allocated.
>    Will perform something like what is prposed in the patch:
> spdk_bdev_zcopy_start(populate=false)
> 6) New spdk_nvmf_request_exec_status
> (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> 7) Transport does R2T/RDMA_READ
> ====wait for host====
> 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 9) Perform Write again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks) to
> bdev_module.
> ====wait for bdev module to respond====
> 10) on bdev_module response, send rsp/status to host
> 11) free req resources
>     free bdev_io spdk_bdev_free_io.
>     Here we need a hook to free resources allocated by bdev module
>     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> 
> Read path:
> 1) Host submits Read request
> 2) Protocol validatation of request
> 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
> ====wait for data blocks from bdev_module====
> 4) respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) host ack/RDMA_WRITE_COMPLETE
> 6) free req resources
>    free bdev_io spdk_bdev_free_io.
>    Here we need a hook to free resources allocated by bdev module
>    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> 
> If it makes sense we can work on a patch.
> 
> Regards,
> ~Sriram
> 
> -----Original Message-----
> From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> Sent: Monday, July 9, 2018 11:22 PM
> To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram <Sriram.Popuri
> @netapp.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
> Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>
> ; spdk(a)lists.01.org; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-
> broadcom.com <john.barnard(a)broadcom.com>
> Subject: Re: BDEV-IO Lifecycle - Need your input.
> 
> On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > Hi Jim.
> > 
> > This is the patch series I believe Ben is proposing to replace 
> > Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > 
> > pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick 
> > 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
> > malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf: 
> > Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
> > support when necessary
> > 
> > Is this correct?
> 
> 
> My patch series doesn't entirely replace Srikanth's set of patches. There are
> two separate things necessary to implement zero copy. First is the
> infrastructure within the bdev layer to make requests of a bdev module to
> obtain  or release a suitable buffer for a zero copy operation. Second is
> actually making the NVMe-oF target use that mechanism, which includes
> extending the bdev_io lifetime. My patch series only addresses the first part.
> 
> Thanks,
> Ben
> 
> 
> 
> > 
> > /John
> > 
> > On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> > 
> >     Hi Sriram:
> >     
> >     Ben has some later patches in this series with some examples:
> >     
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements 
> > this zero copy API in the simple ‘malloc’ bdev module
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
> > zero copy API in the SPDK bdevperf utility
> >     
> >     -Jim
> >     
> >     
> >     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
> >     
> >         Sorry was not focusing on this change. Just give me a day or 
> > two to get back.
> >         From a quick glance I didn't understand how zcopy start/end 
> > fits into a read/write life cycle. Is there an example on how zcopy 
> > start/end is consumed or someone can do give me a quick dump on how 
> > its envisioned to be used.
> >         
> >         Regards,
> >         ~Sriram
> >         
> >         -----Original Message-----
> >         From: Meneghini, John 
> >         Sent: Friday, July 6, 2018 10:16 PM
> >         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
> > Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
> > <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.c
> > om>
> >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
> > Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
> > <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com <john.barnard(a)br
> > oadcom.com>; Popuri, Sriram <Sriram.
> > Popuri(a)netapp.com>
> >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> >         
> >         I'm adding Sriram to this thread.  Sriram is another NetApp 
> > engineer who is working on this stuff internally.
> >         
> >         /John
> >         
> >         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
> > <benjamin.walker(a)intel.com>
> > wrote:
> >         
> >             Hi Srikanth,
> >             
> >             I wanted to check in to see if you had any additional 
> > feedback on the zero copy
> >             operations in the bdev layer here:
> >             
> >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> >             
> >             This does not address extending the lifetime of the 
> > bdev_io in the NVMe-oF
> >             target, but I think this is looking like the right 
> > mechanism for the bdev layer.
> >             
> >             Thanks,
> >             Ben
> >             
> >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> >             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
> > sizing along with
> >             > spdk_bdev_queue_io_wait() meets your needs then 
> > regarding the spdk_bdev_io
> >             > memory.
> >             >  
> >             > Regarding zcopy_start/zcopy_end – it looks like you’ve 
> > already added a bunch
> >             > of comments to Ben’s patch on GerritHub.  For now I’d 
> > say let’s continue our
> >             > discussion there.  I’ve responded to a couple of similar 
> > questions there and
> >             > I’m sure Ben will have more replies tomorrow.
> >             >  
> >             > -Jim
> >             >  
> >             >  
> >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> >             > Date: Thursday, June 21, 2018 at 5:01 PM
> >             > To: James Harris <james.r.harris(a)intel.com>, "Walker,
> > Benjamin"
> > <benjamin.walk
> >             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > <raju.gottumukkala(a)broadcom .com>,
> >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > "Rodriguez, Edwin" <Ed.Rodrigue
> >             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, 
> > "NGC-
> > john.barnard-
> >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > "spdk(a)lists.01.org" < spdk(a)lists.01.
> >             > org>
> >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hi Jim,
> >             >  
> >             > I wish I joined the community meeting, I also missed the
> >             > spdk_bdev_queue_io_wait().
> >             >  
> >             > So, there are 2 issues I am intending to solve,
> >             >  
> >             > 1.       We want to ensure that an instance of bdev-io is
> > acquired prior to or
> >             > along with I/O data buffer. This allows for better error 
> > handling when bdev_io
> >             > pool is exhausted. Yes, we can solve it via sizing the 
> > pool right. The changes
> >             > queue-io-wait also addresses the problem.
> >             > 2.       Most importantly, by extending the life of bdev-io
> > (Same life span as
> >             > nvmf_request structure), abort use case and other use 
> > cases that involve
> >             > cleaning up of IO data buffer are accomplished 
> > effectively. Let me elaborate;
> >             > bdev-io is the fabric that connects the nvme command and 
> > operation with the
> >             > backend. The driver context and data buffer context is 
> > stored in bdev-io. The
> >             > freeing up of bdev_io resource is pushed to the end so 
> > the I/O cleanup can
> >             > happen after controller has transmitted the data to the 
> > host. In absence of
> >             > bdev_io, we will have end up adding more and more void 
> > context in request
> >             > structure. Hence the push to extend the life of bdev_io.
> >             >  
> >             > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; 
> > We can make that work
> >             > as long as the bdev-io allocated/acquired stays till the end.
> > One of the
> >             > challenges I see with that patch is defining a callback 
> > for the I/O
> >             > submission. For instance, zopy_start will allocate 
> > bdev_io and submits the I/O
> >             > to the bdev device. Underlying implementation can be 
> > synchronous or can be
> >             > asynchronous. The callback for this submission should 
> > check to see if it is a
> >             > BUFFER-READY or PENDING and accordingly relay it back to 
> > transport. Next phase
> >             > is actual I/O submission. Let’s say, it reuses the 
> > bdev-io obtained in ZCOPY-
> >             > START, now the callback should determine if it is a 
> > success or failure.
> >             > Finally when ZCOPY-END is invoked the supplied bdev_io 
> > will have all necessary
> >             > data to release the WRITE buffer or unlock the read 
> > buffer based on the
> >             > operation performed.
> >             >  
> >             > I hope I’m making sense. I guess, my effort is to extend 
> > the lifespan of bdev-
> >             > io and let it continue to host the driver context and 
> > buffer context populated
> >             > during the BEGIN phase.
> >             >  
> >             > Regards,
> >             > Srikanth
> >             >  
> >             > From: Harris, James R <james.r.harris(a)intel.com> 
> >             > Sent: Thursday, June 21, 2018 2:21 PM
> >             > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; 
> > Walker, Benjamin <ben
> >             > jamin.walker(a)intel.com>; Verkamp, Daniel 
> > <daniel.verkamp(a)intel.c
> > om>
> >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > <John.Menegh ini(a)netapp.com
> >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > Madhu <Madhu sudan.Pai(a)neta
> >             > pp.com>; NGC-john.barnard-broadcom.com 
> > <john.barnard(a)broadcom.co
> > m>; spdk(a)lists
> >             > .01.org
> >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hi Srikanth,
> >             >  
> >             > Following up on this thread and the discussion in 
> > yesterday’s community
> >             > meeting.
> >             >  
> >             > The recent spdk_bdev_queue_io_wait() changes allow an 
> > SPDK application to work
> >             > better in general when the spdk_bdev_io buffer pool is 
> > exhausted.  We still
> >             > need to make changes to the NVMe-oF target to use this 
> > new API but that should
> >             > be done soon-ish.
> >             >  
> >             > With that in place, the spdk_bdev_io buffer pool itself 
> > can be configured up
> >             > or down when the application starts.  Currently the 
> > default is 64K
> >             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 
> > 216 plus the per-IO
> >             > context size allocated for the bdev module.  This can be 
> > up to
> > 192 bytes
> >             > (virtio bdev module) but is likely much smaller for you 
> > depending on the
> >             > context size for your ontap bdev module.
> >             >  
> >             > Let’s assume your per-IO context size is 64 bytes.  64K 
> > x (192 +
> > 64) = 16MB.
> >             >  
> >             > I’m not sure how many spdk_bdev_io you need in flight at 
> > any given time.  64K
> >             > seems like a lot but I’d be curious to hear your 
> > thoughts on this.  If this is
> >             > the right number, then worst case, there would be about 
> > 16MB of DRAM that
> >             > would sit unused if the NVMe-oF target in your system 
> > was not active.  Is that
> >             > too burdensome for your application?
> >             >  
> >             > Thanks,
> >             >  
> >             > -Jim
> >             >  
> >             >  
> >             >  
> >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> >             > Date: Wednesday, June 20, 2018 at 12:47 PM
> >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > James Harris <james.r.harr
> >             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > <raju.gottumukkala(a)broadcom .com>,
> >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > "Rodriguez, Edwin" <Ed.Rodrigue
> >             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, 
> > "NGC-
> > john.barnard-
> >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > "spdk(a)lists.01.org" < spdk(a)lists.01.
> >             > org>
> >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hello,
> >             >  
> >             > First revision of changes to extend the lifecycle of 
> > bdev_io are available for
> >             > review. I would like to solicit your input on the 
> > proposed API/Code flow
> >             > changes.
> >             >  
> >             > https://review.gerrithub.io/c/spdk/spdk/+/415860
> >             >  
> >             > Thanks,
> >             > Srikanth
> >             >  
> >             >  
> >             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> >             > Date: Friday, May 11, 2018 at 2:27 PM
> >             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> > "Harris, James R" <james.r
> >             > .harris(a)intel.com>
> >             > Cc: "raju.gottumukkala(a)broadcom.com" 
> > <raju.gottumukkala(a)broadcom .com>,
> >             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> > "Rodriguez, Edwin" <Ed.Rodrigue
> >             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, 
> > "NGC-
> > john.barnard-
> >             > broadcom.com" <john.barnard(a)broadcom.com>, 
> > "spdk(a)lists.01.org" < spdk(a)lists.01.
> >             > org>
> >             > Subject: RE: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > CC: List
> >             >  
> >             > Hi Ben,
> >             >  
> >             > Your proposal to interface with the backend to acquire 
> > and release buffers is
> >             > good. You have accurately stated that the challenge is 
> > in developing intuitive
> >             > semantics. And that has been my struggle. To me, there 
> > are two problem
> >             > statements;
> >             >  
> >             > 1.       It is expected that bdev_io pool is sized correctly
> > so
> > the call to
> >             > get bdev_io succeeds. Failure to acquire bdev_io will 
> > result in DEVICE-ERROR.
> >             > The transport is already capable of handling temporary 
> > memory failures by
> >             > moving the request to PENDING-Queue. Hence the proposal 
> > to change the bdev_io
> >             > lifecycle and perhaps connect it with the 
> > spdk_nvmf_request object. Thus all
> >             > buffer needs are addressed at the beginning of I/O request.
> >             > 2.       I/O Data buffers are sourced and managed by the
> > backend. One of the
> >             > challenges I see with your proposed interface is the 
> > lack of details like if
> >             > resource is being acquired for a READ operation or WRITE 
> > operation. The
> >             > handling is quite different in each case. Since 
> > bdev_io->type is overloaded
> >             > the type of I/O operation is lost. I suppose one can 
> > cast the cb_arg
> >             > (nvmf_request) and then proceed. Also, the bdev_io 
> > should be present to
> >             > RELEASE the buffer. Zero copy semantics warrants that 
> > data buffer stays until
> >             > controller-to-host has occurred. In other words, bdev_io 
> > lives till REQUEST
> >             > come to COMPLETE state.
> >             >  
> >             > What are your thoughts in introducing spdk_bdev_init() 
> > and
> > spdk_bdev_fini() as
> >             > an alternative approach to extend the lifecyle of 
> > bdev_io and allow data
> >             > buffer management via bdev fn_table ?
> >             >  
> >             > I hope I’m making sense…
> >             >  
> >             > Thanks,
> >             > Srikanth
> >             >  
> >             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> >             > Sent: Friday, May 11, 2018 12:28 PM
> >             > To: Harris, James R <james.r.harris(a)intel.com>; 
> > Kaligotla, Srikanth <Srikanth.
> >             > Kaligotla(a)netapp.com>
> >             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> > <John.Menegh ini(a)netapp.com
> >             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> > Madhu <Madhu sudan.Pai(a)neta
> >             > pp.com>; NGC-john.barnard-broadcom.com 
> > <john.barnard(a)broadcom.co
> > m>
> >             > Subject: Re: BDEV-IO Lifecycle - Need your input.
> >             >  
> >             > Hi Srikanth,
> >             >  
> >             > Yes - we'll need to introduce some way to acquire and 
> > release buffers from the
> >             > bdev layer earlier on in the state machine that 
> > processes an NVMe-oF request.
> >             > I've had this patch out for review for several months as 
> > a proposal for this
> >             > scenario:
> >             >  
> >             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> >             >  
> >             > It doesn't pass the tests - it's just a proposal for the 
> > interface. 90% of the
> >             > challenge here is in developing intuitive semantics.
> >             >  
> >             > Thanks,
> >             > Ben
> >             >  
> >             > P.S. This is the kind of discussion that would fit 
> > perfectly on the mailing
> >             > list.
> >             >  
> >             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> >             > > Hi Ben, Hi James,
> >             > >  
> >             > > I would like to solicit opinions on the lifecycle for 
> > bdev-io resource
> >             > > object. Attached is the image of RDMA state machine in 
> > its current
> >             > > implementation. When the REQUEST enters the 
> > NEED-BUFFER state, the buffers
> >             > > necessary for carrying out I/O operation are 
> > allocated/acquired from the
> >             > > memory pool. An instance of BDEV-IO comes into 
> > existence after REQUEST
> >             > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared 
> > down as soon the
> >             > > backend returns. From a BDEV perspective, BDEV-IO is 
> > simply a translation
> >             > > unit that facilitates I/O buffers from the backend. 
> > The driver context
> >             > > embedded within bdev_io holds great deal of 
> > information pertaining to the
> >             > > I/O under execution. It assists in error handling, 
> > dereferencing the buffers
> >             > > upon I/O completion and in abort handling. In summary, 
> > the bdev_io stays
> >             > > alive until request has come to COMPLETE state. I’d 
> > like to hear peoples
> >             > > thoughts in introducing the plumbing to acquire 
> > BDEV-IO resource in REQUEST-
> >             > > NEED-BUFFER state and release it in REQUEST-COMPLETE 
> > state.  I will shortly
> >             > > have patch available for review that introduces 
> > spdk_bdev_init and
> >             > > spdk_bdev_fini which in turn invokes the corresponding 
> > bdev fn_table to
> >             > >  initialize/cleanup.  
> >             > >  
> >             > > I wanted to use this email to communicate our intent 
> > and solicit your
> >             > > feedback. We have a working implementation of the 
> > above proposal and prior
> >             > > to pushing it upstream for review would like to hear 
> > your thoughts. These
> >             > > proposed changes to upstream are a result of FC 
> > transport work in
> >             > > collaboration with the Broadcom team who are also 
> > copied to this mail.
> >             > > Myself and John will be at SPDK Dev conference and if 
> > required we can
> >             > > elaborate further on this proposal.
> >             > >  
> >             > > Thanks,
> >             > > Srikanth
> >             > 
> >             >
> >             
> >         
> >         
> >     
> >     
> > 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2018-07-12 21:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-12  5:42 [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.) Harris, James R
  -- strict thread matches above, loose matches on Subject: below --
2018-07-12 21:16 Walker, Benjamin
2018-07-12 21:01 Luse, Paul E
2018-07-12 20:44 Walker, Benjamin
2018-07-12  4:30 Meneghini, John
2018-07-12  1:16 Luse, Paul E
2018-07-11 21:52 Walker, Benjamin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.