All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-05-11 18:27 Kaligotla, Srikanth
  0 siblings, 0 replies; 12+ messages in thread
From: Kaligotla, Srikanth @ 2018-05-11 18:27 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 4423 bytes --]

CC: List

Hi Ben,

Your proposal<https://review.gerrithub.io/#/c/spdk/spdk/+/386166/> to interface with the backend to acquire and release buffers is good. You have accurately stated that the challenge is in developing intuitive semantics. And that has been my struggle. To me, there are two problem statements;


  1.  It is expected that bdev_io pool is sized correctly so the call to get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR. The transport is already capable of handling temporary memory failures by moving the request to PENDING-Queue. Hence the proposal to change the bdev_io lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all buffer needs are addressed at the beginning of I/O request.
  2.  I/O Data buffers are sourced and managed by the backend. One of the challenges I see with your proposed interface is the lack of details like if resource is being acquired for a READ operation or WRITE operation. The handling is quite different in each case. Since bdev_io->type is overloaded the type of I/O operation is lost. I suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the bdev_io should be present to RELEASE the buffer. Zero copy semantics warrants that data buffer stays until controller-to-host has occurred. In other words, bdev_io lives till REQUEST come to COMPLETE state.

What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as an alternative approach to extend the lifecyle of bdev_io and allow data buffer management via bdev fn_table ?

I hope I’m making sense…

Thanks,
Srikanth

From: Walker, Benjamin <benjamin.walker(a)intel.com>
Sent: Friday, May 11, 2018 12:28 PM
To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Yes - we'll need to introduce some way to acquire and release buffers from the bdev layer earlier on in the state machine that processes an NVMe-oF request. I've had this patch out for review for several months as a proposal for this scenario:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

It doesn't pass the tests - it's just a proposal for the interface. 90% of the challenge here is in developing intuitive semantics.

Thanks,
Ben

P.S. This is the kind of discussion that would fit perfectly on the mailing list.

On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
Hi Ben, Hi James,

I would like to solicit opinions on the lifecycle for bdev-io resource object. Attached is the image of RDMA state machine in its current implementation. When the REQUEST enters the NEED-BUFFER state, the buffers necessary for carrying out I/O operation are allocated/acquired from the memory pool. An instance of BDEV-IO comes into existence after REQUEST reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the backend returns. From a BDEV perspective, BDEV-IO is simply a translation unit that facilitates I/O buffers from the backend. The driver context embedded within bdev_io holds great deal of information pertaining to the I/O under execution. It assists in error handling, dereferencing the buffers upon I/O completion and in abort handling. In summary, the bdev_io stays alive until request has come to COMPLETE state. I’d like to hear peoples thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly have patch available for review that introduces spdk_bdev_init and spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to  initialize/cleanup.

I wanted to use this email to communicate our intent and solicit your feedback. We have a working implementation of the above proposal and prior to pushing it upstream for review would like to hear your thoughts. These proposed changes to upstream are a result of FC transport work in collaboration with the Broadcom team who are also copied to this mail. Myself and John will be at SPDK Dev conference and if required we can elaborate further on this proposal.

Thanks,
Srikanth


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 10826 bytes --]

[-- Attachment #3: rdma_sm1.png --]
[-- Type: image/png, Size: 225417 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-09 19:19 Popuri, Sriram
  0 siblings, 0 replies; 12+ messages in thread
From: Popuri, Sriram @ 2018-07-09 19:19 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 23797 bytes --]

I would like to jot down what requirements we are looking at. Wherever there is "====wait" for something, there is a potential context switch.

For a write request, the following is the workflow for zero copy:
1) Host submits Write request
2) Protocol validation of request
4) Get iovs
5) R2T/RDMA_READ
====wait for host====
5) Host transfers data (data_out/RDMA_READ_COMPLETE)
6) Perform Write(spdk_bdev_writev_blocks) to bdev_module
====wait for bdev module to respond====
7) on bdev_module response, send rsp/status to host
8) free req resources

Read path:
1) Host submits Read request
2) Protocol validatation of request
3) Submit read to bdev_module
====wait for data blocks from bdev_module====
4) respond iov (data_in/RDMA_WRITE)
====wait for host====
5) host ack/RDMA_WRITE_COMPLETE
6) free req resources

In future we will have Persistent reservations where there can be read/modify/update cases depending on how the bdev module handles it. So the zero copy apis needs to address future requirements as well.

The above workflows suggest we require a state to track at what step the request is. A state will help to cleanup resources properly when handling aborts, disconnects, namespace destroy workflows etc.
So the requirement is to figure out a way to clean up resources according to the state of the bdev_io in its life cycle and stitch in the zero copy workflows into existing spdk code which assumes buffers are pre-allocated.

I think Srikanth's proposed changes are made in target layer to populate and commit to bdev module. However his suggestion is changing the bdev_io life cycle which I understand from John is not agreeable to others.

With the new zcopy apis also there is no clarity on the bdev_io life cycle. A bdev_io is acquired during zcopy_start and it's not clear how to deal with the bdev_io allocated in read/write paths.

The orchestration of zero copy either be done at transport layer, target layer or bdev module layer.

We don't like the orchestration done from transport layer because:
1) No clear way to deal with bdev_io (acquired from zcopy) and bdev_io (acquired from read/write codepaths).
2) If I understand correctly, by the time a read/write request is submitted to target, the data_out would have happened. So the protocol validation happens after data_out phase. For example an LBA out of range request or invalid SGL length is returned after a data_out is performed. We want to run through the protocol validation before data_out phase.

I feel the bdev module is the ideal place to deal with zero copy buffers. The bdev module knows how to get buffers and when to claim them. All it needs is a proper cleanup and by the time request hits bdev module, bdev_io is acquired so no need to change bdev_io life cycle or introduce a new io type.

I propose the following:
1) Host submits Write request
2) Protocol validation of request
   spdk_nvmf_request_exec
4) submit request to bdev module
   spdk_bdev_writev_blocks(). bdev_io is acquired
5) bdev module determines the buffers needs to be allocated.
   Will perform something like what is prposed in the patch: spdk_bdev_zcopy_start(populate=false)
6) New spdk_nvmf_request_exec_status (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
7) Transport does R2T/RDMA_READ
====wait for host====
8) Host transfers data (data_out/RDMA_READ_COMPLETE)
9) Perform Write again(spdk_nvmf_request_exec/spdk_bdev_writev_blocks) to bdev_module.
====wait for bdev module to respond====
10) on bdev_module response, send rsp/status to host
11) free req resources
    free bdev_io spdk_bdev_free_io.
    Here we need a hook to free resources allocated by bdev module
    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)

Read path:
1) Host submits Read request
2) Protocol validatation of request
3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
====wait for data blocks from bdev_module====
4) respond iov (data_in/RDMA_WRITE)
====wait for host====
5) host ack/RDMA_WRITE_COMPLETE
6) free req resources
   free bdev_io spdk_bdev_free_io.
   Here we need a hook to free resources allocated by bdev module
   spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)

If it makes sense we can work on a patch.

Regards,
~Sriram

-----Original Message-----
From: Walker, Benjamin <benjamin.walker(a)intel.com> 
Sent: Monday, July 9, 2018 11:22 PM
To: Meneghini, John <John.Meneghini(a)netapp.com>; Popuri, Sriram <Sriram.Popuri(a)netapp.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; spdk(a)lists.01.org; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
Subject: Re: BDEV-IO Lifecycle - Need your input.

On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> Hi Jim.
> 
> This is the patch series I believe Ben is proposing to replace 
> Srikanth's:  ht tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> 
> pick a21bbcf2 nvmf: Move data buffer pool to generic layer pick 
> 09586e54 bdev: Add a zero copy I/O path pick 1cc9fe95 bdev: Make 
> malloc bdev use the new zero copy mechanism pick 40a1f62b bdevperf: 
> Use new zcopy API for reads pick eba6f35a bdev: Emulate zero copy 
> support when necessary
> 
> Is this correct?


My patch series doesn't entirely replace Srikanth's set of patches. There are two separate things necessary to implement zero copy. First is the infrastructure within the bdev layer to make requests of a bdev module to obtain  or release a suitable buffer for a zero copy operation. Second is actually making the NVMe-oF target use that mechanism, which includes extending the bdev_io lifetime. My patch series only addresses the first part.

Thanks,
Ben



> 
> /John
> 
> On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> 
>     Hi Sriram:
>     
>     Ben has some later patches in this series with some examples:
>     
>     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements 
> this zero copy API in the simple ‘malloc’ bdev module
>     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the 
> zero copy API in the SPDK bdevperf utility
>     
>     -Jim
>     
>     
>     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
>     
>         Sorry was not focusing on this change. Just give me a day or 
> two to get back.
>         From a quick glance I didn't understand how zcopy start/end 
> fits into a read/write life cycle. Is there an example on how zcopy 
> start/end is consumed or someone can do give me a quick dump on how 
> its envisioned to be used.
>         
>         Regards,
>         ~Sriram
>         
>         -----Original Message-----
>         From: Meneghini, John 
>         Sent: Friday, July 6, 2018 10:16 PM
>         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, 
> Daniel <dan iel.verkamp(a)intel.com>; Harris, James R 
> <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
>         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; 
> Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu 
> <Madhusudan.Pai(a)netapp.com>; NGC- john.barnard-broadcom.com <john.barnard(a)broadcom.com>; Popuri, Sriram <Sriram.
> Popuri(a)netapp.com>
>         Subject: Re: BDEV-IO Lifecycle - Need your input.
>         
>         I'm adding Sriram to this thread.  Sriram is another NetApp 
> engineer who is working on this stuff internally.
>         
>         /John
>         
>         On 7/6/18, 12:41 PM, "Walker, Benjamin" 
> <benjamin.walker(a)intel.com>
> wrote:
>         
>             Hi Srikanth,
>             
>             I wanted to check in to see if you had any additional 
> feedback on the zero copy
>             operations in the bdev layer here:
>             
>             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>             
>             This does not address extending the lifetime of the 
> bdev_io in the NVMe-oF
>             target, but I think this is looking like the right 
> mechanism for the bdev layer.
>             
>             Thanks,
>             Ben
>             
>             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
>             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool 
> sizing along with
>             > spdk_bdev_queue_io_wait() meets your needs then 
> regarding the spdk_bdev_io
>             > memory.
>             >  
>             > Regarding zcopy_start/zcopy_end – it looks like you’ve 
> already added a bunch
>             > of comments to Ben’s patch on GerritHub.  For now I’d 
> say let’s continue our
>             > discussion there.  I’ve responded to a couple of similar 
> questions there and
>             > I’m sure Ben will have more replies tomorrow.
>             >  
>             > -Jim
>             >  
>             >  
>             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
>             > Date: Thursday, June 21, 2018 at 5:01 PM
>             > To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin"
> <benjamin.walk
>             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
>             > Cc: "raju.gottumukkala(a)broadcom.com" 
> <raju.gottumukkala(a)broadcom .com>,
>             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> "Rodriguez, Edwin" <Ed.Rodrigue
>             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, 
> "NGC-
> john.barnard-
>             > broadcom.com" <john.barnard(a)broadcom.com>, 
> "spdk(a)lists.01.org" < spdk(a)lists.01.
>             > org>
>             > Subject: RE: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hi Jim,
>             >  
>             > I wish I joined the community meeting, I also missed the
>             > spdk_bdev_queue_io_wait().
>             >  
>             > So, there are 2 issues I am intending to solve,
>             >  
>             > 1.       We want to ensure that an instance of bdev-io is
> acquired prior to or
>             > along with I/O data buffer. This allows for better error 
> handling when bdev_io
>             > pool is exhausted. Yes, we can solve it via sizing the 
> pool right. The changes
>             > queue-io-wait also addresses the problem.
>             > 2.       Most importantly, by extending the life of bdev-io
> (Same life span as
>             > nvmf_request structure), abort use case and other use 
> cases that involve
>             > cleaning up of IO data buffer are accomplished 
> effectively. Let me elaborate;
>             > bdev-io is the fabric that connects the nvme command and 
> operation with the
>             > backend. The driver context and data buffer context is 
> stored in bdev-io. The
>             > freeing up of bdev_io resource is pushed to the end so 
> the I/O cleanup can
>             > happen after controller has transmitted the data to the 
> host. In absence of
>             > bdev_io, we will have end up adding more and more void 
> context in request
>             > structure. Hence the push to extend the life of bdev_io.
>             >  
>             > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; 
> We can make that work
>             > as long as the bdev-io allocated/acquired stays till the end.
> One of the
>             > challenges I see with that patch is defining a callback 
> for the I/O
>             > submission. For instance, zopy_start will allocate 
> bdev_io and submits the I/O
>             > to the bdev device. Underlying implementation can be 
> synchronous or can be
>             > asynchronous. The callback for this submission should 
> check to see if it is a
>             > BUFFER-READY or PENDING and accordingly relay it back to 
> transport. Next phase
>             > is actual I/O submission. Let’s say, it reuses the 
> bdev-io obtained in ZCOPY-
>             > START, now the callback should determine if it is a 
> success or failure.
>             > Finally when ZCOPY-END is invoked the supplied bdev_io 
> will have all necessary
>             > data to release the WRITE buffer or unlock the read 
> buffer based on the
>             > operation performed.
>             >  
>             > I hope I’m making sense. I guess, my effort is to extend 
> the lifespan of bdev-
>             > io and let it continue to host the driver context and 
> buffer context populated
>             > during the BEGIN phase.
>             >  
>             > Regards,
>             > Srikanth
>             >  
>             > From: Harris, James R <james.r.harris(a)intel.com> 
>             > Sent: Thursday, June 21, 2018 2:21 PM
>             > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; 
> Walker, Benjamin <ben
>             > jamin.walker(a)intel.com>; Verkamp, Daniel 
> <daniel.verkamp(a)intel.c
> om>
>             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> <John.Menegh ini(a)netapp.com
>             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> Madhu <Madhu sudan.Pai(a)neta
>             > pp.com>; NGC-john.barnard-broadcom.com 
> <john.barnard(a)broadcom.co
> m>; spdk(a)lists
>             > .01.org
>             > Subject: Re: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hi Srikanth,
>             >  
>             > Following up on this thread and the discussion in 
> yesterday’s community
>             > meeting.
>             >  
>             > The recent spdk_bdev_queue_io_wait() changes allow an 
> SPDK application to work
>             > better in general when the spdk_bdev_io buffer pool is 
> exhausted.  We still
>             > need to make changes to the NVMe-oF target to use this 
> new API but that should
>             > be done soon-ish.
>             >  
>             > With that in place, the spdk_bdev_io buffer pool itself 
> can be configured up
>             > or down when the application starts.  Currently the 
> default is 64K
>             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 
> 216 plus the per-IO
>             > context size allocated for the bdev module.  This can be 
> up to
> 192 bytes
>             > (virtio bdev module) but is likely much smaller for you 
> depending on the
>             > context size for your ontap bdev module.
>             >  
>             > Let’s assume your per-IO context size is 64 bytes.  64K 
> x (192 +
> 64) = 16MB.
>             >  
>             > I’m not sure how many spdk_bdev_io you need in flight at 
> any given time.  64K
>             > seems like a lot but I’d be curious to hear your 
> thoughts on this.  If this is
>             > the right number, then worst case, there would be about 
> 16MB of DRAM that
>             > would sit unused if the NVMe-oF target in your system 
> was not active.  Is that
>             > too burdensome for your application?
>             >  
>             > Thanks,
>             >  
>             > -Jim
>             >  
>             >  
>             >  
>             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
>             > Date: Wednesday, June 20, 2018 at 12:47 PM
>             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> James Harris <james.r.harr
>             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
>             > Cc: "raju.gottumukkala(a)broadcom.com" 
> <raju.gottumukkala(a)broadcom .com>,
>             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> "Rodriguez, Edwin" <Ed.Rodrigue
>             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, 
> "NGC-
> john.barnard-
>             > broadcom.com" <john.barnard(a)broadcom.com>, 
> "spdk(a)lists.01.org" < spdk(a)lists.01.
>             > org>
>             > Subject: RE: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hello,
>             >  
>             > First revision of changes to extend the lifecycle of 
> bdev_io are available for
>             > review. I would like to solicit your input on the 
> proposed API/Code flow
>             > changes.
>             >  
>             > https://review.gerrithub.io/c/spdk/spdk/+/415860
>             >  
>             > Thanks,
>             > Srikanth
>             >  
>             >  
>             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
>             > Date: Friday, May 11, 2018 at 2:27 PM
>             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, 
> "Harris, James R" <james.r
>             > .harris(a)intel.com>
>             > Cc: "raju.gottumukkala(a)broadcom.com" 
> <raju.gottumukkala(a)broadcom .com>,
>             > "Meneghini, John" <John.Meneghini(a)netapp.com>, 
> "Rodriguez, Edwin" <Ed.Rodrigue
>             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, 
> "NGC-
> john.barnard-
>             > broadcom.com" <john.barnard(a)broadcom.com>, 
> "spdk(a)lists.01.org" < spdk(a)lists.01.
>             > org>
>             > Subject: RE: BDEV-IO Lifecycle - Need your input.
>             >  
>             > CC: List
>             >  
>             > Hi Ben,
>             >  
>             > Your proposal to interface with the backend to acquire 
> and release buffers is
>             > good. You have accurately stated that the challenge is 
> in developing intuitive
>             > semantics. And that has been my struggle. To me, there 
> are two problem
>             > statements;
>             >  
>             > 1.       It is expected that bdev_io pool is sized correctly so
> the call to
>             > get bdev_io succeeds. Failure to acquire bdev_io will 
> result in DEVICE-ERROR.
>             > The transport is already capable of handling temporary 
> memory failures by
>             > moving the request to PENDING-Queue. Hence the proposal 
> to change the bdev_io
>             > lifecycle and perhaps connect it with the 
> spdk_nvmf_request object. Thus all
>             > buffer needs are addressed at the beginning of I/O request.
>             > 2.       I/O Data buffers are sourced and managed by the
> backend. One of the
>             > challenges I see with your proposed interface is the 
> lack of details like if
>             > resource is being acquired for a READ operation or WRITE 
> operation. The
>             > handling is quite different in each case. Since 
> bdev_io->type is overloaded
>             > the type of I/O operation is lost. I suppose one can 
> cast the cb_arg
>             > (nvmf_request) and then proceed. Also, the bdev_io 
> should be present to
>             > RELEASE the buffer. Zero copy semantics warrants that 
> data buffer stays until
>             > controller-to-host has occurred. In other words, bdev_io 
> lives till REQUEST
>             > come to COMPLETE state.
>             >  
>             > What are your thoughts in introducing spdk_bdev_init() 
> and
> spdk_bdev_fini() as
>             > an alternative approach to extend the lifecyle of 
> bdev_io and allow data
>             > buffer management via bdev fn_table ?
>             >  
>             > I hope I’m making sense…
>             >  
>             > Thanks,
>             > Srikanth
>             >  
>             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
>             > Sent: Friday, May 11, 2018 12:28 PM
>             > To: Harris, James R <james.r.harris(a)intel.com>; 
> Kaligotla, Srikanth <Srikanth.
>             > Kaligotla(a)netapp.com>
>             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John 
> <John.Menegh ini(a)netapp.com
>             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, 
> Madhu <Madhu sudan.Pai(a)neta
>             > pp.com>; NGC-john.barnard-broadcom.com 
> <john.barnard(a)broadcom.co
> m>
>             > Subject: Re: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hi Srikanth,
>             >  
>             > Yes - we'll need to introduce some way to acquire and 
> release buffers from the
>             > bdev layer earlier on in the state machine that 
> processes an NVMe-oF request.
>             > I've had this patch out for review for several months as 
> a proposal for this
>             > scenario:
>             >  
>             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>             >  
>             > It doesn't pass the tests - it's just a proposal for the 
> interface. 90% of the
>             > challenge here is in developing intuitive semantics.
>             >  
>             > Thanks,
>             > Ben
>             >  
>             > P.S. This is the kind of discussion that would fit 
> perfectly on the mailing
>             > list.
>             >  
>             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
>             > > Hi Ben, Hi James,
>             > >  
>             > > I would like to solicit opinions on the lifecycle for 
> bdev-io resource
>             > > object. Attached is the image of RDMA state machine in 
> its current
>             > > implementation. When the REQUEST enters the 
> NEED-BUFFER state, the buffers
>             > > necessary for carrying out I/O operation are 
> allocated/acquired from the
>             > > memory pool. An instance of BDEV-IO comes into 
> existence after REQUEST
>             > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared 
> down as soon the
>             > > backend returns. From a BDEV perspective, BDEV-IO is 
> simply a translation
>             > > unit that facilitates I/O buffers from the backend. 
> The driver context
>             > > embedded within bdev_io holds great deal of 
> information pertaining to the
>             > > I/O under execution. It assists in error handling, 
> dereferencing the buffers
>             > > upon I/O completion and in abort handling. In summary, 
> the bdev_io stays
>             > > alive until request has come to COMPLETE state. I’d 
> like to hear peoples
>             > > thoughts in introducing the plumbing to acquire 
> BDEV-IO resource in REQUEST-
>             > > NEED-BUFFER state and release it in REQUEST-COMPLETE 
> state.  I will shortly
>             > > have patch available for review that introduces 
> spdk_bdev_init and
>             > > spdk_bdev_fini which in turn invokes the corresponding 
> bdev fn_table to
>             > >  initialize/cleanup.  
>             > >  
>             > > I wanted to use this email to communicate our intent 
> and solicit your
>             > > feedback. We have a working implementation of the 
> above proposal and prior
>             > > to pushing it upstream for review would like to hear 
> your thoughts. These
>             > > proposed changes to upstream are a result of FC 
> transport work in
>             > > collaboration with the Broadcom team who are also 
> copied to this mail.
>             > > Myself and John will be at SPDK Dev conference and if 
> required we can
>             > > elaborate further on this proposal.
>             > >  
>             > > Thanks,
>             > > Srikanth
>             > 
>             >
>             
>         
>         
>     
>     
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-09 17:52 Walker, Benjamin
  0 siblings, 0 replies; 12+ messages in thread
From: Walker, Benjamin @ 2018-07-09 17:52 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 18824 bytes --]

On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> Hi Jim.
> 
> This is the patch series I believe Ben is proposing to replace Srikanth's:  ht
> tps://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> 
> pick a21bbcf2 nvmf: Move data buffer pool to generic layer
> pick 09586e54 bdev: Add a zero copy I/O path
> pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
> pick 40a1f62b bdevperf: Use new zcopy API for reads
> pick eba6f35a bdev: Emulate zero copy support when necessary
> 
> Is this correct?


My patch series doesn't entirely replace Srikanth's set of patches. There are
two separate things necessary to implement zero copy. First is the
infrastructure within the bdev layer to make requests of a bdev module to obtain
 or release a suitable buffer for a zero copy operation. Second is actually
making the NVMe-oF target use that mechanism, which includes extending the
bdev_io lifetime. My patch series only addresses the first part.

Thanks,
Ben



> 
> /John
> 
> On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:
> 
>     Hi Sriram:
>     
>     Ben has some later patches in this series with some examples:
>     
>     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements this zero
> copy API in the simple ‘malloc’ bdev module
>     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the zero copy
> API in the SPDK bdevperf utility
>     
>     -Jim
>     
>     
>     On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
>     
>         Sorry was not focusing on this change. Just give me a day or two to
> get back.
>         From a quick glance I didn't understand how zcopy start/end fits into
> a read/write life cycle. Is there an example on how zcopy start/end is
> consumed or someone can do give me a quick dump on how its envisioned to be
> used.
>         
>         Regards,
>         ~Sriram
>         
>         -----Original Message-----
>         From: Meneghini, John 
>         Sent: Friday, July 6, 2018 10:16 PM
>         To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, Daniel <dan
> iel.verkamp(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; Kaligotla,
> Srikanth <Srikanth.Kaligotla(a)netapp.com>
>         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez,
> Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-
> john.barnard-broadcom.com <john.barnard(a)broadcom.com>; Popuri, Sriram <Sriram.
> Popuri(a)netapp.com>
>         Subject: Re: BDEV-IO Lifecycle - Need your input.
>         
>         I'm adding Sriram to this thread.  Sriram is another NetApp engineer
> who is working on this stuff internally.
>         
>         /John
>         
>         On 7/6/18, 12:41 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com>
> wrote:
>         
>             Hi Srikanth,
>             
>             I wanted to check in to see if you had any additional feedback on
> the zero copy
>             operations in the bdev layer here:
>             
>             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>             
>             This does not address extending the lifetime of the bdev_io in the
> NVMe-oF
>             target, but I think this is looking like the right mechanism for
> the bdev layer.
>             
>             Thanks,
>             Ben
>             
>             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
>             > Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along
> with
>             > spdk_bdev_queue_io_wait() meets your needs then regarding the
> spdk_bdev_io
>             > memory.
>             >  
>             > Regarding zcopy_start/zcopy_end – it looks like you’ve already
> added a bunch
>             > of comments to Ben’s patch on GerritHub.  For now I’d say let’s
> continue our
>             > discussion there.  I’ve responded to a couple of similar
> questions there and
>             > I’m sure Ben will have more replies tomorrow.
>             >  
>             > -Jim
>             >  
>             >  
>             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
>             > Date: Thursday, June 21, 2018 at 5:01 PM
>             > To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin"
> <benjamin.walk
>             > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
>             > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom
> .com>,
>             > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez,
> Edwin" <Ed.Rodrigue
>             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-
> john.barnard-
>             > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <
> spdk(a)lists.01.
>             > org>
>             > Subject: RE: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hi Jim,
>             >  
>             > I wish I joined the community meeting, I also missed the
>             > spdk_bdev_queue_io_wait().
>             >  
>             > So, there are 2 issues I am intending to solve,
>             >  
>             > 1.       We want to ensure that an instance of bdev-io is
> acquired prior to or
>             > along with I/O data buffer. This allows for better error
> handling when bdev_io
>             > pool is exhausted. Yes, we can solve it via sizing the pool
> right. The changes
>             > queue-io-wait also addresses the problem.
>             > 2.       Most importantly, by extending the life of bdev-io
> (Same life span as
>             > nvmf_request structure), abort use case and other use cases that
> involve
>             > cleaning up of IO data buffer are accomplished effectively. Let
> me elaborate;
>             > bdev-io is the fabric that connects the nvme command and
> operation with the
>             > backend. The driver context and data buffer context is stored in
> bdev-io. The
>             > freeing up of bdev_io resource is pushed to the end so the I/O
> cleanup can
>             > happen after controller has transmitted the data to the host. In
> absence of
>             > bdev_io, we will have end up adding more and more void context
> in request
>             > structure. Hence the push to extend the life of bdev_io.
>             >  
>             > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can
> make that work
>             > as long as the bdev-io allocated/acquired stays till the end.
> One of the
>             > challenges I see with that patch is defining a callback for the
> I/O
>             > submission. For instance, zopy_start will allocate bdev_io and
> submits the I/O
>             > to the bdev device. Underlying implementation can be synchronous
> or can be
>             > asynchronous. The callback for this submission should check to
> see if it is a
>             > BUFFER-READY or PENDING and accordingly relay it back to
> transport. Next phase
>             > is actual I/O submission. Let’s say, it reuses the bdev-io
> obtained in ZCOPY-
>             > START, now the callback should determine if it is a success or
> failure.
>             > Finally when ZCOPY-END is invoked the supplied bdev_io will have
> all necessary
>             > data to release the WRITE buffer or unlock the read buffer based
> on the
>             > operation performed.
>             >  
>             > I hope I’m making sense. I guess, my effort is to extend the
> lifespan of bdev-
>             > io and let it continue to host the driver context and buffer
> context populated
>             > during the BEGIN phase.
>             >  
>             > Regards,
>             > Srikanth
>             >  
>             > From: Harris, James R <james.r.harris(a)intel.com> 
>             > Sent: Thursday, June 21, 2018 2:21 PM
>             > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker,
> Benjamin <ben
>             > jamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.c
> om>
>             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Menegh
> ini(a)netapp.com
>             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhu
> sudan.Pai(a)neta
>             > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.co
> m>; spdk(a)lists
>             > .01.org
>             > Subject: Re: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hi Srikanth,
>             >  
>             > Following up on this thread and the discussion in yesterday’s
> community
>             > meeting.
>             >  
>             > The recent spdk_bdev_queue_io_wait() changes allow an SPDK
> application to work
>             > better in general when the spdk_bdev_io buffer pool is
> exhausted.  We still
>             > need to make changes to the NVMe-oF target to use this new API
> but that should
>             > be done soon-ish.
>             >  
>             > With that in place, the spdk_bdev_io buffer pool itself can be
> configured up
>             > or down when the application starts.  Currently the default is
> 64K
>             > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus
> the per-IO
>             > context size allocated for the bdev module.  This can be up to
> 192 bytes
>             > (virtio bdev module) but is likely much smaller for you
> depending on the
>             > context size for your ontap bdev module.
>             >  
>             > Let’s assume your per-IO context size is 64 bytes.  64K x (192 +
> 64) = 16MB.
>             >  
>             > I’m not sure how many spdk_bdev_io you need in flight at any
> given time.  64K
>             > seems like a lot but I’d be curious to hear your thoughts on
> this.  If this is
>             > the right number, then worst case, there would be about 16MB of
> DRAM that
>             > would sit unused if the NVMe-oF target in your system was not
> active.  Is that
>             > too burdensome for your application?
>             >  
>             > Thanks,
>             >  
>             > -Jim
>             >  
>             >  
>             >  
>             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
>             > Date: Wednesday, June 20, 2018 at 12:47 PM
>             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris
> <james.r.harr
>             > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
>             > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom
> .com>,
>             > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez,
> Edwin" <Ed.Rodrigue
>             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-
> john.barnard-
>             > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <
> spdk(a)lists.01.
>             > org>
>             > Subject: RE: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hello,
>             >  
>             > First revision of changes to extend the lifecycle of bdev_io are
> available for
>             > review. I would like to solicit your input on the proposed
> API/Code flow
>             > changes.
>             >  
>             > https://review.gerrithub.io/c/spdk/spdk/+/415860
>             >  
>             > Thanks,
>             > Srikanth
>             >  
>             >  
>             > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
>             > Date: Friday, May 11, 2018 at 2:27 PM
>             > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, "Harris,
> James R" <james.r
>             > .harris(a)intel.com>
>             > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom
> .com>,
>             > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez,
> Edwin" <Ed.Rodrigue
>             > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-
> john.barnard-
>             > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <
> spdk(a)lists.01.
>             > org>
>             > Subject: RE: BDEV-IO Lifecycle - Need your input.
>             >  
>             > CC: List
>             >  
>             > Hi Ben,
>             >  
>             > Your proposal to interface with the backend to acquire and
> release buffers is
>             > good. You have accurately stated that the challenge is in
> developing intuitive
>             > semantics. And that has been my struggle. To me, there are two
> problem
>             > statements;
>             >  
>             > 1.       It is expected that bdev_io pool is sized correctly so
> the call to
>             > get bdev_io succeeds. Failure to acquire bdev_io will result in
> DEVICE-ERROR.
>             > The transport is already capable of handling temporary memory
> failures by
>             > moving the request to PENDING-Queue. Hence the proposal to
> change the bdev_io
>             > lifecycle and perhaps connect it with the spdk_nvmf_request
> object. Thus all
>             > buffer needs are addressed at the beginning of I/O request.
>             > 2.       I/O Data buffers are sourced and managed by the
> backend. One of the
>             > challenges I see with your proposed interface is the lack of
> details like if
>             > resource is being acquired for a READ operation or WRITE
> operation. The
>             > handling is quite different in each case. Since bdev_io->type is
> overloaded
>             > the type of I/O operation is lost. I suppose one can cast the
> cb_arg
>             > (nvmf_request) and then proceed. Also, the bdev_io should be
> present to
>             > RELEASE the buffer. Zero copy semantics warrants that data
> buffer stays until
>             > controller-to-host has occurred. In other words, bdev_io lives
> till REQUEST
>             > come to COMPLETE state.
>             >  
>             > What are your thoughts in introducing spdk_bdev_init() and
> spdk_bdev_fini() as
>             > an alternative approach to extend the lifecyle of bdev_io and
> allow data
>             > buffer management via bdev fn_table ?
>             >  
>             > I hope I’m making sense…
>             >  
>             > Thanks,
>             > Srikanth
>             >  
>             > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
>             > Sent: Friday, May 11, 2018 12:28 PM
>             > To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla,
> Srikanth <Srikanth.
>             > Kaligotla(a)netapp.com>
>             > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Menegh
> ini(a)netapp.com
>             > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhu
> sudan.Pai(a)neta
>             > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.co
> m>
>             > Subject: Re: BDEV-IO Lifecycle - Need your input.
>             >  
>             > Hi Srikanth,
>             >  
>             > Yes - we'll need to introduce some way to acquire and release
> buffers from the
>             > bdev layer earlier on in the state machine that processes an
> NVMe-oF request.
>             > I've had this patch out for review for several months as a
> proposal for this
>             > scenario:
>             >  
>             > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>             >  
>             > It doesn't pass the tests - it's just a proposal for the
> interface. 90% of the
>             > challenge here is in developing intuitive semantics.
>             >  
>             > Thanks,
>             > Ben
>             >  
>             > P.S. This is the kind of discussion that would fit perfectly on
> the mailing
>             > list.
>             >  
>             > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
>             > > Hi Ben, Hi James,
>             > >  
>             > > I would like to solicit opinions on the lifecycle for bdev-io
> resource
>             > > object. Attached is the image of RDMA state machine in its
> current
>             > > implementation. When the REQUEST enters the NEED-BUFFER state,
> the buffers
>             > > necessary for carrying out I/O operation are
> allocated/acquired from the
>             > > memory pool. An instance of BDEV-IO comes into existence after
> REQUEST
>             > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as
> soon the
>             > > backend returns. From a BDEV perspective, BDEV-IO is simply a
> translation
>             > > unit that facilitates I/O buffers from the backend. The driver
> context
>             > > embedded within bdev_io holds great deal of information
> pertaining to the
>             > > I/O under execution. It assists in error handling,
> dereferencing the buffers
>             > > upon I/O completion and in abort handling. In summary, the
> bdev_io stays
>             > > alive until request has come to COMPLETE state. I’d like to
> hear peoples
>             > > thoughts in introducing the plumbing to acquire BDEV-IO
> resource in REQUEST-
>             > > NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I
> will shortly
>             > > have patch available for review that introduces spdk_bdev_init
> and
>             > > spdk_bdev_fini which in turn invokes the corresponding bdev
> fn_table to
>             > >  initialize/cleanup.  
>             > >  
>             > > I wanted to use this email to communicate our intent and
> solicit your
>             > > feedback. We have a working implementation of the above
> proposal and prior
>             > > to pushing it upstream for review would like to hear your
> thoughts. These
>             > > proposed changes to upstream are a result of FC transport work
> in
>             > > collaboration with the Broadcom team who are also copied to
> this mail.
>             > > Myself and John will be at SPDK Dev conference and if required
> we can
>             > > elaborate further on this proposal.
>             > >  
>             > > Thanks,
>             > > Srikanth
>             > 
>             >  
>             
>         
>         
>     
>     
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-09 16:18 Meneghini, John
  0 siblings, 0 replies; 12+ messages in thread
From: Meneghini, John @ 2018-07-09 16:18 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 17336 bytes --]

Hi Jim.

This is the patch series I believe Ben is proposing to replace Srikanth's:  https://review.gerrithub.io/#/c/spdk/spdk/+/415860/

pick a21bbcf2 nvmf: Move data buffer pool to generic layer
pick 09586e54 bdev: Add a zero copy I/O path
pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
pick 40a1f62b bdevperf: Use new zcopy API for reads
pick eba6f35a bdev: Emulate zero copy support when necessary

# Rebase 09e3f4e3..eba6f35a onto 09e3f4e3 (5 commands)

Is this correct?

/John

On 7/9/18, 11:31 AM, "Harris, James R" <james.r.harris(a)intel.com> wrote:

    Hi Sriram:
    
    Ben has some later patches in this series with some examples:
    
    https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements this zero copy API in the simple ‘malloc’ bdev module
    https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the zero copy API in the SPDK bdevperf utility
    
    -Jim
    
    
    On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:
    
        Sorry was not focusing on this change. Just give me a day or two to get back.
        From a quick glance I didn't understand how zcopy start/end fits into a read/write life cycle. Is there an example on how zcopy start/end is consumed or someone can do give me a quick dump on how its envisioned to be used.
        
        Regards,
        ~Sriram
        
        -----Original Message-----
        From: Meneghini, John 
        Sent: Friday, July 6, 2018 10:16 PM
        To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
        Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; Popuri, Sriram <Sriram.Popuri(a)netapp.com>
        Subject: Re: BDEV-IO Lifecycle - Need your input.
        
        I'm adding Sriram to this thread.  Sriram is another NetApp engineer who is working on this stuff internally.
        
        /John
        
        On 7/6/18, 12:41 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
        
            Hi Srikanth,
            
            I wanted to check in to see if you had any additional feedback on the zero copy
            operations in the bdev layer here:
            
            https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
            
            This does not address extending the lifetime of the bdev_io in the NVMe-oF
            target, but I think this is looking like the right mechanism for the bdev layer.
            
            Thanks,
            Ben
            
            On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
            > Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along with
            > spdk_bdev_queue_io_wait() meets your needs then regarding the spdk_bdev_io
            > memory.
            >  
            > Regarding zcopy_start/zcopy_end – it looks like you’ve already added a bunch
            > of comments to Ben’s patch on GerritHub.  For now I’d say let’s continue our
            > discussion there.  I’ve responded to a couple of similar questions there and
            > I’m sure Ben will have more replies tomorrow.
            >  
            > -Jim
            >  
            >  
            > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
            > Date: Thursday, June 21, 2018 at 5:01 PM
            > To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin" <benjamin.walk
            > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
            > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
            > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
            > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
            > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
            > org>
            > Subject: RE: BDEV-IO Lifecycle - Need your input.
            >  
            > Hi Jim,
            >  
            > I wish I joined the community meeting, I also missed the
            > spdk_bdev_queue_io_wait().
            >  
            > So, there are 2 issues I am intending to solve,
            >  
            > 1.       We want to ensure that an instance of bdev-io is acquired prior to or
            > along with I/O data buffer. This allows for better error handling when bdev_io
            > pool is exhausted. Yes, we can solve it via sizing the pool right. The changes
            > queue-io-wait also addresses the problem.
            > 2.       Most importantly, by extending the life of bdev-io (Same life span as
            > nvmf_request structure), abort use case and other use cases that involve
            > cleaning up of IO data buffer are accomplished effectively. Let me elaborate;
            > bdev-io is the fabric that connects the nvme command and operation with the
            > backend. The driver context and data buffer context is stored in bdev-io. The
            > freeing up of bdev_io resource is pushed to the end so the I/O cleanup can
            > happen after controller has transmitted the data to the host. In absence of
            > bdev_io, we will have end up adding more and more void context in request
            > structure. Hence the push to extend the life of bdev_io.
            >  
            > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work
            > as long as the bdev-io allocated/acquired stays till the end. One of the
            > challenges I see with that patch is defining a callback for the I/O
            > submission. For instance, zopy_start will allocate bdev_io and submits the I/O
            > to the bdev device. Underlying implementation can be synchronous or can be
            > asynchronous. The callback for this submission should check to see if it is a
            > BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase
            > is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-
            > START, now the callback should determine if it is a success or failure.
            > Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary
            > data to release the WRITE buffer or unlock the read buffer based on the
            > operation performed.
            >  
            > I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-
            > io and let it continue to host the driver context and buffer context populated
            > during the BEGIN phase.
            >  
            > Regards,
            > Srikanth
            >  
            > From: Harris, James R <james.r.harris(a)intel.com> 
            > Sent: Thursday, June 21, 2018 2:21 PM
            > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
            > jamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
            > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
            > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
            > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists
            > .01.org
            > Subject: Re: BDEV-IO Lifecycle - Need your input.
            >  
            > Hi Srikanth,
            >  
            > Following up on this thread and the discussion in yesterday’s community
            > meeting.
            >  
            > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work
            > better in general when the spdk_bdev_io buffer pool is exhausted.  We still
            > need to make changes to the NVMe-oF target to use this new API but that should
            > be done soon-ish.
            >  
            > With that in place, the spdk_bdev_io buffer pool itself can be configured up
            > or down when the application starts.  Currently the default is 64K
            > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO
            > context size allocated for the bdev module.  This can be up to 192 bytes
            > (virtio bdev module) but is likely much smaller for you depending on the
            > context size for your ontap bdev module.
            >  
            > Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.
            >  
            > I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K
            > seems like a lot but I’d be curious to hear your thoughts on this.  If this is
            > the right number, then worst case, there would be about 16MB of DRAM that
            > would sit unused if the NVMe-oF target in your system was not active.  Is that
            > too burdensome for your application?
            >  
            > Thanks,
            >  
            > -Jim
            >  
            >  
            >  
            > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
            > Date: Wednesday, June 20, 2018 at 12:47 PM
            > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris <james.r.harr
            > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
            > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
            > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
            > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
            > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
            > org>
            > Subject: RE: BDEV-IO Lifecycle - Need your input.
            >  
            > Hello,
            >  
            > First revision of changes to extend the lifecycle of bdev_io are available for
            > review. I would like to solicit your input on the proposed API/Code flow
            > changes.
            >  
            > https://review.gerrithub.io/c/spdk/spdk/+/415860
            >  
            > Thanks,
            > Srikanth
            >  
            >  
            > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
            > Date: Friday, May 11, 2018 at 2:27 PM
            > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, "Harris, James R" <james.r
            > .harris(a)intel.com>
            > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
            > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
            > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
            > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
            > org>
            > Subject: RE: BDEV-IO Lifecycle - Need your input.
            >  
            > CC: List
            >  
            > Hi Ben,
            >  
            > Your proposal to interface with the backend to acquire and release buffers is
            > good. You have accurately stated that the challenge is in developing intuitive
            > semantics. And that has been my struggle. To me, there are two problem
            > statements;
            >  
            > 1.       It is expected that bdev_io pool is sized correctly so the call to
            > get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR.
            > The transport is already capable of handling temporary memory failures by
            > moving the request to PENDING-Queue. Hence the proposal to change the bdev_io
            > lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all
            > buffer needs are addressed at the beginning of I/O request.
            > 2.       I/O Data buffers are sourced and managed by the backend. One of the
            > challenges I see with your proposed interface is the lack of details like if
            > resource is being acquired for a READ operation or WRITE operation. The
            > handling is quite different in each case. Since bdev_io->type is overloaded
            > the type of I/O operation is lost. I suppose one can cast the cb_arg
            > (nvmf_request) and then proceed. Also, the bdev_io should be present to
            > RELEASE the buffer. Zero copy semantics warrants that data buffer stays until
            > controller-to-host has occurred. In other words, bdev_io lives till REQUEST
            > come to COMPLETE state.
            >  
            > What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as
            > an alternative approach to extend the lifecyle of bdev_io and allow data
            > buffer management via bdev fn_table ?
            >  
            > I hope I’m making sense…
            >  
            > Thanks,
            > Srikanth
            >  
            > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
            > Sent: Friday, May 11, 2018 12:28 PM
            > To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.
            > Kaligotla(a)netapp.com>
            > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
            > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
            > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
            > Subject: Re: BDEV-IO Lifecycle - Need your input.
            >  
            > Hi Srikanth,
            >  
            > Yes - we'll need to introduce some way to acquire and release buffers from the
            > bdev layer earlier on in the state machine that processes an NVMe-oF request.
            > I've had this patch out for review for several months as a proposal for this
            > scenario:
            >  
            > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
            >  
            > It doesn't pass the tests - it's just a proposal for the interface. 90% of the
            > challenge here is in developing intuitive semantics.
            >  
            > Thanks,
            > Ben
            >  
            > P.S. This is the kind of discussion that would fit perfectly on the mailing
            > list.
            >  
            > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
            > > Hi Ben, Hi James,
            > >  
            > > I would like to solicit opinions on the lifecycle for bdev-io resource
            > > object. Attached is the image of RDMA state machine in its current
            > > implementation. When the REQUEST enters the NEED-BUFFER state, the buffers
            > > necessary for carrying out I/O operation are allocated/acquired from the
            > > memory pool. An instance of BDEV-IO comes into existence after REQUEST
            > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the
            > > backend returns. From a BDEV perspective, BDEV-IO is simply a translation
            > > unit that facilitates I/O buffers from the backend. The driver context
            > > embedded within bdev_io holds great deal of information pertaining to the
            > > I/O under execution. It assists in error handling, dereferencing the buffers
            > > upon I/O completion and in abort handling. In summary, the bdev_io stays
            > > alive until request has come to COMPLETE state. I’d like to hear peoples
            > > thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-
            > > NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly
            > > have patch available for review that introduces spdk_bdev_init and
            > > spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to
            > >  initialize/cleanup.  
            > >  
            > > I wanted to use this email to communicate our intent and solicit your
            > > feedback. We have a working implementation of the above proposal and prior
            > > to pushing it upstream for review would like to hear your thoughts. These
            > > proposed changes to upstream are a result of FC transport work in
            > > collaboration with the Broadcom team who are also copied to this mail.
            > > Myself and John will be at SPDK Dev conference and if required we can
            > > elaborate further on this proposal.
            > >  
            > > Thanks,
            > > Srikanth
            > 
            >  
            
        
        
    
    


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-09 15:30 Harris, James R
  0 siblings, 0 replies; 12+ messages in thread
From: Harris, James R @ 2018-07-09 15:30 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 15597 bytes --]

Hi Sriram:

Ben has some later patches in this series with some examples:

https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements this zero copy API in the simple ‘malloc’ bdev module
https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the zero copy API in the SPDK bdevperf utility

-Jim


On 7/9/18, 3:34 AM, "Popuri, Sriram" <Sriram.Popuri(a)netapp.com> wrote:

    Sorry was not focusing on this change. Just give me a day or two to get back.
    From a quick glance I didn't understand how zcopy start/end fits into a read/write life cycle. Is there an example on how zcopy start/end is consumed or someone can do give me a quick dump on how its envisioned to be used.
    
    Regards,
    ~Sriram
    
    -----Original Message-----
    From: Meneghini, John 
    Sent: Friday, July 6, 2018 10:16 PM
    To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
    Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; Popuri, Sriram <Sriram.Popuri(a)netapp.com>
    Subject: Re: BDEV-IO Lifecycle - Need your input.
    
    I'm adding Sriram to this thread.  Sriram is another NetApp engineer who is working on this stuff internally.
    
    /John
    
    On 7/6/18, 12:41 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:
    
        Hi Srikanth,
        
        I wanted to check in to see if you had any additional feedback on the zero copy
        operations in the bdev layer here:
        
        https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
        
        This does not address extending the lifetime of the bdev_io in the NVMe-oF
        target, but I think this is looking like the right mechanism for the bdev layer.
        
        Thanks,
        Ben
        
        On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
        > Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along with
        > spdk_bdev_queue_io_wait() meets your needs then regarding the spdk_bdev_io
        > memory.
        >  
        > Regarding zcopy_start/zcopy_end – it looks like you’ve already added a bunch
        > of comments to Ben’s patch on GerritHub.  For now I’d say let’s continue our
        > discussion there.  I’ve responded to a couple of similar questions there and
        > I’m sure Ben will have more replies tomorrow.
        >  
        > -Jim
        >  
        >  
        > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
        > Date: Thursday, June 21, 2018 at 5:01 PM
        > To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin" <benjamin.walk
        > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
        > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
        > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
        > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
        > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
        > org>
        > Subject: RE: BDEV-IO Lifecycle - Need your input.
        >  
        > Hi Jim,
        >  
        > I wish I joined the community meeting, I also missed the
        > spdk_bdev_queue_io_wait().
        >  
        > So, there are 2 issues I am intending to solve,
        >  
        > 1.       We want to ensure that an instance of bdev-io is acquired prior to or
        > along with I/O data buffer. This allows for better error handling when bdev_io
        > pool is exhausted. Yes, we can solve it via sizing the pool right. The changes
        > queue-io-wait also addresses the problem.
        > 2.       Most importantly, by extending the life of bdev-io (Same life span as
        > nvmf_request structure), abort use case and other use cases that involve
        > cleaning up of IO data buffer are accomplished effectively. Let me elaborate;
        > bdev-io is the fabric that connects the nvme command and operation with the
        > backend. The driver context and data buffer context is stored in bdev-io. The
        > freeing up of bdev_io resource is pushed to the end so the I/O cleanup can
        > happen after controller has transmitted the data to the host. In absence of
        > bdev_io, we will have end up adding more and more void context in request
        > structure. Hence the push to extend the life of bdev_io.
        >  
        > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work
        > as long as the bdev-io allocated/acquired stays till the end. One of the
        > challenges I see with that patch is defining a callback for the I/O
        > submission. For instance, zopy_start will allocate bdev_io and submits the I/O
        > to the bdev device. Underlying implementation can be synchronous or can be
        > asynchronous. The callback for this submission should check to see if it is a
        > BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase
        > is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-
        > START, now the callback should determine if it is a success or failure.
        > Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary
        > data to release the WRITE buffer or unlock the read buffer based on the
        > operation performed.
        >  
        > I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-
        > io and let it continue to host the driver context and buffer context populated
        > during the BEGIN phase.
        >  
        > Regards,
        > Srikanth
        >  
        > From: Harris, James R <james.r.harris(a)intel.com> 
        > Sent: Thursday, June 21, 2018 2:21 PM
        > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
        > jamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
        > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
        > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
        > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists
        > .01.org
        > Subject: Re: BDEV-IO Lifecycle - Need your input.
        >  
        > Hi Srikanth,
        >  
        > Following up on this thread and the discussion in yesterday’s community
        > meeting.
        >  
        > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work
        > better in general when the spdk_bdev_io buffer pool is exhausted.  We still
        > need to make changes to the NVMe-oF target to use this new API but that should
        > be done soon-ish.
        >  
        > With that in place, the spdk_bdev_io buffer pool itself can be configured up
        > or down when the application starts.  Currently the default is 64K
        > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO
        > context size allocated for the bdev module.  This can be up to 192 bytes
        > (virtio bdev module) but is likely much smaller for you depending on the
        > context size for your ontap bdev module.
        >  
        > Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.
        >  
        > I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K
        > seems like a lot but I’d be curious to hear your thoughts on this.  If this is
        > the right number, then worst case, there would be about 16MB of DRAM that
        > would sit unused if the NVMe-oF target in your system was not active.  Is that
        > too burdensome for your application?
        >  
        > Thanks,
        >  
        > -Jim
        >  
        >  
        >  
        > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
        > Date: Wednesday, June 20, 2018 at 12:47 PM
        > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris <james.r.harr
        > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
        > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
        > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
        > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
        > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
        > org>
        > Subject: RE: BDEV-IO Lifecycle - Need your input.
        >  
        > Hello,
        >  
        > First revision of changes to extend the lifecycle of bdev_io are available for
        > review. I would like to solicit your input on the proposed API/Code flow
        > changes.
        >  
        > https://review.gerrithub.io/c/spdk/spdk/+/415860
        >  
        > Thanks,
        > Srikanth
        >  
        >  
        > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
        > Date: Friday, May 11, 2018 at 2:27 PM
        > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, "Harris, James R" <james.r
        > .harris(a)intel.com>
        > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
        > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
        > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
        > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
        > org>
        > Subject: RE: BDEV-IO Lifecycle - Need your input.
        >  
        > CC: List
        >  
        > Hi Ben,
        >  
        > Your proposal to interface with the backend to acquire and release buffers is
        > good. You have accurately stated that the challenge is in developing intuitive
        > semantics. And that has been my struggle. To me, there are two problem
        > statements;
        >  
        > 1.       It is expected that bdev_io pool is sized correctly so the call to
        > get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR.
        > The transport is already capable of handling temporary memory failures by
        > moving the request to PENDING-Queue. Hence the proposal to change the bdev_io
        > lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all
        > buffer needs are addressed at the beginning of I/O request.
        > 2.       I/O Data buffers are sourced and managed by the backend. One of the
        > challenges I see with your proposed interface is the lack of details like if
        > resource is being acquired for a READ operation or WRITE operation. The
        > handling is quite different in each case. Since bdev_io->type is overloaded
        > the type of I/O operation is lost. I suppose one can cast the cb_arg
        > (nvmf_request) and then proceed. Also, the bdev_io should be present to
        > RELEASE the buffer. Zero copy semantics warrants that data buffer stays until
        > controller-to-host has occurred. In other words, bdev_io lives till REQUEST
        > come to COMPLETE state.
        >  
        > What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as
        > an alternative approach to extend the lifecyle of bdev_io and allow data
        > buffer management via bdev fn_table ?
        >  
        > I hope I’m making sense…
        >  
        > Thanks,
        > Srikanth
        >  
        > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
        > Sent: Friday, May 11, 2018 12:28 PM
        > To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.
        > Kaligotla(a)netapp.com>
        > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
        > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
        > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
        > Subject: Re: BDEV-IO Lifecycle - Need your input.
        >  
        > Hi Srikanth,
        >  
        > Yes - we'll need to introduce some way to acquire and release buffers from the
        > bdev layer earlier on in the state machine that processes an NVMe-oF request.
        > I've had this patch out for review for several months as a proposal for this
        > scenario:
        >  
        > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
        >  
        > It doesn't pass the tests - it's just a proposal for the interface. 90% of the
        > challenge here is in developing intuitive semantics.
        >  
        > Thanks,
        > Ben
        >  
        > P.S. This is the kind of discussion that would fit perfectly on the mailing
        > list.
        >  
        > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
        > > Hi Ben, Hi James,
        > >  
        > > I would like to solicit opinions on the lifecycle for bdev-io resource
        > > object. Attached is the image of RDMA state machine in its current
        > > implementation. When the REQUEST enters the NEED-BUFFER state, the buffers
        > > necessary for carrying out I/O operation are allocated/acquired from the
        > > memory pool. An instance of BDEV-IO comes into existence after REQUEST
        > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the
        > > backend returns. From a BDEV perspective, BDEV-IO is simply a translation
        > > unit that facilitates I/O buffers from the backend. The driver context
        > > embedded within bdev_io holds great deal of information pertaining to the
        > > I/O under execution. It assists in error handling, dereferencing the buffers
        > > upon I/O completion and in abort handling. In summary, the bdev_io stays
        > > alive until request has come to COMPLETE state. I’d like to hear peoples
        > > thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-
        > > NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly
        > > have patch available for review that introduces spdk_bdev_init and
        > > spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to
        > >  initialize/cleanup.  
        > >  
        > > I wanted to use this email to communicate our intent and solicit your
        > > feedback. We have a working implementation of the above proposal and prior
        > > to pushing it upstream for review would like to hear your thoughts. These
        > > proposed changes to upstream are a result of FC transport work in
        > > collaboration with the Broadcom team who are also copied to this mail.
        > > Myself and John will be at SPDK Dev conference and if required we can
        > > elaborate further on this proposal.
        > >  
        > > Thanks,
        > > Srikanth
        > 
        >  
        
    
    


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-09 10:34 Popuri, Sriram
  0 siblings, 0 replies; 12+ messages in thread
From: Popuri, Sriram @ 2018-07-09 10:34 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 14122 bytes --]

Sorry was not focusing on this change. Just give me a day or two to get back.
From a quick glance I didn't understand how zcopy start/end fits into a read/write life cycle. Is there an example on how zcopy start/end is consumed or someone can do give me a quick dump on how its envisioned to be used.

Regards,
~Sriram

-----Original Message-----
From: Meneghini, John 
Sent: Friday, July 6, 2018 10:16 PM
To: Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>; Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>
Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; Popuri, Sriram <Sriram.Popuri(a)netapp.com>
Subject: Re: BDEV-IO Lifecycle - Need your input.

I'm adding Sriram to this thread.  Sriram is another NetApp engineer who is working on this stuff internally.

/John

On 7/6/18, 12:41 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

    Hi Srikanth,
    
    I wanted to check in to see if you had any additional feedback on the zero copy
    operations in the bdev layer here:
    
    https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    
    This does not address extending the lifetime of the bdev_io in the NVMe-oF
    target, but I think this is looking like the right mechanism for the bdev layer.
    
    Thanks,
    Ben
    
    On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
    > Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along with
    > spdk_bdev_queue_io_wait() meets your needs then regarding the spdk_bdev_io
    > memory.
    >  
    > Regarding zcopy_start/zcopy_end – it looks like you’ve already added a bunch
    > of comments to Ben’s patch on GerritHub.  For now I’d say let’s continue our
    > discussion there.  I’ve responded to a couple of similar questions there and
    > I’m sure Ben will have more replies tomorrow.
    >  
    > -Jim
    >  
    >  
    > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > Date: Thursday, June 21, 2018 at 5:01 PM
    > To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin" <benjamin.walk
    > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
    > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
    > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
    > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
    > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
    > org>
    > Subject: RE: BDEV-IO Lifecycle - Need your input.
    >  
    > Hi Jim,
    >  
    > I wish I joined the community meeting, I also missed the
    > spdk_bdev_queue_io_wait().
    >  
    > So, there are 2 issues I am intending to solve,
    >  
    > 1.       We want to ensure that an instance of bdev-io is acquired prior to or
    > along with I/O data buffer. This allows for better error handling when bdev_io
    > pool is exhausted. Yes, we can solve it via sizing the pool right. The changes
    > queue-io-wait also addresses the problem.
    > 2.       Most importantly, by extending the life of bdev-io (Same life span as
    > nvmf_request structure), abort use case and other use cases that involve
    > cleaning up of IO data buffer are accomplished effectively. Let me elaborate;
    > bdev-io is the fabric that connects the nvme command and operation with the
    > backend. The driver context and data buffer context is stored in bdev-io. The
    > freeing up of bdev_io resource is pushed to the end so the I/O cleanup can
    > happen after controller has transmitted the data to the host. In absence of
    > bdev_io, we will have end up adding more and more void context in request
    > structure. Hence the push to extend the life of bdev_io.
    >  
    > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work
    > as long as the bdev-io allocated/acquired stays till the end. One of the
    > challenges I see with that patch is defining a callback for the I/O
    > submission. For instance, zopy_start will allocate bdev_io and submits the I/O
    > to the bdev device. Underlying implementation can be synchronous or can be
    > asynchronous. The callback for this submission should check to see if it is a
    > BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase
    > is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-
    > START, now the callback should determine if it is a success or failure.
    > Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary
    > data to release the WRITE buffer or unlock the read buffer based on the
    > operation performed.
    >  
    > I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-
    > io and let it continue to host the driver context and buffer context populated
    > during the BEGIN phase.
    >  
    > Regards,
    > Srikanth
    >  
    > From: Harris, James R <james.r.harris(a)intel.com> 
    > Sent: Thursday, June 21, 2018 2:21 PM
    > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
    > jamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
    > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
    > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
    > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists
    > .01.org
    > Subject: Re: BDEV-IO Lifecycle - Need your input.
    >  
    > Hi Srikanth,
    >  
    > Following up on this thread and the discussion in yesterday’s community
    > meeting.
    >  
    > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work
    > better in general when the spdk_bdev_io buffer pool is exhausted.  We still
    > need to make changes to the NVMe-oF target to use this new API but that should
    > be done soon-ish.
    >  
    > With that in place, the spdk_bdev_io buffer pool itself can be configured up
    > or down when the application starts.  Currently the default is 64K
    > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO
    > context size allocated for the bdev module.  This can be up to 192 bytes
    > (virtio bdev module) but is likely much smaller for you depending on the
    > context size for your ontap bdev module.
    >  
    > Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.
    >  
    > I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K
    > seems like a lot but I’d be curious to hear your thoughts on this.  If this is
    > the right number, then worst case, there would be about 16MB of DRAM that
    > would sit unused if the NVMe-oF target in your system was not active.  Is that
    > too burdensome for your application?
    >  
    > Thanks,
    >  
    > -Jim
    >  
    >  
    >  
    > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > Date: Wednesday, June 20, 2018 at 12:47 PM
    > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris <james.r.harr
    > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
    > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
    > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
    > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
    > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
    > org>
    > Subject: RE: BDEV-IO Lifecycle - Need your input.
    >  
    > Hello,
    >  
    > First revision of changes to extend the lifecycle of bdev_io are available for
    > review. I would like to solicit your input on the proposed API/Code flow
    > changes.
    >  
    > https://review.gerrithub.io/c/spdk/spdk/+/415860
    >  
    > Thanks,
    > Srikanth
    >  
    >  
    > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > Date: Friday, May 11, 2018 at 2:27 PM
    > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, "Harris, James R" <james.r
    > .harris(a)intel.com>
    > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
    > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
    > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
    > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
    > org>
    > Subject: RE: BDEV-IO Lifecycle - Need your input.
    >  
    > CC: List
    >  
    > Hi Ben,
    >  
    > Your proposal to interface with the backend to acquire and release buffers is
    > good. You have accurately stated that the challenge is in developing intuitive
    > semantics. And that has been my struggle. To me, there are two problem
    > statements;
    >  
    > 1.       It is expected that bdev_io pool is sized correctly so the call to
    > get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR.
    > The transport is already capable of handling temporary memory failures by
    > moving the request to PENDING-Queue. Hence the proposal to change the bdev_io
    > lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all
    > buffer needs are addressed at the beginning of I/O request.
    > 2.       I/O Data buffers are sourced and managed by the backend. One of the
    > challenges I see with your proposed interface is the lack of details like if
    > resource is being acquired for a READ operation or WRITE operation. The
    > handling is quite different in each case. Since bdev_io->type is overloaded
    > the type of I/O operation is lost. I suppose one can cast the cb_arg
    > (nvmf_request) and then proceed. Also, the bdev_io should be present to
    > RELEASE the buffer. Zero copy semantics warrants that data buffer stays until
    > controller-to-host has occurred. In other words, bdev_io lives till REQUEST
    > come to COMPLETE state.
    >  
    > What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as
    > an alternative approach to extend the lifecyle of bdev_io and allow data
    > buffer management via bdev fn_table ?
    >  
    > I hope I’m making sense…
    >  
    > Thanks,
    > Srikanth
    >  
    > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
    > Sent: Friday, May 11, 2018 12:28 PM
    > To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.
    > Kaligotla(a)netapp.com>
    > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
    > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
    > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
    > Subject: Re: BDEV-IO Lifecycle - Need your input.
    >  
    > Hi Srikanth,
    >  
    > Yes - we'll need to introduce some way to acquire and release buffers from the
    > bdev layer earlier on in the state machine that processes an NVMe-oF request.
    > I've had this patch out for review for several months as a proposal for this
    > scenario:
    >  
    > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    >  
    > It doesn't pass the tests - it's just a proposal for the interface. 90% of the
    > challenge here is in developing intuitive semantics.
    >  
    > Thanks,
    > Ben
    >  
    > P.S. This is the kind of discussion that would fit perfectly on the mailing
    > list.
    >  
    > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
    > > Hi Ben, Hi James,
    > >  
    > > I would like to solicit opinions on the lifecycle for bdev-io resource
    > > object. Attached is the image of RDMA state machine in its current
    > > implementation. When the REQUEST enters the NEED-BUFFER state, the buffers
    > > necessary for carrying out I/O operation are allocated/acquired from the
    > > memory pool. An instance of BDEV-IO comes into existence after REQUEST
    > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the
    > > backend returns. From a BDEV perspective, BDEV-IO is simply a translation
    > > unit that facilitates I/O buffers from the backend. The driver context
    > > embedded within bdev_io holds great deal of information pertaining to the
    > > I/O under execution. It assists in error handling, dereferencing the buffers
    > > upon I/O completion and in abort handling. In summary, the bdev_io stays
    > > alive until request has come to COMPLETE state. I’d like to hear peoples
    > > thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-
    > > NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly
    > > have patch available for review that introduces spdk_bdev_init and
    > > spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to
    > >  initialize/cleanup.  
    > >  
    > > I wanted to use this email to communicate our intent and solicit your
    > > feedback. We have a working implementation of the above proposal and prior
    > > to pushing it upstream for review would like to hear your thoughts. These
    > > proposed changes to upstream are a result of FC transport work in
    > > collaboration with the Broadcom team who are also copied to this mail.
    > > Myself and John will be at SPDK Dev conference and if required we can
    > > elaborate further on this proposal.
    > >  
    > > Thanks,
    > > Srikanth
    > 
    >  
    


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-06 16:46 Meneghini, John
  0 siblings, 0 replies; 12+ messages in thread
From: Meneghini, John @ 2018-07-06 16:46 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 13202 bytes --]

I'm adding Sriram to this thread.  Sriram is another NetApp engineer who is working on this stuff internally.

/John

On 7/6/18, 12:41 PM, "Walker, Benjamin" <benjamin.walker(a)intel.com> wrote:

    Hi Srikanth,
    
    I wanted to check in to see if you had any additional feedback on the zero copy
    operations in the bdev layer here:
    
    https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    
    This does not address extending the lifetime of the bdev_io in the NVMe-oF
    target, but I think this is looking like the right mechanism for the bdev layer.
    
    Thanks,
    Ben
    
    On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
    > Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along with
    > spdk_bdev_queue_io_wait() meets your needs then regarding the spdk_bdev_io
    > memory.
    >  
    > Regarding zcopy_start/zcopy_end – it looks like you’ve already added a bunch
    > of comments to Ben’s patch on GerritHub.  For now I’d say let’s continue our
    > discussion there.  I’ve responded to a couple of similar questions there and
    > I’m sure Ben will have more replies tomorrow.
    >  
    > -Jim
    >  
    >  
    > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > Date: Thursday, June 21, 2018 at 5:01 PM
    > To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin" <benjamin.walk
    > er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
    > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
    > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
    > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
    > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
    > org>
    > Subject: RE: BDEV-IO Lifecycle - Need your input.
    >  
    > Hi Jim,
    >  
    > I wish I joined the community meeting, I also missed the
    > spdk_bdev_queue_io_wait().
    >  
    > So, there are 2 issues I am intending to solve,
    >  
    > 1.       We want to ensure that an instance of bdev-io is acquired prior to or
    > along with I/O data buffer. This allows for better error handling when bdev_io
    > pool is exhausted. Yes, we can solve it via sizing the pool right. The changes
    > queue-io-wait also addresses the problem.
    > 2.       Most importantly, by extending the life of bdev-io (Same life span as
    > nvmf_request structure), abort use case and other use cases that involve
    > cleaning up of IO data buffer are accomplished effectively. Let me elaborate;
    > bdev-io is the fabric that connects the nvme command and operation with the
    > backend. The driver context and data buffer context is stored in bdev-io. The
    > freeing up of bdev_io resource is pushed to the end so the I/O cleanup can
    > happen after controller has transmitted the data to the host. In absence of
    > bdev_io, we will have end up adding more and more void context in request
    > structure. Hence the push to extend the life of bdev_io.
    >  
    > I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work
    > as long as the bdev-io allocated/acquired stays till the end. One of the
    > challenges I see with that patch is defining a callback for the I/O
    > submission. For instance, zopy_start will allocate bdev_io and submits the I/O
    > to the bdev device. Underlying implementation can be synchronous or can be
    > asynchronous. The callback for this submission should check to see if it is a
    > BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase
    > is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-
    > START, now the callback should determine if it is a success or failure.
    > Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary
    > data to release the WRITE buffer or unlock the read buffer based on the
    > operation performed.
    >  
    > I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-
    > io and let it continue to host the driver context and buffer context populated
    > during the BEGIN phase.
    >  
    > Regards,
    > Srikanth
    >  
    > From: Harris, James R <james.r.harris(a)intel.com> 
    > Sent: Thursday, June 21, 2018 2:21 PM
    > To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
    > jamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
    > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
    > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
    > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists
    > .01.org
    > Subject: Re: BDEV-IO Lifecycle - Need your input.
    >  
    > Hi Srikanth,
    >  
    > Following up on this thread and the discussion in yesterday’s community
    > meeting.
    >  
    > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work
    > better in general when the spdk_bdev_io buffer pool is exhausted.  We still
    > need to make changes to the NVMe-oF target to use this new API but that should
    > be done soon-ish.
    >  
    > With that in place, the spdk_bdev_io buffer pool itself can be configured up
    > or down when the application starts.  Currently the default is 64K
    > spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO
    > context size allocated for the bdev module.  This can be up to 192 bytes
    > (virtio bdev module) but is likely much smaller for you depending on the
    > context size for your ontap bdev module.
    >  
    > Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.
    >  
    > I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K
    > seems like a lot but I’d be curious to hear your thoughts on this.  If this is
    > the right number, then worst case, there would be about 16MB of DRAM that
    > would sit unused if the NVMe-oF target in your system was not active.  Is that
    > too burdensome for your application?
    >  
    > Thanks,
    >  
    > -Jim
    >  
    >  
    >  
    > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > Date: Wednesday, June 20, 2018 at 12:47 PM
    > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris <james.r.harr
    > is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
    > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
    > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
    > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
    > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
    > org>
    > Subject: RE: BDEV-IO Lifecycle - Need your input.
    >  
    > Hello,
    >  
    > First revision of changes to extend the lifecycle of bdev_io are available for
    > review. I would like to solicit your input on the proposed API/Code flow
    > changes.
    >  
    > https://review.gerrithub.io/c/spdk/spdk/+/415860
    >  
    > Thanks,
    > Srikanth
    >  
    >  
    > From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
    > Date: Friday, May 11, 2018 at 2:27 PM
    > To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, "Harris, James R" <james.r
    > .harris(a)intel.com>
    > Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
    > "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
    > z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
    > broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
    > org>
    > Subject: RE: BDEV-IO Lifecycle - Need your input.
    >  
    > CC: List
    >  
    > Hi Ben,
    >  
    > Your proposal to interface with the backend to acquire and release buffers is
    > good. You have accurately stated that the challenge is in developing intuitive
    > semantics. And that has been my struggle. To me, there are two problem
    > statements;
    >  
    > 1.       It is expected that bdev_io pool is sized correctly so the call to
    > get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR.
    > The transport is already capable of handling temporary memory failures by
    > moving the request to PENDING-Queue. Hence the proposal to change the bdev_io
    > lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all
    > buffer needs are addressed at the beginning of I/O request.
    > 2.       I/O Data buffers are sourced and managed by the backend. One of the
    > challenges I see with your proposed interface is the lack of details like if
    > resource is being acquired for a READ operation or WRITE operation. The
    > handling is quite different in each case. Since bdev_io->type is overloaded
    > the type of I/O operation is lost. I suppose one can cast the cb_arg
    > (nvmf_request) and then proceed. Also, the bdev_io should be present to
    > RELEASE the buffer. Zero copy semantics warrants that data buffer stays until
    > controller-to-host has occurred. In other words, bdev_io lives till REQUEST
    > come to COMPLETE state.
    >  
    > What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as
    > an alternative approach to extend the lifecyle of bdev_io and allow data
    > buffer management via bdev fn_table ?
    >  
    > I hope I’m making sense…
    >  
    > Thanks,
    > Srikanth
    >  
    > From: Walker, Benjamin <benjamin.walker(a)intel.com> 
    > Sent: Friday, May 11, 2018 12:28 PM
    > To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.
    > Kaligotla(a)netapp.com>
    > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
    > >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
    > pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
    > Subject: Re: BDEV-IO Lifecycle - Need your input.
    >  
    > Hi Srikanth,
    >  
    > Yes - we'll need to introduce some way to acquire and release buffers from the
    > bdev layer earlier on in the state machine that processes an NVMe-oF request.
    > I've had this patch out for review for several months as a proposal for this
    > scenario:
    >  
    > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
    >  
    > It doesn't pass the tests - it's just a proposal for the interface. 90% of the
    > challenge here is in developing intuitive semantics.
    >  
    > Thanks,
    > Ben
    >  
    > P.S. This is the kind of discussion that would fit perfectly on the mailing
    > list.
    >  
    > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
    > > Hi Ben, Hi James,
    > >  
    > > I would like to solicit opinions on the lifecycle for bdev-io resource
    > > object. Attached is the image of RDMA state machine in its current
    > > implementation. When the REQUEST enters the NEED-BUFFER state, the buffers
    > > necessary for carrying out I/O operation are allocated/acquired from the
    > > memory pool. An instance of BDEV-IO comes into existence after REQUEST
    > > reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the
    > > backend returns. From a BDEV perspective, BDEV-IO is simply a translation
    > > unit that facilitates I/O buffers from the backend. The driver context
    > > embedded within bdev_io holds great deal of information pertaining to the
    > > I/O under execution. It assists in error handling, dereferencing the buffers
    > > upon I/O completion and in abort handling. In summary, the bdev_io stays
    > > alive until request has come to COMPLETE state. I’d like to hear peoples
    > > thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-
    > > NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly
    > > have patch available for review that introduces spdk_bdev_init and
    > > spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to
    > >  initialize/cleanup.  
    > >  
    > > I wanted to use this email to communicate our intent and solicit your
    > > feedback. We have a working implementation of the above proposal and prior
    > > to pushing it upstream for review would like to hear your thoughts. These
    > > proposed changes to upstream are a result of FC transport work in
    > > collaboration with the Broadcom team who are also copied to this mail.
    > > Myself and John will be at SPDK Dev conference and if required we can
    > > elaborate further on this proposal.
    > >  
    > > Thanks,
    > > Srikanth
    > 
    >  
    


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-07-06 16:41 Walker, Benjamin
  0 siblings, 0 replies; 12+ messages in thread
From: Walker, Benjamin @ 2018-07-06 16:41 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12001 bytes --]

Hi Srikanth,

I wanted to check in to see if you had any additional feedback on the zero copy
operations in the bdev layer here:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

This does not address extending the lifetime of the bdev_io in the NVMe-oF
target, but I think this is looking like the right mechanism for the bdev layer.

Thanks,
Ben

On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along with
> spdk_bdev_queue_io_wait() meets your needs then regarding the spdk_bdev_io
> memory.
>  
> Regarding zcopy_start/zcopy_end – it looks like you’ve already added a bunch
> of comments to Ben’s patch on GerritHub.  For now I’d say let’s continue our
> discussion there.  I’ve responded to a couple of similar questions there and
> I’m sure Ben will have more replies tomorrow.
>  
> -Jim
>  
>  
> From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> Date: Thursday, June 21, 2018 at 5:01 PM
> To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin" <benjamin.walk
> er(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
> "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
> z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
> broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
> org>
> Subject: RE: BDEV-IO Lifecycle - Need your input.
>  
> Hi Jim,
>  
> I wish I joined the community meeting, I also missed the
> spdk_bdev_queue_io_wait().
>  
> So, there are 2 issues I am intending to solve,
>  
> 1.       We want to ensure that an instance of bdev-io is acquired prior to or
> along with I/O data buffer. This allows for better error handling when bdev_io
> pool is exhausted. Yes, we can solve it via sizing the pool right. The changes
> queue-io-wait also addresses the problem.
> 2.       Most importantly, by extending the life of bdev-io (Same life span as
> nvmf_request structure), abort use case and other use cases that involve
> cleaning up of IO data buffer are accomplished effectively. Let me elaborate;
> bdev-io is the fabric that connects the nvme command and operation with the
> backend. The driver context and data buffer context is stored in bdev-io. The
> freeing up of bdev_io resource is pushed to the end so the I/O cleanup can
> happen after controller has transmitted the data to the host. In absence of
> bdev_io, we will have end up adding more and more void context in request
> structure. Hence the push to extend the life of bdev_io.
>  
> I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work
> as long as the bdev-io allocated/acquired stays till the end. One of the
> challenges I see with that patch is defining a callback for the I/O
> submission. For instance, zopy_start will allocate bdev_io and submits the I/O
> to the bdev device. Underlying implementation can be synchronous or can be
> asynchronous. The callback for this submission should check to see if it is a
> BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase
> is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-
> START, now the callback should determine if it is a success or failure.
> Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary
> data to release the WRITE buffer or unlock the read buffer based on the
> operation performed.
>  
> I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-
> io and let it continue to host the driver context and buffer context populated
> during the BEGIN phase.
>  
> Regards,
> Srikanth
>  
> From: Harris, James R <james.r.harris(a)intel.com> 
> Sent: Thursday, June 21, 2018 2:21 PM
> To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <ben
> jamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
> Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
> >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
> pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists
> .01.org
> Subject: Re: BDEV-IO Lifecycle - Need your input.
>  
> Hi Srikanth,
>  
> Following up on this thread and the discussion in yesterday’s community
> meeting.
>  
> The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work
> better in general when the spdk_bdev_io buffer pool is exhausted.  We still
> need to make changes to the NVMe-oF target to use this new API but that should
> be done soon-ish.
>  
> With that in place, the spdk_bdev_io buffer pool itself can be configured up
> or down when the application starts.  Currently the default is 64K
> spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO
> context size allocated for the bdev module.  This can be up to 192 bytes
> (virtio bdev module) but is likely much smaller for you depending on the
> context size for your ontap bdev module.
>  
> Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.
>  
> I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K
> seems like a lot but I’d be curious to hear your thoughts on this.  If this is
> the right number, then worst case, there would be about 16MB of DRAM that
> would sit unused if the NVMe-oF target in your system was not active.  Is that
> too burdensome for your application?
>  
> Thanks,
>  
> -Jim
>  
>  
>  
> From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> Date: Wednesday, June 20, 2018 at 12:47 PM
> To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris <james.r.harr
> is(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
> Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
> "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
> z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
> broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
> org>
> Subject: RE: BDEV-IO Lifecycle - Need your input.
>  
> Hello,
>  
> First revision of changes to extend the lifecycle of bdev_io are available for
> review. I would like to solicit your input on the proposed API/Code flow
> changes.
>  
> https://review.gerrithub.io/c/spdk/spdk/+/415860
>  
> Thanks,
> Srikanth
>  
>  
> From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
> Date: Friday, May 11, 2018 at 2:27 PM
> To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, "Harris, James R" <james.r
> .harris(a)intel.com>
> Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>,
> "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodrigue
> z(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-
> broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.
> org>
> Subject: RE: BDEV-IO Lifecycle - Need your input.
>  
> CC: List
>  
> Hi Ben,
>  
> Your proposal to interface with the backend to acquire and release buffers is
> good. You have accurately stated that the challenge is in developing intuitive
> semantics. And that has been my struggle. To me, there are two problem
> statements;
>  
> 1.       It is expected that bdev_io pool is sized correctly so the call to
> get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR.
> The transport is already capable of handling temporary memory failures by
> moving the request to PENDING-Queue. Hence the proposal to change the bdev_io
> lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all
> buffer needs are addressed at the beginning of I/O request.
> 2.       I/O Data buffers are sourced and managed by the backend. One of the
> challenges I see with your proposed interface is the lack of details like if
> resource is being acquired for a READ operation or WRITE operation. The
> handling is quite different in each case. Since bdev_io->type is overloaded
> the type of I/O operation is lost. I suppose one can cast the cb_arg
> (nvmf_request) and then proceed. Also, the bdev_io should be present to
> RELEASE the buffer. Zero copy semantics warrants that data buffer stays until
> controller-to-host has occurred. In other words, bdev_io lives till REQUEST
> come to COMPLETE state.
>  
> What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as
> an alternative approach to extend the lifecyle of bdev_io and allow data
> buffer management via bdev fn_table ?
>  
> I hope I’m making sense…
>  
> Thanks,
> Srikanth
>  
> From: Walker, Benjamin <benjamin.walker(a)intel.com> 
> Sent: Friday, May 11, 2018 12:28 PM
> To: Harris, James R <james.r.harris(a)intel.com>; Kaligotla, Srikanth <Srikanth.
> Kaligotla(a)netapp.com>
> Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com
> >; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)neta
> pp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>
> Subject: Re: BDEV-IO Lifecycle - Need your input.
>  
> Hi Srikanth,
>  
> Yes - we'll need to introduce some way to acquire and release buffers from the
> bdev layer earlier on in the state machine that processes an NVMe-oF request.
> I've had this patch out for review for several months as a proposal for this
> scenario:
>  
> https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>  
> It doesn't pass the tests - it's just a proposal for the interface. 90% of the
> challenge here is in developing intuitive semantics.
>  
> Thanks,
> Ben
>  
> P.S. This is the kind of discussion that would fit perfectly on the mailing
> list.
>  
> On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> > Hi Ben, Hi James,
> >  
> > I would like to solicit opinions on the lifecycle for bdev-io resource
> > object. Attached is the image of RDMA state machine in its current
> > implementation. When the REQUEST enters the NEED-BUFFER state, the buffers
> > necessary for carrying out I/O operation are allocated/acquired from the
> > memory pool. An instance of BDEV-IO comes into existence after REQUEST
> > reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the
> > backend returns. From a BDEV perspective, BDEV-IO is simply a translation
> > unit that facilitates I/O buffers from the backend. The driver context
> > embedded within bdev_io holds great deal of information pertaining to the
> > I/O under execution. It assists in error handling, dereferencing the buffers
> > upon I/O completion and in abort handling. In summary, the bdev_io stays
> > alive until request has come to COMPLETE state. I’d like to hear peoples
> > thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-
> > NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly
> > have patch available for review that introduces spdk_bdev_init and
> > spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to
> >  initialize/cleanup.  
> >  
> > I wanted to use this email to communicate our intent and solicit your
> > feedback. We have a working implementation of the above proposal and prior
> > to pushing it upstream for review would like to hear your thoughts. These
> > proposed changes to upstream are a result of FC transport work in
> > collaboration with the Broadcom team who are also copied to this mail.
> > Myself and John will be at SPDK Dev conference and if required we can
> > elaborate further on this proposal.
> >  
> > Thanks,
> > Srikanth
> 
>  

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-06-22  1:11 Harris, James R
  0 siblings, 0 replies; 12+ messages in thread
From: Harris, James R @ 2018-06-22  1:11 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 12026 bytes --]

Thanks Srikanth.  Sounds like the spdk_bdev_io pool sizing along with spdk_bdev_queue_io_wait() meets your needs then regarding the spdk_bdev_io memory.

Regarding zcopy_start/zcopy_end – it looks like you’ve already added a bunch of comments to Ben’s patch on GerritHub.  For now I’d say let’s continue our discussion there.  I’ve responded to a couple of similar questions there and I’m sure Ben will have more replies tomorrow.

-Jim


From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
Date: Thursday, June 21, 2018 at 5:01 PM
To: James Harris <james.r.harris(a)intel.com>, "Walker, Benjamin" <benjamin.walker(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>, "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Subject: RE: BDEV-IO Lifecycle - Need your input.

Hi Jim,

I wish I joined the community meeting, I also missed the spdk_bdev_queue_io_wait().

So, there are 2 issues I am intending to solve,

1.       We want to ensure that an instance of bdev-io is acquired prior to or along with I/O data buffer. This allows for better error handling when bdev_io pool is exhausted. Yes, we can solve it via sizing the pool right. The changes queue-io-wait also addresses the problem.
2.       Most importantly, by extending the life of bdev-io (Same life span as nvmf_request structure), abort use case and other use cases that involve cleaning up of IO data buffer are accomplished effectively. Let me elaborate; bdev-io is the fabric that connects the nvme command and operation with the backend. The driver context and data buffer context is stored in bdev-io. The freeing up of bdev_io resource is pushed to the end so the I/O cleanup can happen after controller has transmitted the data to the host. In absence of bdev_io, we will have end up adding more and more void context in request structure. Hence the push to extend the life of bdev_io.

I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work as long as the bdev-io allocated/acquired stays till the end. One of the challenges I see with that patch is defining a callback for the I/O submission. For instance, zopy_start will allocate bdev_io and submits the I/O to the bdev device. Underlying implementation can be synchronous or can be asynchronous. The callback for this submission should check to see if it is a BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-START, now the callback should determine if it is a success or failure. Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary data to release the WRITE buffer or unlock the read buffer based on the operation performed.

I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-io and let it continue to host the driver context and buffer context populated during the BEGIN phase.

Regards,
Srikanth

From: Harris, James R <james.r.harris(a)intel.com>
Sent: Thursday, June 21, 2018 2:21 PM
To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists.01.org
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Following up on this thread and the discussion in yesterday’s community meeting.

The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work better in general when the spdk_bdev_io buffer pool is exhausted.  We still need to make changes to the NVMe-oF target to use this new API but that should be done soon-ish.

With that in place, the spdk_bdev_io buffer pool itself can be configured up or down when the application starts.  Currently the default is 64K spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO context size allocated for the bdev module.  This can be up to 192 bytes (virtio bdev module) but is likely much smaller for you depending on the context size for your ontap bdev module.

Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.

I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K seems like a lot but I’d be curious to hear your thoughts on this.  If this is the right number, then worst case, there would be about 16MB of DRAM that would sit unused if the NVMe-oF target in your system was not active.  Is that too burdensome for your application?

Thanks,

-Jim



From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Date: Wednesday, June 20, 2018 at 12:47 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>, James Harris <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>, Daniel Verkamp <daniel.verkamp(a)intel.com<mailto:daniel.verkamp(a)intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>" <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.

Hello,

First revision of changes to extend the lifecycle of bdev_io are available for review. I would like to solicit your input on the proposed API/Code flow changes.

https://review.gerrithub.io/c/spdk/spdk/+/415860

Thanks,
Srikanth


From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Date: Friday, May 11, 2018 at 2:27 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>, "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>" <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.

CC: List

Hi Ben,

Your proposal<https://review.gerrithub.io/#/c/spdk/spdk/+/386166/> to interface with the backend to acquire and release buffers is good. You have accurately stated that the challenge is in developing intuitive semantics. And that has been my struggle. To me, there are two problem statements;

1.       It is expected that bdev_io pool is sized correctly so the call to get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR. The transport is already capable of handling temporary memory failures by moving the request to PENDING-Queue. Hence the proposal to change the bdev_io lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all buffer needs are addressed at the beginning of I/O request.
2.       I/O Data buffers are sourced and managed by the backend. One of the challenges I see with your proposed interface is the lack of details like if resource is being acquired for a READ operation or WRITE operation. The handling is quite different in each case. Since bdev_io->type is overloaded the type of I/O operation is lost. I suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the bdev_io should be present to RELEASE the buffer. Zero copy semantics warrants that data buffer stays until controller-to-host has occurred. In other words, bdev_io lives till REQUEST come to COMPLETE state.

What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as an alternative approach to extend the lifecyle of bdev_io and allow data buffer management via bdev fn_table ?

I hope I’m making sense…

Thanks,
Srikanth

From: Walker, Benjamin <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>
Sent: Friday, May 11, 2018 12:28 PM
To: Harris, James R <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Cc: raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>; Meneghini, John <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>; Pai, Madhu <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Yes - we'll need to introduce some way to acquire and release buffers from the bdev layer earlier on in the state machine that processes an NVMe-oF request. I've had this patch out for review for several months as a proposal for this scenario:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

It doesn't pass the tests - it's just a proposal for the interface. 90% of the challenge here is in developing intuitive semantics.

Thanks,
Ben

P.S. This is the kind of discussion that would fit perfectly on the mailing list.

On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
Hi Ben, Hi James,

I would like to solicit opinions on the lifecycle for bdev-io resource object. Attached is the image of RDMA state machine in its current implementation. When the REQUEST enters the NEED-BUFFER state, the buffers necessary for carrying out I/O operation are allocated/acquired from the memory pool. An instance of BDEV-IO comes into existence after REQUEST reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the backend returns. From a BDEV perspective, BDEV-IO is simply a translation unit that facilitates I/O buffers from the backend. The driver context embedded within bdev_io holds great deal of information pertaining to the I/O under execution. It assists in error handling, dereferencing the buffers upon I/O completion and in abort handling. In summary, the bdev_io stays alive until request has come to COMPLETE state. I’d like to hear peoples thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly have patch available for review that introduces spdk_bdev_init and spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to  initialize/cleanup.

I wanted to use this email to communicate our intent and solicit your feedback. We have a working implementation of the above proposal and prior to pushing it upstream for review would like to hear your thoughts. These proposed changes to upstream are a result of FC transport work in collaboration with the Broadcom team who are also copied to this mail. Myself and John will be at SPDK Dev conference and if required we can elaborate further on this proposal.

Thanks,
Srikanth


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 28484 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-06-22  0:01 Kaligotla, Srikanth
  0 siblings, 0 replies; 12+ messages in thread
From: Kaligotla, Srikanth @ 2018-06-22  0:01 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 10938 bytes --]

Hi Jim,

I wish I joined the community meeting, I also missed the spdk_bdev_queue_io_wait().

So, there are 2 issues I am intending to solve,


  1.  We want to ensure that an instance of bdev-io is acquired prior to or along with I/O data buffer. This allows for better error handling when bdev_io pool is exhausted. Yes, we can solve it via sizing the pool right. The changes queue-io-wait also addresses the problem.
  2.  Most importantly, by extending the life of bdev-io (Same life span as nvmf_request structure), abort use case and other use cases that involve cleaning up of IO data buffer are accomplished effectively. Let me elaborate; bdev-io is the fabric that connects the nvme command and operation with the backend. The driver context and data buffer context is stored in bdev-io. The freeing up of bdev_io resource is pushed to the end so the I/O cleanup can happen after controller has transmitted the data to the host. In absence of bdev_io, we will have end up adding more and more void context in request structure. Hence the push to extend the life of bdev_io.

I see Ben’s recent patch “zcopy_start” and “zcopy_end”; We can make that work as long as the bdev-io allocated/acquired stays till the end. One of the challenges I see with that patch is defining a callback for the I/O submission. For instance, zopy_start will allocate bdev_io and submits the I/O to the bdev device. Underlying implementation can be synchronous or can be asynchronous. The callback for this submission should check to see if it is a BUFFER-READY or PENDING and accordingly relay it back to transport. Next phase is actual I/O submission. Let’s say, it reuses the bdev-io obtained in ZCOPY-START, now the callback should determine if it is a success or failure. Finally when ZCOPY-END is invoked the supplied bdev_io will have all necessary data to release the WRITE buffer or unlock the read buffer based on the operation performed.

I hope I’m making sense. I guess, my effort is to extend the lifespan of bdev-io and let it continue to host the driver context and buffer context populated during the BEGIN phase.

Regards,
Srikanth

From: Harris, James R <james.r.harris(a)intel.com>
Sent: Thursday, June 21, 2018 2:21 PM
To: Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com>; Walker, Benjamin <benjamin.walker(a)intel.com>; Verkamp, Daniel <daniel.verkamp(a)intel.com>
Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John <John.Meneghini(a)netapp.com>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com>; Pai, Madhu <Madhusudan.Pai(a)netapp.com>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com>; spdk(a)lists.01.org
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Following up on this thread and the discussion in yesterday’s community meeting.

The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work better in general when the spdk_bdev_io buffer pool is exhausted.  We still need to make changes to the NVMe-oF target to use this new API but that should be done soon-ish.

With that in place, the spdk_bdev_io buffer pool itself can be configured up or down when the application starts.  Currently the default is 64K spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO context size allocated for the bdev module.  This can be up to 192 bytes (virtio bdev module) but is likely much smaller for you depending on the context size for your ontap bdev module.

Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.

I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K seems like a lot but I’d be curious to hear your thoughts on this.  If this is the right number, then worst case, there would be about 16MB of DRAM that would sit unused if the NVMe-oF target in your system was not active.  Is that too burdensome for your application?

Thanks,

-Jim



From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Date: Wednesday, June 20, 2018 at 12:47 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>, James Harris <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>, Daniel Verkamp <daniel.verkamp(a)intel.com<mailto:daniel.verkamp(a)intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>" <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.

Hello,

First revision of changes to extend the lifecycle of bdev_io are available for review. I would like to solicit your input on the proposed API/Code flow changes.

https://review.gerrithub.io/c/spdk/spdk/+/415860

Thanks,
Srikanth


From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Date: Friday, May 11, 2018 at 2:27 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>, "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>" <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.

CC: List

Hi Ben,

Your proposal<https://review.gerrithub.io/#/c/spdk/spdk/+/386166/> to interface with the backend to acquire and release buffers is good. You have accurately stated that the challenge is in developing intuitive semantics. And that has been my struggle. To me, there are two problem statements;

1.       It is expected that bdev_io pool is sized correctly so the call to get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR. The transport is already capable of handling temporary memory failures by moving the request to PENDING-Queue. Hence the proposal to change the bdev_io lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all buffer needs are addressed at the beginning of I/O request.
2.       I/O Data buffers are sourced and managed by the backend. One of the challenges I see with your proposed interface is the lack of details like if resource is being acquired for a READ operation or WRITE operation. The handling is quite different in each case. Since bdev_io->type is overloaded the type of I/O operation is lost. I suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the bdev_io should be present to RELEASE the buffer. Zero copy semantics warrants that data buffer stays until controller-to-host has occurred. In other words, bdev_io lives till REQUEST come to COMPLETE state.

What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as an alternative approach to extend the lifecyle of bdev_io and allow data buffer management via bdev fn_table ?

I hope I’m making sense…

Thanks,
Srikanth

From: Walker, Benjamin <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>
Sent: Friday, May 11, 2018 12:28 PM
To: Harris, James R <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Cc: raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>; Meneghini, John <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>; Pai, Madhu <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Yes - we'll need to introduce some way to acquire and release buffers from the bdev layer earlier on in the state machine that processes an NVMe-oF request. I've had this patch out for review for several months as a proposal for this scenario:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

It doesn't pass the tests - it's just a proposal for the interface. 90% of the challenge here is in developing intuitive semantics.

Thanks,
Ben

P.S. This is the kind of discussion that would fit perfectly on the mailing list.

On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
Hi Ben, Hi James,

I would like to solicit opinions on the lifecycle for bdev-io resource object. Attached is the image of RDMA state machine in its current implementation. When the REQUEST enters the NEED-BUFFER state, the buffers necessary for carrying out I/O operation are allocated/acquired from the memory pool. An instance of BDEV-IO comes into existence after REQUEST reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the backend returns. From a BDEV perspective, BDEV-IO is simply a translation unit that facilitates I/O buffers from the backend. The driver context embedded within bdev_io holds great deal of information pertaining to the I/O under execution. It assists in error handling, dereferencing the buffers upon I/O completion and in abort handling. In summary, the bdev_io stays alive until request has come to COMPLETE state. I’d like to hear peoples thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly have patch available for review that introduces spdk_bdev_init and spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to  initialize/cleanup.

I wanted to use this email to communicate our intent and solicit your feedback. We have a working implementation of the above proposal and prior to pushing it upstream for review would like to hear your thoughts. These proposed changes to upstream are a result of FC transport work in collaboration with the Broadcom team who are also copied to this mail. Myself and John will be at SPDK Dev conference and if required we can elaborate further on this proposal.

Thanks,
Srikanth


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 25364 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-06-21 18:20 Harris, James R
  0 siblings, 0 replies; 12+ messages in thread
From: Harris, James R @ 2018-06-21 18:20 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 7792 bytes --]

Hi Srikanth,

Following up on this thread and the discussion in yesterday’s community meeting.

The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to work better in general when the spdk_bdev_io buffer pool is exhausted.  We still need to make changes to the NVMe-oF target to use this new API but that should be done soon-ish.

With that in place, the spdk_bdev_io buffer pool itself can be configured up or down when the application starts.  Currently the default is 64K spdk_bdev_io buffers.  sizeof(struct spdk_bdev_io) == 216 plus the per-IO context size allocated for the bdev module.  This can be up to 192 bytes (virtio bdev module) but is likely much smaller for you depending on the context size for your ontap bdev module.

Let’s assume your per-IO context size is 64 bytes.  64K x (192 + 64) = 16MB.

I’m not sure how many spdk_bdev_io you need in flight at any given time.  64K seems like a lot but I’d be curious to hear your thoughts on this.  If this is the right number, then worst case, there would be about 16MB of DRAM that would sit unused if the NVMe-oF target in your system was not active.  Is that too burdensome for your application?

Thanks,

-Jim



From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com>
Date: Wednesday, June 20, 2018 at 12:47 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com>, James Harris <james.r.harris(a)intel.com>, Daniel Verkamp <daniel.verkamp(a)intel.com>
Cc: "raju.gottumukkala(a)broadcom.com" <raju.gottumukkala(a)broadcom.com>, "Meneghini, John" <John.Meneghini(a)netapp.com>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com>, "spdk(a)lists.01.org" <spdk(a)lists.01.org>
Subject: RE: BDEV-IO Lifecycle - Need your input.

Hello,

First revision of changes to extend the lifecycle of bdev_io are available for review. I would like to solicit your input on the proposed API/Code flow changes.

https://review.gerrithub.io/c/spdk/spdk/+/415860

Thanks,
Srikanth


From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Date: Friday, May 11, 2018 at 2:27 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>, "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>" <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.

CC: List

Hi Ben,

Your proposal<https://review.gerrithub.io/#/c/spdk/spdk/+/386166/> to interface with the backend to acquire and release buffers is good. You have accurately stated that the challenge is in developing intuitive semantics. And that has been my struggle. To me, there are two problem statements;

1.       It is expected that bdev_io pool is sized correctly so the call to get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR. The transport is already capable of handling temporary memory failures by moving the request to PENDING-Queue. Hence the proposal to change the bdev_io lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all buffer needs are addressed at the beginning of I/O request.
2.       I/O Data buffers are sourced and managed by the backend. One of the challenges I see with your proposed interface is the lack of details like if resource is being acquired for a READ operation or WRITE operation. The handling is quite different in each case. Since bdev_io->type is overloaded the type of I/O operation is lost. I suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the bdev_io should be present to RELEASE the buffer. Zero copy semantics warrants that data buffer stays until controller-to-host has occurred. In other words, bdev_io lives till REQUEST come to COMPLETE state.

What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as an alternative approach to extend the lifecyle of bdev_io and allow data buffer management via bdev fn_table ?

I hope I’m making sense…

Thanks,
Srikanth

From: Walker, Benjamin <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>
Sent: Friday, May 11, 2018 12:28 PM
To: Harris, James R <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Cc: raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>; Meneghini, John <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>; Pai, Madhu <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Yes - we'll need to introduce some way to acquire and release buffers from the bdev layer earlier on in the state machine that processes an NVMe-oF request. I've had this patch out for review for several months as a proposal for this scenario:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

It doesn't pass the tests - it's just a proposal for the interface. 90% of the challenge here is in developing intuitive semantics.

Thanks,
Ben

P.S. This is the kind of discussion that would fit perfectly on the mailing list.

On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
Hi Ben, Hi James,

I would like to solicit opinions on the lifecycle for bdev-io resource object. Attached is the image of RDMA state machine in its current implementation. When the REQUEST enters the NEED-BUFFER state, the buffers necessary for carrying out I/O operation are allocated/acquired from the memory pool. An instance of BDEV-IO comes into existence after REQUEST reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the backend returns. From a BDEV perspective, BDEV-IO is simply a translation unit that facilitates I/O buffers from the backend. The driver context embedded within bdev_io holds great deal of information pertaining to the I/O under execution. It assists in error handling, dereferencing the buffers upon I/O completion and in abort handling. In summary, the bdev_io stays alive until request has come to COMPLETE state. I’d like to hear peoples thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly have patch available for review that introduces spdk_bdev_init and spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to  initialize/cleanup.

I wanted to use this email to communicate our intent and solicit your feedback. We have a working implementation of the above proposal and prior to pushing it upstream for review would like to hear your thoughts. These proposed changes to upstream are a result of FC transport work in collaboration with the Broadcom team who are also copied to this mail. Myself and John will be at SPDK Dev conference and if required we can elaborate further on this proposal.

Thanks,
Srikanth


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 18528 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [SPDK] BDEV-IO Lifecycle - Need your input.
@ 2018-06-20 19:47 Kaligotla, Srikanth
  0 siblings, 0 replies; 12+ messages in thread
From: Kaligotla, Srikanth @ 2018-06-20 19:47 UTC (permalink / raw)
  To: spdk

[-- Attachment #1: Type: text/plain, Size: 5938 bytes --]

Hello,

First revision of changes to extend the lifecycle of bdev_io are available for review. I would like to solicit your input on the proposed API/Code flow changes.

https://review.gerrithub.io/c/spdk/spdk/+/415860

Thanks,
Srikanth


From: "Kaligotla, Srikanth" <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Date: Friday, May 11, 2018 at 2:27 PM
To: "Walker, Benjamin" <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>, "Harris, James R" <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>
Cc: "raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>" <raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>>, "Meneghini, John" <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>, "Rodriguez, Edwin" <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>, "Pai, Madhu" <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>, "NGC-john.barnard-broadcom.com" <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>, "spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>" <spdk(a)lists.01.org<mailto:spdk(a)lists.01.org>>
Subject: RE: BDEV-IO Lifecycle - Need your input.

CC: List

Hi Ben,

Your proposal<https://review.gerrithub.io/#/c/spdk/spdk/+/386166/> to interface with the backend to acquire and release buffers is good. You have accurately stated that the challenge is in developing intuitive semantics. And that has been my struggle. To me, there are two problem statements;


  1.  It is expected that bdev_io pool is sized correctly so the call to get bdev_io succeeds. Failure to acquire bdev_io will result in DEVICE-ERROR. The transport is already capable of handling temporary memory failures by moving the request to PENDING-Queue. Hence the proposal to change the bdev_io lifecycle and perhaps connect it with the spdk_nvmf_request object. Thus all buffer needs are addressed at the beginning of I/O request.
  2.  I/O Data buffers are sourced and managed by the backend. One of the challenges I see with your proposed interface is the lack of details like if resource is being acquired for a READ operation or WRITE operation. The handling is quite different in each case. Since bdev_io->type is overloaded the type of I/O operation is lost. I suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the bdev_io should be present to RELEASE the buffer. Zero copy semantics warrants that data buffer stays until controller-to-host has occurred. In other words, bdev_io lives till REQUEST come to COMPLETE state.

What are your thoughts in introducing spdk_bdev_init() and spdk_bdev_fini() as an alternative approach to extend the lifecyle of bdev_io and allow data buffer management via bdev fn_table ?

I hope I’m making sense…

Thanks,
Srikanth

From: Walker, Benjamin <benjamin.walker(a)intel.com<mailto:benjamin.walker(a)intel.com>>
Sent: Friday, May 11, 2018 12:28 PM
To: Harris, James R <james.r.harris(a)intel.com<mailto:james.r.harris(a)intel.com>>; Kaligotla, Srikanth <Srikanth.Kaligotla(a)netapp.com<mailto:Srikanth.Kaligotla(a)netapp.com>>
Cc: raju.gottumukkala(a)broadcom.com<mailto:raju.gottumukkala(a)broadcom.com>; Meneghini, John <John.Meneghini(a)netapp.com<mailto:John.Meneghini(a)netapp.com>>; Rodriguez, Edwin <Ed.Rodriguez(a)netapp.com<mailto:Ed.Rodriguez(a)netapp.com>>; Pai, Madhu <Madhusudan.Pai(a)netapp.com<mailto:Madhusudan.Pai(a)netapp.com>>; NGC-john.barnard-broadcom.com <john.barnard(a)broadcom.com<mailto:john.barnard(a)broadcom.com>>
Subject: Re: BDEV-IO Lifecycle - Need your input.

Hi Srikanth,

Yes - we'll need to introduce some way to acquire and release buffers from the bdev layer earlier on in the state machine that processes an NVMe-oF request. I've had this patch out for review for several months as a proposal for this scenario:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

It doesn't pass the tests - it's just a proposal for the interface. 90% of the challenge here is in developing intuitive semantics.

Thanks,
Ben

P.S. This is the kind of discussion that would fit perfectly on the mailing list.

On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
Hi Ben, Hi James,

I would like to solicit opinions on the lifecycle for bdev-io resource object. Attached is the image of RDMA state machine in its current implementation. When the REQUEST enters the NEED-BUFFER state, the buffers necessary for carrying out I/O operation are allocated/acquired from the memory pool. An instance of BDEV-IO comes into existence after REQUEST reaches READY-TO-EXECUTE state. The BDEV-IO is teared down as soon the backend returns. From a BDEV perspective, BDEV-IO is simply a translation unit that facilitates I/O buffers from the backend. The driver context embedded within bdev_io holds great deal of information pertaining to the I/O under execution. It assists in error handling, dereferencing the buffers upon I/O completion and in abort handling. In summary, the bdev_io stays alive until request has come to COMPLETE state. I’d like to hear peoples thoughts in introducing the plumbing to acquire BDEV-IO resource in REQUEST-NEED-BUFFER state and release it in REQUEST-COMPLETE state.  I will shortly have patch available for review that introduces spdk_bdev_init and spdk_bdev_fini which in turn invokes the corresponding bdev fn_table to  initialize/cleanup.

I wanted to use this email to communicate our intent and solicit your feedback. We have a working implementation of the above proposal and prior to pushing it upstream for review would like to hear your thoughts. These proposed changes to upstream are a result of FC transport work in collaboration with the Broadcom team who are also copied to this mail. Myself and John will be at SPDK Dev conference and if required we can elaborate further on this proposal.

Thanks,
Srikanth


[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 13804 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-07-09 19:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-11 18:27 [SPDK] BDEV-IO Lifecycle - Need your input Kaligotla, Srikanth
2018-06-20 19:47 Kaligotla, Srikanth
2018-06-21 18:20 Harris, James R
2018-06-22  0:01 Kaligotla, Srikanth
2018-06-22  1:11 Harris, James R
2018-07-06 16:41 Walker, Benjamin
2018-07-06 16:46 Meneghini, John
2018-07-09 10:34 Popuri, Sriram
2018-07-09 15:30 Harris, James R
2018-07-09 16:18 Meneghini, John
2018-07-09 17:52 Walker, Benjamin
2018-07-09 19:19 Popuri, Sriram

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.