From: Harris, James R
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
Date: Thu, 12 Jul 2018 05:42:24 +0000
To: spdk@lists.01.org

Personally, I do not think they suck. I think they are great.

Seriously though – this API is inconsistent with the rest of the bdev API. Nowhere else does a bdev I/O submit function return a bdev_io and then require the caller to submit the bdev_io.

I also think it's a good idea to differentiate between READ (where the caller provides the buffers) and ZCOPY (where the bdev module provides the buffers).

Earlier, you and Srikanth had expressed concern about the global bdev_io pool allocated inside of the bdev layer. You wanted control over that, so a mempool doesn't get allocated if it's not going to be used. We can modify the bdev layer to allow specifying an external allocator. We may even require it, and then move the current bdev_io mempool out of bdev.c and into the application initialization code.

Based on the above, are there any functional reasons why the zcopy APIs wouldn't work for the SPDK NVMe-oF target?

Thanks,

-Jim

On 7/11/18, 9:30 PM, "SPDK on behalf of Meneghini, John" wrote:

    I'm sorry Ben, but I think the spdk_bdev_zcopy_start() and spdk_bdev_zcopy_end() APIs suck, and I don't want to use them in our BDEV.
    Here are some APIs that I think could possibly work:

    struct spdk_bdev_io *spdk_bdev_dma_start(struct spdk_bdev_desc *desc,
                    struct spdk_io_channel *ch,
                    uint64_t offset,     /* Make this bytes instead of blocks because we may need to DMA something other than blocks */
                    uint64_t length,     /* length in bytes of whole transfer */
                    uint32_t block_size, /* block size in bytes. When set to 1 we are transferring bytes. */
                    enum spdk_bdev_io_type io_type); /* Everything else I need to know... LIVE, COMMIT, POPULATE, etc. is known by the io_type */

    void spdk_bdev_dma_end(struct spdk_bdev_io *bdev_io); /* Run down the bdev_io and clean up any iovec or DMA resources */

    int spdk_bdev_io_submit(struct spdk_bdev_io *bdev_io);

    These APIs belong in the bdev->fn_table, and the flow through the transport would be:

    spdk_nvmf_ctrlr_process_io_cmd
        nvmf_bdev_ctrlr_write_cmd
            bdev_io = spdk_bdev_dma_start()
            spdk_bdev_io_submit(bdev_io)

    Etc.

    This is the direction Srikanth was heading and I'd like to understand why that direction does not work.

    /John

    On 7/11/18, 9:17 PM, "Luse, Paul E" wrote:

        This is a great explanation Ben, thanks. I'll start looking through the patches as well. A few quick questions for additional context and to make sure I understand your terminology:

        * Can you expand on "The populate flag determines whether the 'live' data is present in the given buffer" maybe with an example or something (what do you mean by "live", you use it a few times)
        * You mention "I believe I've addressed all known use cases with this design" but this email chain, at least, only covers the NVMe-oF use case. Can you list the known cases you mention and maybe just say a few words about them?
        Thanks Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Wednesday, July 11, 2018 2:53 PM
To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)

I've done a bunch of work here to formalize the zero copy API in the bdev layer. The patch series begins here:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

There has been some good discussion on the patches themselves, but I wanted to circle back to the mailing list for a bit. The basic idea is to add two new functions to the bdev layer:

int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
                          struct spdk_io_channel *ch,
                          uint64_t offset_blocks, uint64_t num_blocks,
                          bool populate,
                          spdk_bdev_io_completion_cb cb, void *cb_arg);
int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
                        spdk_bdev_io_completion_cb cb, void *cb_arg);

The zcopy start call makes a request to the backing block device to provide a region of memory suitable for performing a zero copy update to the region described by the offset_blocks and num_blocks parameters. The populate flag determines whether the "live" data is present in the given buffer. When this request completes, it returns a bdev_io in the completion callback. That bdev_io will have a scatter gather list filled out describing the memory region.

When the user has finished transferring data into or out of the data buffer, the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The commit flag indicates whether the data in the buffer needs to be made "live".
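[The populate/commit contract described above can be sketched as a toy model. Everything below is a stand-in written for illustration only, not real SPDK code; the names mirror the proposed API, but only the call sequence (start, fill the returned buffer, end with commit) reflects the description in this email.]

```c
/* Toy model of the zcopy_start/zcopy_end contract. All types and
 * functions are illustrative stand-ins, not SPDK implementations. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_bdev_io {
    uint8_t *buf; /* region handed out by the "bdev module" */
    uint64_t len; /* length of the region in bytes */
};

static uint8_t g_backing[4096]; /* pretend block device storage */

/* Model of zcopy start: hand the caller a buffer describing the region.
 * populate=true copies the "live" device data into the buffer. */
struct toy_bdev_io *toy_zcopy_start(uint64_t offset, uint64_t len, bool populate)
{
    struct toy_bdev_io *io = malloc(sizeof(*io));
    io->buf = malloc(len);
    io->len = len;
    if (populate) {
        memcpy(io->buf, g_backing + offset, len); /* expose live data */
    }
    return io;
}

/* Model of zcopy end: commit=true makes the buffer contents "live" by
 * writing them back; then the bdev_io and its buffer are released. */
void toy_zcopy_end(struct toy_bdev_io *io, uint64_t offset, bool commit)
{
    if (commit) {
        memcpy(g_backing + offset, io->buf, io->len);
    }
    free(io->buf);
    free(io);
}

/* A write-style sequence: obtain a buffer (populate=false), fill it as
 * the transport would via RDMA, then commit. Returns the first byte
 * that landed in the backing store. */
uint8_t toy_zcopy_write_demo(void)
{
    struct toy_bdev_io *io = toy_zcopy_start(0, 8, false);
    memset(io->buf, 0xAB, io->len); /* stands in for the RDMA transfer */
    toy_zcopy_end(io, 0, true);
    return g_backing[0];
}
```

[Note that the real proposal carries the offset inside the bdev_io rather than passing it to the end call; the extra parameter here is purely a simplification of the toy.]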
The NVMe-oF target can take advantage of these new APIs by altering the state machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers after parsing the command, then performing RDMA transfers into or out of the provided region, and finally completing the command with spdk_bdev_zcopy_end. The work to do those modifications has not been done yet, and I expect them to be a significant amount of effort.

As many bdev modules as possible should attempt to support zero copy semantics, but for those bdev modules that aren't able to, one of the patches in my series will automatically emulate the zcopy commands using regular reads and writes. This allows the higher level users of the bdev API to always make the zcopy calls for simplicity. The emulation path should be no worse, in terms of performance, than doing a regular read or write.

I believe I've addressed all known use cases with this design, and while there will certainly be some future patches to fix problems that come up, I'm confident that we're on the right track here. If there is any feedback in this area at all, now is the time to provide it.

I think this also leads into some really awesome future work that I'll suggest in the hope that someone in the community is inspired. One idea is that the bdev_nvme module could support zero copy semantics by returning a pointer to a portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for data. There aren't many devices for sale today that can do this, but it's in the spec and I imagine they're coming. This would allow the NVMe-oF target, for example, to perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the persistent memory could be returned.
The NVMe-oF target could then perform RDMA transfers into or out of the persistent memory directly, and the zcopy end call with commit set to true would perform the required flush instructions.

Thanks,
Ben

On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> I would like to jot down what requirements we are looking at. Wherever
> there is "====wait====" for something, there is a potential context switch.
>
> For a write request, the following is the workflow for zero copy:
> 1) Host submits Write request
> 2) Protocol validation of request
> 3) Get iovs
> 4) R2T/RDMA_READ
> ====wait for host====
> 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 6) Perform Write (spdk_bdev_writev_blocks) to bdev_module
> ====wait for bdev module to respond====
> 7) on bdev_module response, send rsp/status to host
> 8) free req resources
>
> Read path:
> 1) Host submits Read request
> 2) Protocol validation of request
> 3) Submit read to bdev_module
> ====wait for data blocks from bdev_module====
> 4) respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) host ack/RDMA_WRITE_COMPLETE
> 6) free req resources
>
> In future we will have Persistent Reservations, where there can be
> read/modify/update cases depending on how the bdev module handles it.
> So the zero copy APIs need to address future requirements as well.
>
> The above workflows suggest we require a state to track at what step
> the request is. A state will help to clean up resources properly when
> handling aborts, disconnects, namespace destroy workflows, etc.
> So the requirement is to figure out a way to clean up resources
> according to the state of the bdev_io in its life cycle, and stitch
> the zero copy workflows into existing SPDK code which assumes buffers are pre-allocated.
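[The write-path workflow listed above is essentially a small state machine, where each "====wait====" marks a point at which control leaves the state machine until a completion callback re-enters it. The following is an illustrative sketch only; the state names and functions are invented for this example and do not exist in SPDK.]

```c
/* Toy state machine for the zero-copy write workflow above.
 * All names are illustrative, not SPDK identifiers. */
enum req_state {
    REQ_NEW,          /* 1) host submits write */
    REQ_VALIDATED,    /* 2) protocol validation */
    REQ_NEED_BUFFER,  /* 3) get iovs */
    REQ_TRANSFERRING, /* 4-5) R2T/RDMA_READ, waiting for host data */
    REQ_EXECUTING,    /* 6) writev submitted, waiting for bdev module */
    REQ_COMPLETE      /* 7-8) response sent, resources freed */
};

/* Advance one step; in a real target, the transitions into
 * REQ_TRANSFERRING and REQ_EXECUTING would return to the event loop
 * and resume from a completion callback. */
enum req_state req_write_step(enum req_state s)
{
    switch (s) {
    case REQ_NEW:          return REQ_VALIDATED;
    case REQ_VALIDATED:    return REQ_NEED_BUFFER;
    case REQ_NEED_BUFFER:  return REQ_TRANSFERRING;
    case REQ_TRANSFERRING: return REQ_EXECUTING;
    default:               return REQ_COMPLETE;
    }
}

/* Drive a request from NEW to COMPLETE, counting the wait points
 * (potential context switches) named in the workflow. */
int req_write_wait_points(void)
{
    enum req_state s = REQ_NEW;
    int waits = 0;

    while (s != REQ_COMPLETE) {
        enum req_state next = req_write_step(s);
        if (next == REQ_TRANSFERRING || next == REQ_EXECUTING) {
            waits++; /* ====wait for host==== / ====wait for bdev module==== */
        }
        s = next;
    }
    return waits;
}
```

[Tracking the request through explicit states like these is what makes cleanup on aborts, disconnects, and namespace destroy tractable, which is the point being argued above.]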
> I think Srikanth's proposed changes are made in the target layer to
> populate and commit to the bdev module. However, his suggestion changes
> the bdev_io life cycle, which I understand from John is not agreeable to others.
>
> With the new zcopy APIs also there is no clarity on the bdev_io life
> cycle. A bdev_io is acquired during zcopy_start and it's not clear how
> to deal with the bdev_io allocated in read/write paths.
>
> The orchestration of zero copy can either be done at the transport layer,
> the target layer or the bdev module layer.
>
> We don't like the orchestration done from the transport layer because:
> 1) There is no clear way to deal with a bdev_io acquired from zcopy and a
> bdev_io acquired from the read/write codepaths.
> 2) If I understand correctly, by the time a read/write request is
> submitted to the target, the data_out would have happened. So the protocol
> validation happens after the data_out phase. For example an LBA out of
> range request or invalid SGL length is returned after a data_out is
> performed. We want to run through the protocol validation before the data_out phase.
>
> I feel the bdev module is the ideal place to deal with zero copy
> buffers. The bdev module knows how to get buffers and when to claim
> them. All it needs is a proper cleanup, and by the time a request hits
> the bdev module, a bdev_io is acquired, so there is no need to change the bdev_io life cycle or introduce a new I/O type.
>
> I propose the following:
> 1) Host submits Write request
> 2) Protocol validation of request
> 3) spdk_nvmf_request_exec
> 4) submit request to bdev module
>    spdk_bdev_writev_blocks(). bdev_io is acquired
> 5) bdev module determines the buffers need to be allocated.
> Will perform something like what is proposed in the patch:
>    spdk_bdev_zcopy_start(populate=false)
> 6) New spdk_nvmf_request_exec_status
>    (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> 7) Transport does R2T/RDMA_READ
> ====wait for host====
> 8) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 9) Perform Write again (spdk_nvmf_request_exec/spdk_bdev_writev_blocks)
> to bdev_module.
> ====wait for bdev module to respond====
> 10) on bdev_module response, send rsp/status to host
> 11) free req resources
>    free bdev_io via spdk_bdev_free_io.
>    Here we need a hook to free resources allocated by the bdev module:
>    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
>
> Read path:
> 1) Host submits Read request
> 2) Protocol validation of request
> 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
> ====wait for data blocks from bdev_module====
> 4) respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) host ack/RDMA_WRITE_COMPLETE
> 6) free req resources
>    free bdev_io via spdk_bdev_free_io.
>    Here we need a hook to free resources allocated by the bdev module:
>    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
>
> If it makes sense we can work on a patch.
>
> Regards,
> ~Sriram
>
> -----Original Message-----
> From: Walker, Benjamin
> Sent: Monday, July 9, 2018 11:22 PM
> To: Meneghini, John; Popuri, Sriram; Kaligotla, Srikanth
> Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin; spdk(a)lists.01.org; Pai, Madhu; NGC-john.barnard-broadcom.com
> Subject: Re: BDEV-IO Lifecycle - Need your input.
>
> On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > Hi Jim.
> > This is the patch series I believe Ben is proposing to replace Srikanth's:
> > https://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> >
> > pick a21bbcf2 nvmf: Move data buffer pool to generic layer
> > pick 09586e54 bdev: Add a zero copy I/O path
> > pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
> > pick 40a1f62b bdevperf: Use new zcopy API for reads
> > pick eba6f35a bdev: Emulate zero copy support when necessary
> >
> > Is this correct?
>
> My patch series doesn't entirely replace Srikanth's set of patches.
> There are two separate things necessary to implement zero copy. First
> is the infrastructure within the bdev layer to make requests of a bdev
> module to obtain or release a suitable buffer for a zero copy
> operation. Second is actually making the NVMe-oF target use that
> mechanism, which includes extending the bdev_io lifetime. My patch series only addresses the first part.
>
> Thanks,
> Ben
>
> > /John
> >
> > On 7/9/18, 11:31 AM, "Harris, James R" wrote:
> >
> >     Hi Sriram:
> >
> >     Ben has some later patches in this series with some examples:
> >
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements this zero copy API in the simple 'malloc' bdev module
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the zero copy API in the SPDK bdevperf utility
> >
> >     -Jim
> >
> >     On 7/9/18, 3:34 AM, "Popuri, Sriram" wrote:
> >
> >         Sorry, I was not focusing on this change. Just give me a day or two to get back.
> >         From a quick glance I didn't understand how zcopy start/end fits into a read/write life cycle. Is there an example of how zcopy start/end is consumed, or can someone give me a quick dump on how it's envisioned to be used?
> >         Regards,
> >         ~Sriram
> >
> >         -----Original Message-----
> >         From: Meneghini, John
> >         Sent: Friday, July 6, 2018 10:16 PM
> >         To: Walker, Benjamin; Verkamp, Daniel; Harris, James R; Kaligotla, Srikanth
> >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez, Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; Popuri, Sriram
> >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> >
> >         I'm adding Sriram to this thread. Sriram is another NetApp engineer who is working on this stuff internally.
> >
> >         /John
> >
> >         On 7/6/18, 12:41 PM, "Walker, Benjamin" wrote:
> >
> >             Hi Srikanth,
> >
> >             I wanted to check in to see if you had any additional feedback on the zero copy operations in the bdev layer here:
> >
> >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> >
> >             This does not address extending the lifetime of the bdev_io in the NVMe-oF target, but I think this is looking like the right mechanism for the bdev layer.
> >
> >             Thanks,
> >             Ben
> >
> >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> >             > Thanks Srikanth. Sounds like the spdk_bdev_io pool sizing along with
> >             > spdk_bdev_queue_io_wait() meets your needs then regarding the
> >             > spdk_bdev_io memory.
> >             >
> >             > Regarding zcopy_start/zcopy_end – it looks like you've already added
> >             > a bunch of comments to Ben's patch on GerritHub. For now I'd say
> >             > let's continue our discussion there. I've responded to a couple of
> >             > similar questions there and I'm sure Ben will have more replies tomorrow.
> > > -Jim
> > >
> > > From: "Kaligotla, Srikanth"
> > > Date: Thursday, June 21, 2018 at 5:01 PM
> > > To: James Harris; Walker, Benjamin; Daniel Verkamp
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >
> > > Hi Jim,
> > >
> > > I wish I had joined the community meeting; I also missed the
> > > spdk_bdev_queue_io_wait().
> > >
> > > So, there are 2 issues I am intending to solve:
> > >
> > > 1. We want to ensure that an instance of bdev-io is acquired prior to or
> > > along with the I/O data buffer. This allows for better error handling when
> > > the bdev_io pool is exhausted. Yes, we can solve it via sizing the pool
> > > right. The queue-io-wait changes also address the problem.
> > > 2. Most importantly, by extending the life of bdev-io (same life span as
> > > the nvmf_request structure), the abort use case and other use cases that
> > > involve cleaning up the I/O data buffer are accomplished effectively. Let
> > > me elaborate: bdev-io is the fabric that connects the nvme command and
> > > operation with the backend. The driver context and data buffer context are
> > > stored in bdev-io. The freeing up of bdev_io resources is pushed to the
> > > end so the I/O cleanup can happen after the controller has transmitted the
> > > data to the host. In the absence of bdev_io, we would end up adding more
> > > and more void context in the request structure. Hence the push to extend
> > > the life of bdev_io.
> > > I see Ben's recent patch "zcopy_start" and "zcopy_end"; we can make that
> > > work as long as the bdev-io allocated/acquired stays till the end. One of
> > > the challenges I see with that patch is defining a callback for the I/O
> > > submission. For instance, zcopy_start will allocate a bdev_io and submit
> > > the I/O to the bdev device. The underlying implementation can be
> > > synchronous or asynchronous. The callback for this submission should check
> > > to see if it is BUFFER-READY or PENDING and accordingly relay it back to
> > > the transport. The next phase is the actual I/O submission. Let's say it
> > > reuses the bdev-io obtained in ZCOPY-START; now the callback should
> > > determine if it is a success or failure. Finally, when ZCOPY-END is
> > > invoked, the supplied bdev_io will have all the necessary data to release
> > > the WRITE buffer or unlock the read buffer based on the operation
> > > performed.
> > >
> > > I hope I'm making sense. I guess my effort is to extend the lifespan of
> > > bdev-io and let it continue to host the driver context and buffer context
> > > populated during the BEGIN phase.
> > >
> > > Regards,
> > > Srikanth
> > >
> > > From: Harris, James R
> > > Sent: Thursday, June 21, 2018 2:21 PM
> > > To: Kaligotla, Srikanth; Walker, Benjamin; Verkamp, Daniel
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >
> > > Hi Srikanth,
> > >
> > > Following up on this thread and the discussion in yesterday's community
> > > meeting.
> > > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application to
> > > work better in general when the spdk_bdev_io buffer pool is exhausted. We
> > > still need to make changes to the NVMe-oF target to use this new API, but
> > > that should be done soon-ish.
> > >
> > > With that in place, the spdk_bdev_io buffer pool itself can be configured
> > > up or down when the application starts. Currently the default is 64K
> > > spdk_bdev_io buffers. sizeof(struct spdk_bdev_io) == 216, plus the per-IO
> > > context size allocated for the bdev module. This can be up to 192 bytes
> > > (virtio bdev module) but is likely much smaller for you, depending on the
> > > context size for your ontap bdev module.
> > >
> > > Let's assume your per-IO context size is 64 bytes. 64K x (192 + 64) = 16MB.
> > >
> > > I'm not sure how many spdk_bdev_io you need in flight at any given time.
> > > 64K seems like a lot, but I'd be curious to hear your thoughts on this. If
> > > this is the right number, then worst case, there would be about 16MB of
> > > DRAM that would sit unused if the NVMe-oF target in your system was not
> > > active. Is that too burdensome for your application?
> > >
> > > Thanks,
> > >
> > > -Jim
> > >
> > > From: "Kaligotla, Srikanth"
> > > Date: Wednesday, June 20, 2018 at 12:47 PM
> > > To: Walker, Benjamin; James Harris; Daniel Verkamp
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: RE: BDEV-IO Lifecycle - Need your input.
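[The pool-sizing arithmetic quoted above is just pool size times per-entry cost. A small helper spells it out; the function name and the 64K/192-byte/64-byte figures used in the check are taken from the email, everything else is illustrative.]

```c
/* Worst-case memory footprint of a bdev_io pool: each entry costs the
 * base struct size plus the bdev module's per-IO context. Illustrative
 * helper, not an SPDK function. */
#include <stdint.h>

uint64_t bdev_io_pool_bytes(uint64_t pool_size,
                            uint64_t struct_bytes,
                            uint64_t per_io_ctx_bytes)
{
    return pool_size * (struct_bytes + per_io_ctx_bytes);
}
/* With the email's figures: 64K entries x (192 + 64) bytes = 16 MiB. */
```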
> > > Hello,
> > >
> > > The first revision of changes to extend the lifecycle of bdev_io is
> > > available for review. I would like to solicit your input on the proposed
> > > API/code flow changes.
> > >
> > > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > >
> > > Thanks,
> > > Srikanth
> > >
> > > From: "Kaligotla, Srikanth"
> > > Date: Friday, May 11, 2018 at 2:27 PM
> > > To: Walker, Benjamin; Harris, James R
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > >
> > > CC: List
> > >
> > > Hi Ben,
> > >
> > > Your proposal to interface with the backend to acquire and release buffers
> > > is good. You have accurately stated that the challenge is in developing
> > > intuitive semantics. And that has been my struggle. To me, there are two
> > > problem statements:
> > >
> > > 1. It is expected that the bdev_io pool is sized correctly so the call to
> > > get a bdev_io succeeds. Failure to acquire a bdev_io will result in
> > > DEVICE-ERROR. The transport is already capable of handling temporary
> > > memory failures by moving the request to the PENDING queue. Hence the
> > > proposal to change the bdev_io lifecycle and perhaps connect it with the
> > > spdk_nvmf_request object. Thus all buffer needs are addressed at the
> > > beginning of the I/O request.
> > > 2. I/O data buffers are sourced and managed by the backend. One of the
> > > challenges I see with your proposed interface is the lack of details like
> > > whether the resource is being acquired for a READ operation or a WRITE
> > > operation. The handling is quite different in each case.
> > > Since bdev_io->type is overloaded, the type of I/O operation is lost. I
> > > suppose one can cast the cb_arg (nvmf_request) and then proceed. Also, the
> > > bdev_io should be present to RELEASE the buffer. Zero copy semantics
> > > warrant that the data buffer stays until controller-to-host has occurred.
> > > In other words, bdev_io lives till the REQUEST comes to COMPLETE state.
> > >
> > > What are your thoughts on introducing spdk_bdev_init() and
> > > spdk_bdev_fini() as an alternative approach to extend the lifecycle of
> > > bdev_io and allow data buffer management via the bdev fn_table?
> > >
> > > I hope I'm making sense…
> > >
> > > Thanks,
> > > Srikanth
> > >
> > > From: Walker, Benjamin
> > > Sent: Friday, May 11, 2018 12:28 PM
> > > To: Harris, James R; Kaligotla, Srikanth
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >
> > > Hi Srikanth,
> > >
> > > Yes - we'll need to introduce some way to acquire and release buffers from
> > > the bdev layer earlier on in the state machine that processes an NVMe-oF
> > > request. I've had this patch out for review for several months as a
> > > proposal for this scenario:
> > >
> > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >
> > > It doesn't pass the tests - it's just a proposal for the interface. 90% of
> > > the challenge here is in developing intuitive semantics.
> > >
> > > Thanks,
> > > Ben
> > >
> > > P.S. This is the kind of discussion that would fit perfectly on the
> > > mailing list.
> > > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> > > > Hi Ben, Hi James,
> > > >
> > > > I would like to solicit opinions on the lifecycle of the bdev-io
> > > > resource object. Attached is an image of the RDMA state machine in its
> > > > current implementation. When the REQUEST enters the NEED-BUFFER state,
> > > > the buffers necessary for carrying out the I/O operation are
> > > > allocated/acquired from the memory pool. An instance of BDEV-IO comes
> > > > into existence after the REQUEST reaches the READY-TO-EXECUTE state. The
> > > > BDEV-IO is torn down as soon as the backend returns. From a BDEV
> > > > perspective, BDEV-IO is simply a translation unit that facilitates I/O
> > > > buffers from the backend. The driver context embedded within bdev_io
> > > > holds a great deal of information pertaining to the I/O under execution.
> > > > It assists in error handling, dereferencing the buffers upon I/O
> > > > completion, and in abort handling. In summary, the bdev_io stays alive
> > > > until the request has come to COMPLETE state. I'd like to hear people's
> > > > thoughts on introducing the plumbing to acquire the BDEV-IO resource in
> > > > REQUEST-NEED-BUFFER state and release it in REQUEST-COMPLETE state. I
> > > > will shortly have a patch available for review that introduces
> > > > spdk_bdev_init and spdk_bdev_fini, which in turn invoke the
> > > > corresponding bdev fn_table to initialize/cleanup.
> > > >
> > > > I wanted to use this email to communicate our intent and solicit your
> > > > feedback. We have a working implementation of the above proposal and
> > > > prior to pushing it upstream for review would like to hear your
> > > > thoughts. These proposed changes to upstream are a result of FC
> > > > transport work in collaboration with the Broadcom team, who are also
> > > > copied on this mail.
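[The lifetime extension argued for above (acquire the bdev_io when the request enters NEED-BUFFER, release it only at COMPLETE, so an abort arriving mid-flight can still reach the driver context) can be modeled in a few lines. All names below are invented for illustration; none are SPDK APIs.]

```c
/* Toy model of the extended bdev_io lifetime proposal. Illustrative
 * types only, not SPDK code. */
#include <stdbool.h>
#include <stdlib.h>

struct toy_io_ctx { int driver_ctx; };

struct toy_request {
    struct toy_io_ctx *io; /* lives from NEED-BUFFER to COMPLETE */
};

/* REQUEST-NEED-BUFFER: acquire the io object together with the buffer. */
void toy_request_need_buffer(struct toy_request *req)
{
    req->io = malloc(sizeof(*req->io));
    req->io->driver_ctx = 42; /* stands in for the module's per-IO context */
}

/* An abort arriving mid-flight can still dereference the driver context,
 * because the io object is not torn down when the backend returns. */
bool toy_request_abort_has_ctx(const struct toy_request *req)
{
    return req->io != NULL;
}

/* REQUEST-COMPLETE: only here is the io object finally released. */
void toy_request_complete(struct toy_request *req)
{
    free(req->io);
    req->io = NULL;
}
```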
> > > > Myself and John will be at the SPDK Dev conference, and if required we
> > > > can elaborate further on this proposal.
> > > >
> > > > Thanks,
> > > > Srikanth

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk