From: Walker, Benjamin
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
Date: Thu, 12 Jul 2018 20:44:23 +0000
In-Reply-To: 82C9F782B054C94B9FC04A331649C77AAD4C1354@fmsmsx104.amr.corp.intel.com
To: spdk@lists.01.org

On Thu, 2018-07-12 at 01:16 +0000, Luse, Paul E wrote:
> This is a great explanation Ben, thanks. I'll start looking through the
> patches as well. A few quick questions for additional context and to make
> sure I understand your terminology:
>
> * Can you expand on "The populate flag determines whether the 'live' data is
> present in the given buffer" maybe with an example or something (what do you
> mean by "live", you use it a few times)

I struggled to find the words to explain this. When you do a zcopy_start, you
are returned some region(s) of memory. If populate is false, that buffer
doesn't contain any valid data. If populate is true, the buffer contains the
data on the disk for the described region of blocks. Similarly with commit -
if you memcpy into a buffer provided from zcopy_start, that doesn't necessarily
become the data on the block device (i.e. visible from a separate read
operation) until the data is committed. Obviously, block device modules may
have wildly varying internal implementations. For example, the malloc bdev
module just allocates a single region of memory representing the block device
and returns pointers into it, so the populate and commit flags don't do
anything at all. But the malloc bdev is a special case - the norm is that some
portion of the disk will be cached in DRAM and the real data is stored on some
persistent device. Obtaining a zero copy buffer may or may not have to page
data into DRAM, and commit may or may not have to page data out.

> * You mention "I believe I've addressed all known use cases with this design"
> but in this email chain, at least, only cover the NVMe-oF use case. Can you
> list the known cases you mention and maybe just say a few words about them?
>

I'm glad you asked because it needs to be explained, but I'm also not glad you
asked because now I have to type out the giant explanation! There's a matrix of
different things that need to be supported.

On the bdev module side, we want to support a number of different
implementations:

1) All data exists in addressable/DMA-able memory (equivalent to the malloc
bdev module).
2) Some data exists in addressable/DMA-able memory (a caching bdev, or the nvme
bdev module with a controller memory buffer).

Further, within those two choices, there are two other options:

1) When updating the data in the buffer returned by zcopy_start, concurrent
reads will immediately see the updates. This is what I was trying to say by
"live" elsewhere. This is how the malloc bdev module works, and it probably
isn't a great bdev module design because it can't make any atomicity
guarantees.
2) When updating data in the buffer returned by zcopy_start, future reads for
the given blocks won't see the data until the buffer has been committed. We
don't have any bdev modules that actually work like this today, but it's how I
imagine a more robust caching bdev module would do it.
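To make that concrete, here is a minimal sketch of a zero copy write from the
consumer's side using the two proposed calls. Error handling is omitted, the
helper names are made up for the example, and it assumes the returned bdev_io
exposes its scatter gather list through the usual u.bdev.iovs/iovcnt fields:

#include <string.h>

#include "spdk/bdev.h"

static void
write_done_cb(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    /* The zero copy operation is finished; release the bdev_io. */
    spdk_bdev_free_io(bdev_io);
}

static void
write_buf_ready_cb(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    struct iovec *iovs = bdev_io->u.bdev.iovs;
    int iovcnt = bdev_io->u.bdev.iovcnt;
    const uint8_t *src = cb_arg;
    int i;

    /* Fill the buffer(s) the module handed back. Depending on the module,
     * concurrent readers may or may not see this data before the commit. */
    for (i = 0; i < iovcnt; i++) {
        memcpy(iovs[i].iov_base, src, iovs[i].iov_len);
        src += iovs[i].iov_len;
    }

    /* Make the data "live" on the block device. */
    spdk_bdev_zcopy_end(bdev_io, true /* commit */, write_done_cb, NULL);
}

/* populate=false: we overwrite the whole region, so there is no reason to
 * page the existing data into the buffer first. */
static int
zcopy_write_example(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                    uint64_t offset_blocks, uint64_t num_blocks, void *src)
{
    return spdk_bdev_zcopy_start(desc, ch, offset_blocks, num_blocks,
                                 false /* populate */,
                                 write_buf_ready_cb, src);
}

A zero copy read is the mirror image: pass populate=true so the buffer comes
back containing the live data, transfer it out (send(), RDMA, DMA to an HBA),
and then call spdk_bdev_zcopy_end with commit=false.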
On the network side we also have to support four types of network interfaces:

1) TCP/IP via sockets (iSCSI, NVMe-oF). For a read (which sends data over the
network on the target side), we want to be able to ask the bdev module to
provide a buffer that holds the data, then we use that to do a send() on the
socket. That's the minimum number of copies. For a write (which recvs data over
the network) we don't use zero copy semantics because the data has to be
gathered into a buffer in user space anyway.
2) RDMA via ibverbs (NVMe-oF). For all paths, we want to ask the bdev to
provide us a buffer describing the region on the disk and then we do an RDMA
transfer into or out of it.
3) FC. For both paths, we want to ask the bdev to provide a buffer describing
the region on the disk and then we do a DMA to/from the FC HBA.
4) virtio (vhost). The data is already in DMA-safe memory, so we don't benefit
from zero copy at all.

So if you take all of these constraints and you work out an API, you basically
end up with what I proposed.

> Thanks
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Wednesday, July 11, 2018 2:53 PM
> To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com;
> Srikanth.Kaligotla(a)netapp.com
> Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
> Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
>
> I've done a bunch of work here to formalize the zero copy API in the bdev
> layer. The patch series begins here:
>
> https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>
> There has been some good discussion on the patches themselves, but I wanted to
> circle back to the mailing list for a bit. The basic idea is to add two new
> functions to the bdev layer:
>
> int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
>                           struct spdk_io_channel *ch,
>                           uint64_t offset_blocks, uint64_t num_blocks,
>                           bool populate,
>                           spdk_bdev_io_completion_cb cb, void *cb_arg);
> int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
>                         spdk_bdev_io_completion_cb cb, void *cb_arg);
>
> The zcopy start call makes a request to the backing block device to provide a
> region of memory suitable for performing a zero copy update to the region
> described by the offset_blocks and num_blocks parameters. The populate flag
> determines whether the "live" data is present in the given buffer. When this
> request completes, it returns a bdev_io in the completion callback. That
> bdev_io will have a scatter gather list filled out describing the memory
> region.
>
> When the user has finished transferring data into or out of the data buffer,
> the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The
> commit flag indicates whether the data in the buffer needs to be made "live".
>
> The NVMe-oF target can take advantage of these new APIs by altering the state
> machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers
> after parsing the command, then performing RDMA transfers into or out of the
> provided region, and finally completing the command with spdk_bdev_zcopy_end.
> The work to do those modifications has not been done yet, and I expect them to
> be a significant amount of effort.
>
> As many bdev modules as possible should attempt to support zero copy
> semantics, but for those bdev modules that aren't able to, one of the patches
> in my series will automatically emulate the zcopy commands using regular reads
> and writes. This allows the higher level users of the bdev API to always make
> the zcopy calls for simplicity. The emulation path should be no worse, in
> terms of performance, than doing a regular read or write.
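The emulation patch itself may do this differently; conceptually, though, a
bounce-buffer fallback looks something like the sketch below (none of these
names are from the patch): allocate a regular buffer on start, read the
existing data into it only when populate is requested, and write it back out on
end only when commit is requested.

#include <errno.h>

#include "spdk/bdev.h"
#include "spdk/env.h"

/* Hypothetical bounce-buffer state for a module without native zcopy support. */
struct emulated_zcopy {
    void *buf;
    uint64_t offset_blocks;
    uint64_t num_blocks;
    uint32_t block_size;
};

/* "zcopy start": hand out a plain buffer, filling it with a regular read
 * only if the caller asked for the live data (populate=true). */
static int
emulated_zcopy_start(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                     struct emulated_zcopy *z, bool populate,
                     spdk_bdev_io_completion_cb cb, void *cb_arg)
{
    z->buf = spdk_dma_malloc(z->num_blocks * z->block_size, z->block_size, NULL);
    if (z->buf == NULL) {
        return -ENOMEM;
    }
    if (populate) {
        return spdk_bdev_read_blocks(desc, ch, z->buf, z->offset_blocks,
                                     z->num_blocks, cb, cb_arg);
    }
    /* Nothing to page in; real code would still complete asynchronously. */
    return 0;
}

/* "zcopy end": push the buffer contents to the device only if commit=true. */
static int
emulated_zcopy_end(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                   struct emulated_zcopy *z, bool commit,
                   spdk_bdev_io_completion_cb cb, void *cb_arg)
{
    if (commit) {
        return spdk_bdev_write_blocks(desc, ch, z->buf, z->offset_blocks,
                                      z->num_blocks, cb, cb_arg);
    }
    /* Read-only usage: just free the bounce buffer. */
    spdk_dma_free(z->buf);
    return 0;
}

The point of doing this inside the bdev layer is that consumers can always call
the zcopy functions and never need to know whether the module supports them
natively.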
>
> I believe I've addressed all known use cases with this design, and while there
> will certainly be some future patches to fix problems that come up, I'm
> confident that we're on the right track here. If there is any feedback in this
> area at all, now is the time to provide it.
>
> I think this also leads into some really awesome future work that I'll suggest
> in the hope that someone in the community is inspired. One idea is that the
> bdev_nvme module could support zero copy semantics by returning a pointer to a
> portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for
> data. There aren't many devices for sale today that can do this, but it's in
> the spec and I imagine they're coming. This would allow the NVMe-oF target,
> for example, to perform RDMA transfers peer-to-peer between the NIC and the
> NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the
> persistent memory could be returned. The NVMe-oF target could then perform
> RDMA transfers into or out of the persistent memory directly, and the zcopy
> end call with commit set to true would perform the required flush
> instructions.
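For the pmem half of that idea, the commit step could be very small. A
hypothetical sketch, assuming the module hands out direct pointers into
persistent memory and uses libpmem's pmem_persist() for the flush (the real
bdev_pmem module does not work this way today):

#include <libpmem.h>
#include <stdint.h>

/* Hypothetical zcopy commit path for a pmem-backed bdev module: the buffer
 * returned by zcopy_start is a direct pointer into persistent memory, so
 * "commit" only needs to flush CPU caches for the dirtied range. */
static void
pmem_zcopy_commit(void *pmem_buf, uint64_t num_blocks, uint32_t block_size)
{
    /* pmem_persist() flushes the range from the CPU caches and issues the
     * fences required to make the stores durable. */
    pmem_persist(pmem_buf, num_blocks * block_size);
}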
>
> Thanks,
> Ben
>
>
> On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> > I would like to jot down what requirements we are looking at. Wherever
> > there is "====wait" for something, there is a potential context switch.
> >
> > For a write request, the following is the workflow for zero copy:
> > 1) Host submits Write request
> > 2) Protocol validation of request
> > 3) Get iovs
> > 4) R2T/RDMA_READ
> > ====wait for host====
> > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 6) Perform Write (spdk_bdev_writev_blocks) to bdev_module
> > ====wait for bdev module to respond====
> > 7) On bdev_module response, send rsp/status to host
> > 8) Free req resources
> >
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validation of request
> > 3) Submit read to bdev_module
> > ====wait for data blocks from bdev_module====
> > 4) Respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) Host ack/RDMA_WRITE_COMPLETE
> > 6) Free req resources
> >
> > In the future we will have Persistent Reservations, where there can be
> > read/modify/update cases depending on how the bdev module handles it.
> > So the zero copy APIs need to address future requirements as well.
> >
> > The above workflows suggest we require a state to track what step the
> > request is at. A state will help to clean up resources properly when
> > handling aborts, disconnects, namespace destroy workflows etc.
> > So the requirement is to figure out a way to clean up resources according
> > to the state of the bdev_io in its life cycle, and to stitch the zero copy
> > workflows into existing SPDK code, which assumes buffers are pre-allocated.
> >
> > I think Srikanth's proposed changes are made in the target layer to
> > populate and commit to the bdev module. However, his suggestion changes
> > the bdev_io life cycle, which I understand from John is not agreeable to
> > others.
> >
> > With the new zcopy APIs also there is no clarity on the bdev_io life
> > cycle. A bdev_io is acquired during zcopy_start and it's not clear how
> > to deal with the bdev_io allocated in the read/write paths.
> >
> > The orchestration of zero copy can either be done at the transport layer,
> > the target layer or the bdev module layer.
> >
> > We don't like the orchestration done from the transport layer because:
> > 1) There is no clear way to deal with the bdev_io acquired from zcopy and
> > the bdev_io acquired from the read/write code paths.
> > 2) If I understand correctly, by the time a read/write request is
> > submitted to the target, the data_out would have happened. So the protocol
> > validation happens after the data_out phase. For example, an LBA out of
> > range request or invalid SGL length is returned after a data_out is
> > performed. We want to run through the protocol validation before the
> > data_out phase.
> >
> > I feel the bdev module is the ideal place to deal with zero copy buffers.
> > The bdev module knows how to get buffers and when to claim them. All it
> > needs is a proper cleanup, and by the time the request hits the bdev
> > module, a bdev_io is acquired, so there is no need to change the bdev_io
> > life cycle or introduce a new I/O type.
> >
> > I propose the following write path:
> > 1) Host submits Write request
> > 2) Protocol validation of request (spdk_nvmf_request_exec)
> > 3) Submit request to bdev module:
> >    spdk_bdev_writev_blocks(); bdev_io is acquired
> > 4) bdev module determines the buffers need to be allocated.
> >    It will perform something like what is proposed in the patch:
> >    spdk_bdev_zcopy_start(populate=false)
> > 5) New spdk_nvmf_request_exec_status
> >    (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> > 6) Transport does R2T/RDMA_READ
> > ====wait for host====
> > 7) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 8) Perform Write again (spdk_nvmf_request_exec/spdk_bdev_writev_blocks)
> >    to bdev_module
> > ====wait for bdev module to respond====
> > 9) On bdev_module response, send rsp/status to host
> > 10) Free req resources
> >     Free bdev_io with spdk_bdev_free_io.
> >     Here we need a hook to free resources allocated by the bdev module:
> >     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> >
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validation of request
> > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
> > ====wait for data blocks from bdev_module====
> > 4) Respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) Host ack/RDMA_WRITE_COMPLETE
> > 6) Free req resources
> >    Free bdev_io with spdk_bdev_free_io.
> >    Here we need a hook to free resources allocated by the bdev module:
> >    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> >
> > If it makes sense we can work on a patch.
> >
> > Regards,
> > ~Sriram
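To illustrate the cleanup hook mentioned in the flows above, here is one
hypothetical shape for it. Nothing below exists in SPDK; the struct and
function are stand-ins that only show where a per-module free callback would
plug in:

#include <stddef.h>

/* Hypothetical: a per-module callback invoked when the bdev_io is freed at
 * the very end of the request lifetime, so the module can release zero copy
 * buffers or other driver context it attached to the I/O. */
struct hypothetical_bdev_fn_table {
    void (*free_io_driver_ctx)(void *driver_ctx);
};

struct hypothetical_bdev_io {
    const struct hypothetical_bdev_fn_table *fn_table;
    void *driver_ctx;
};

static void
hypothetical_bdev_free_io(struct hypothetical_bdev_io *bdev_io)
{
    /* Give the module a chance to drop resources it pinned for this I/O. */
    if (bdev_io->fn_table != NULL && bdev_io->fn_table->free_io_driver_ctx != NULL) {
        bdev_io->fn_table->free_io_driver_ctx(bdev_io->driver_ctx);
    }
    /* ...then return the bdev_io to the pool as today. */
}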
> >
> > -----Original Message-----
> > From: Walker, Benjamin
> > Sent: Monday, July 9, 2018 11:22 PM
> > To: Meneghini, John; Popuri, Sriram; Kaligotla, Srikanth
> > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin; spdk(a)lists.01.org;
> > Pai, Madhu; NGC-john.barnard-broadcom.com
> > Subject: Re: BDEV-IO Lifecycle - Need your input.
> >
> > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > > Hi Jim.
> > >
> > > This is the patch series I believe Ben is proposing to replace
> > > Srikanth's: https://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > >
> > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer
> > > pick 09586e54 bdev: Add a zero copy I/O path
> > > pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
> > > pick 40a1f62b bdevperf: Use new zcopy API for reads
> > > pick eba6f35a bdev: Emulate zero copy support when necessary
> > >
> > > Is this correct?
> >
> > My patch series doesn't entirely replace Srikanth's set of patches. There
> > are two separate things necessary to implement zero copy. First is the
> > infrastructure within the bdev layer to make requests of a bdev module to
> > obtain or release a suitable buffer for a zero copy operation. Second is
> > actually making the NVMe-oF target use that mechanism, which includes
> > extending the bdev_io lifetime. My patch series only addresses the first
> > part.
> >
> > Thanks,
> > Ben
> >
> > > /John
> > >
> > > On 7/9/18, 11:31 AM, "Harris, James R" wrote:
> > >
> > > Hi Sriram:
> > >
> > > Ben has some later patches in this series with some examples:
> > >
> > > https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements this
> > > zero copy API in the simple 'malloc' bdev module
> > > https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the zero copy
> > > API in the SPDK bdevperf utility
> > >
> > > -Jim
> > >
> > > On 7/9/18, 3:34 AM, "Popuri, Sriram" wrote:
> > >
> > > Sorry, I was not focusing on this change. Just give me a day or two to
> > > get back. From a quick glance I didn't understand how zcopy start/end
> > > fits into a read/write life cycle. Is there an example of how zcopy
> > > start/end is consumed, or can someone give me a quick dump on how it's
> > > envisioned to be used?
> > >
> > > Regards,
> > > ~Sriram
> > >
> > > -----Original Message-----
> > > From: Meneghini, John
> > > Sent: Friday, July 6, 2018 10:16 PM
> > > To: Walker, Benjamin; Verkamp, Daniel; Harris, James R; Kaligotla, Srikanth
> > > Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez, Edwin;
> > > Pai, Madhu; NGC-john.barnard-broadcom.com; Popuri, Sriram
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >
> > > I'm adding Sriram to this thread. Sriram is another NetApp engineer who
> > > is working on this stuff internally.
> > >
> > > /John
> > >
> > > On 7/6/18, 12:41 PM, "Walker, Benjamin" wrote:
> > >
> > > Hi Srikanth,
> > >
> > > I wanted to check in to see if you had any additional feedback on the
> > > zero copy operations in the bdev layer here:
> > >
> > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >
> > > This does not address extending the lifetime of the bdev_io in the
> > > NVMe-oF target, but I think this is looking like the right mechanism for
> > > the bdev layer.
> > >
> > > Thanks,
> > > Ben
> > >
> > > On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> > > > Thanks Srikanth. Sounds like the spdk_bdev_io pool sizing along with
> > > > spdk_bdev_queue_io_wait() meets your needs then regarding the
> > > > spdk_bdev_io memory.
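For reference, the spdk_bdev_queue_io_wait() pattern Jim refers to looks
roughly like this from the caller's side. The context struct and helper names
are hypothetical; the point is the -ENOMEM retry path:

#include <errno.h>

#include "spdk/bdev.h"

/* Hypothetical per-request context; only the wait entry usage matters here. */
struct my_io_ctx {
    struct spdk_bdev_io_wait_entry wait_entry;
    struct spdk_bdev *bdev;
    struct spdk_bdev_desc *desc;
    struct spdk_io_channel *ch;
    void *buf;
    uint64_t offset_blocks;
    uint64_t num_blocks;
};

static void
write_complete_cb(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
{
    spdk_bdev_free_io(bdev_io);
}

static void retry_write(void *arg);

static void
submit_write(struct my_io_ctx *ctx)
{
    int rc;

    rc = spdk_bdev_write_blocks(ctx->desc, ctx->ch, ctx->buf,
                                ctx->offset_blocks, ctx->num_blocks,
                                write_complete_cb, ctx);
    if (rc == -ENOMEM) {
        /* No spdk_bdev_io available right now: queue a callback for when
         * one frees up instead of failing the request. */
        ctx->wait_entry.bdev = ctx->bdev;
        ctx->wait_entry.cb_fn = retry_write;
        ctx->wait_entry.cb_arg = ctx;
        spdk_bdev_queue_io_wait(ctx->bdev, ctx->ch, &ctx->wait_entry);
    }
}

static void
retry_write(void *arg)
{
    submit_write(arg);
}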
> > > >
> > > > Regarding zcopy_start/zcopy_end - it looks like you've already added a
> > > > bunch of comments to Ben's patch on GerritHub. For now I'd say let's
> > > > continue our discussion there. I've responded to a couple of similar
> > > > questions there and I'm sure Ben will have more replies tomorrow.
> > > >
> > > > -Jim
> > > >
> > > > From: "Kaligotla, Srikanth"
> > > > Date: Thursday, June 21, 2018 at 5:01 PM
> > > > To: James Harris, "Walker, Benjamin", Daniel Verkamp
> > > > Cc: "raju.gottumukkala(a)broadcom.com", "Meneghini, John",
> > > > "Rodriguez, Edwin", "Pai, Madhu", "NGC-john.barnard-broadcom.com",
> > > > "spdk(a)lists.01.org"
> > > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hi Jim,
> > > >
> > > > I wish I had joined the community meeting; I also missed the
> > > > spdk_bdev_queue_io_wait() changes.
> > > >
> > > > So, there are 2 issues I am intending to solve:
> > > >
> > > > 1. We want to ensure that an instance of bdev-io is acquired prior to,
> > > > or along with, the I/O data buffer. This allows for better error
> > > > handling when the bdev_io pool is exhausted. Yes, we can solve it via
> > > > sizing the pool right. The queue-io-wait changes also address the
> > > > problem.
> > > > 2. Most importantly, by extending the life of bdev-io (same life span
> > > > as the nvmf_request structure), the abort use case and other use cases
> > > > that involve cleaning up the I/O data buffer are accomplished
> > > > effectively. Let me elaborate: bdev-io is the fabric that connects the
> > > > nvme command and operation with the backend. The driver context and
> > > > data buffer context are stored in bdev-io. The freeing of the bdev_io
> > > > resource is pushed to the end so the I/O cleanup can happen after the
> > > > controller has transmitted the data to the host. In the absence of
> > > > bdev_io, we will end up adding more and more void context in the
> > > > request structure. Hence the push to extend the life of bdev_io.
> > > >
> > > > I see Ben's recent patch "zcopy_start" and "zcopy_end"; we can make
> > > > that work as long as the bdev-io allocated/acquired stays till the end.
> > > > One of the challenges I see with that patch is defining a callback for
> > > > the I/O submission. For instance, zcopy_start will allocate a bdev_io
> > > > and submit the I/O to the bdev device. The underlying implementation
> > > > can be synchronous or asynchronous. The callback for this submission
> > > > should check to see if it is BUFFER-READY or PENDING and accordingly
> > > > relay it back to the transport. The next phase is the actual I/O
> > > > submission. Let's say it reuses the bdev-io obtained in ZCOPY-START;
> > > > now the callback should determine if it is a success or failure.
> > > > Finally, when ZCOPY-END is invoked, the supplied bdev_io will have all
> > > > necessary data to release the WRITE buffer or unlock the read buffer
> > > > based on the operation performed.
> > > >
> > > > I hope I'm making sense.
> > > > I guess my effort is to extend the lifespan of bdev-io and let it
> > > > continue to host the driver context and buffer context populated
> > > > during the BEGIN phase.
> > > >
> > > > Regards,
> > > > Srikanth
> > > >
> > > > From: Harris, James R
> > > > Sent: Thursday, June 21, 2018 2:21 PM
> > > > To: Kaligotla, Srikanth; Walker, Benjamin; Verkamp, Daniel
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin;
> > > > Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hi Srikanth,
> > > >
> > > > Following up on this thread and the discussion in yesterday's community
> > > > meeting.
> > > >
> > > > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application
> > > > to work better in general when the spdk_bdev_io buffer pool is
> > > > exhausted. We still need to make changes to the NVMe-oF target to use
> > > > this new API, but that should be done soon-ish.
> > > >
> > > > With that in place, the spdk_bdev_io buffer pool itself can be
> > > > configured up or down when the application starts. Currently the
> > > > default is 64K spdk_bdev_io buffers. sizeof(struct spdk_bdev_io) == 216,
> > > > plus the per-IO context size allocated for the bdev module. This can be
> > > > up to 192 bytes (virtio bdev module) but is likely much smaller for
> > > > you, depending on the context size for your ontap bdev module.
> > > >
> > > > Let's assume your per-IO context size is 64 bytes. 64K x (192 + 64) = 16MB.
> > > >
> > > > I'm not sure how many spdk_bdev_io you need in flight at any given
> > > > time. 64K seems like a lot, but I'd be curious to hear your thoughts on
> > > > this. If this is the right number, then worst case, there would be
> > > > about 16MB of DRAM that would sit unused if the NVMe-oF target in your
> > > > system was not active. Is that too burdensome for your application?
> > > >
> > > > Thanks,
> > > >
> > > > -Jim
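If 64K is more than an application needs, the pool can be sized down before the
bdev layer initializes. A sketch, assuming the spdk_bdev_get_opts()/
spdk_bdev_set_opts() interface with its two pool-related fields:

#include "spdk/bdev.h"

/* Sketch: shrink the spdk_bdev_io pool at application startup.
 *
 * Worst-case memory for the pool is roughly
 *   bdev_io_pool_size x (sizeof(struct spdk_bdev_io) + per-module context),
 * e.g. the 64K default at ~256 bytes per I/O is about 16 MB, matching the
 * estimate above. */
static void
configure_bdev_io_pool(void)
{
    struct spdk_bdev_opts opts;

    spdk_bdev_get_opts(&opts);
    opts.bdev_io_pool_size = 8192;   /* enough for the expected queue depth */
    spdk_bdev_set_opts(&opts);
}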
> > > >
> > > > From: "Kaligotla, Srikanth"
> > > > Date: Wednesday, June 20, 2018 at 12:47 PM
> > > > To: "Walker, Benjamin", James Harris, Daniel Verkamp
> > > > Cc: "raju.gottumukkala(a)broadcom.com", "Meneghini, John",
> > > > "Rodriguez, Edwin", "Pai, Madhu", "NGC-john.barnard-broadcom.com",
> > > > "spdk(a)lists.01.org"
> > > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hello,
> > > >
> > > > The first revision of changes to extend the lifecycle of bdev_io is
> > > > available for review. I would like to solicit your input on the
> > > > proposed API/code flow changes.
> > > >
> > > > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > > >
> > > > Thanks,
> > > > Srikanth
> > > >
> > > > From: "Kaligotla, Srikanth"
> > > > Date: Friday, May 11, 2018 at 2:27 PM
> > > > To: "Walker, Benjamin", "Harris, James R"
> > > > Cc: "raju.gottumukkala(a)broadcom.com", "Meneghini, John",
> > > > "Rodriguez, Edwin", "Pai, Madhu", "NGC-john.barnard-broadcom.com",
> > > > "spdk(a)lists.01.org"
> > > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > CC: List
> > > >
> > > > Hi Ben,
> > > >
> > > > Your proposal to interface with the backend to acquire and release
> > > > buffers is good. You have accurately stated that the challenge is in
> > > > developing intuitive semantics. And that has been my struggle. To me,
> > > > there are two problem statements:
> > > >
> > > > 1. It is expected that the bdev_io pool is sized correctly so the call
> > > > to get a bdev_io succeeds. Failure to acquire a bdev_io will result in
> > > > DEVICE-ERROR. The transport is already capable of handling temporary
> > > > memory failures by moving the request to the PENDING queue. Hence the
> > > > proposal to change the bdev_io lifecycle and perhaps connect it with
> > > > the spdk_nvmf_request object. Thus all buffer needs are addressed at
> > > > the beginning of the I/O request.
> > > > 2. I/O data buffers are sourced and managed by the backend. One of the
> > > > challenges I see with your proposed interface is the lack of details
> > > > like whether the resource is being acquired for a READ operation or a
> > > > WRITE operation. The handling is quite different in each case. Since
> > > > bdev_io->type is overloaded, the type of I/O operation is lost. I
> > > > suppose one can cast the cb_arg (nvmf_request) and then proceed. Also,
> > > > the bdev_io should be present to RELEASE the buffer. Zero copy
> > > > semantics warrant that the data buffer stays until the
> > > > controller-to-host transfer has occurred. In other words, the bdev_io
> > > > lives till the REQUEST comes to the COMPLETE state.
> > > >
> > > > What are your thoughts on introducing spdk_bdev_init() and
> > > > spdk_bdev_fini() as an alternative approach to extend the lifecycle of
> > > > bdev_io and allow data buffer management via the bdev fn_table?
> > > >
> > > > I hope I'm making sense...
> > > >
> > > > Thanks,
> > > > Srikanth
> > > >
> > > > From: Walker, Benjamin
> > > > Sent: Friday, May 11, 2018 12:28 PM
> > > > To: Harris, James R; Kaligotla, Srikanth
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez, Edwin;
> > > > Pai, Madhu; NGC-john.barnard-broadcom.com
> > > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hi Srikanth,
> > > >
> > > > Yes - we'll need to introduce some way to acquire and release buffers
> > > > from the bdev layer earlier on in the state machine that processes an
> > > > NVMe-oF request. I've had this patch out for review for several months
> > > > as a proposal for this scenario:
> > > >
> > > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > > >
> > > > It doesn't pass the tests - it's just a proposal for the interface.
> > > > 90% of the challenge here is in developing intuitive semantics.
> > > >
> > > > Thanks,
> > > > Ben
> > > >
> > > > P.S. This is the kind of discussion that would fit perfectly on the
> > > > mailing list.
> > > >
> > > > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> > > > > Hi Ben, Hi James,
> > > > >
> > > > > I would like to solicit opinions on the lifecycle for the bdev-io
> > > > > resource object. Attached is the image of the RDMA state machine in
> > > > > its current implementation. When the REQUEST enters the NEED-BUFFER
> > > > > state, the buffers necessary for carrying out the I/O operation are
> > > > > allocated/acquired from the memory pool. An instance of BDEV-IO comes
> > > > > into existence after the REQUEST reaches the READY-TO-EXECUTE state.
> > > > > The BDEV-IO is torn down as soon as the backend returns. From a BDEV
> > > > > perspective, BDEV-IO is simply a translation unit that facilitates
> > > > > I/O buffers from the backend. The driver context embedded within
> > > > > bdev_io holds a great deal of information pertaining to the I/O under
> > > > > execution. It assists in error handling, dereferencing the buffers
> > > > > upon I/O completion and in abort handling. In summary, the bdev_io
> > > > > stays alive until the request has come to the COMPLETE state. I'd
> > > > > like to hear people's thoughts on introducing the plumbing to acquire
> > > > > the BDEV-IO resource in the REQUEST-NEED-BUFFER state and release it
> > > > > in the REQUEST-COMPLETE state. I will shortly have a patch available
> > > > > for review that introduces spdk_bdev_init and spdk_bdev_fini, which
> > > > > in turn invoke the corresponding bdev fn_table to initialize/cleanup.
> > > > >
> > > > > I wanted to use this email to communicate our intent and solicit your
> > > > > feedback. We have a working implementation of the above proposal and,
> > > > > prior to pushing it upstream for review, would like to hear your
> > > > > thoughts. These proposed changes to upstream are a result of FC
> > > > > transport work in collaboration with the Broadcom team, who are also
> > > > > copied on this mail. John and I will be at the SPDK Dev conference
> > > > > and if required we can elaborate further on this proposal.
> > > > >
> > > > > Thanks,
> > > > > Srikanth
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk