From: Luse, Paul E
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
Date: Thu, 12 Jul 2018 21:01:24 +0000
To: spdk(a)lists.01.org

Awesome Ben, thanks. This is quite a bit more sophisticated and flexible than what I think generally comes to mind with "zero copy", so in terms of naming this thing I guess I can't think of a better one - "data direct" or something, I dunno. Killer feature though, no matter what it's called...

I was thinking some more about an earlier suggestion from John about using bytes instead of blocks in the API - have you thought much about that?

Thx
Paul

-----Original Message-----
From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
Sent: Thursday, July 12, 2018 1:44 PM
To: John.Meneghini(a)netapp.com; spdk(a)lists.01.org; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
Cc: raju.gottumukkala(a)broadcom.com
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)

On Thu, 2018-07-12 at 01:16 +0000, Luse, Paul E wrote:
> This is a great explanation Ben, thanks. I'll start looking through
> the patches as well. A few quick questions for additional context and
> to make sure I understand your terminology:
>
> * Can you expand on "The populate flag determines whether the 'live'
> data is present in the given buffer", maybe with an example or
> something? (What do you mean by "live"? You use it a few times.)

I struggled to find the words to explain this. When you do a zcopy_start, you are returned some region(s) of memory. If populate is false, that buffer doesn't contain any valid data. If populate is true, the buffer contains the data on the disk for the described region of blocks. Similarly with commit - if you memcpy into a buffer provided by zcopy_start, that doesn't necessarily become the data on the block device (i.e. visible from a separate read operation) until the data is committed. Obviously, block device modules may have wildly varying internal implementations. For example, the malloc bdev module just allocates a single region of memory representing the block device and returns pointers into it, so the populate and commit flags don't do anything at all. But the malloc bdev is a special case - the norm is that some portion of the disk will be cached in DRAM while the real data is stored on some persistent device. Obtaining a zero copy buffer may or may not have to page data into DRAM, and commit may or may not have to page data out.

> * You mention "I believe I've addressed all known use cases with this design"
> but in this email chain, at least, you only cover the NVMe-oF use case.
> Can you list the known cases you mention and maybe just say a few words about them?
>

I'm glad you asked because it needs to be explained, but I'm also not glad you asked because now I have to type out the giant explanation!

There's a matrix of different things that need to be supported. On the bdev module side, we want to support a number of different implementations:
1) All data exists in addressable/DMA-able memory (equivalent to the malloc bdev module).
2) Some data exists in addressable/DMA-able memory (a caching bdev, or the nvme bdev module with a controller memory buffer).

Further, within those two choices, there are two other options:

1) When updating the data in the buffer returned by zcopy_start, concurrent reads will immediately see the updates. This is what I was trying to say by "live" elsewhere. This is how the malloc bdev module works, and it probably isn't a great bdev module design because it can't make any atomicity guarantees.
2) When updating data in the buffer returned by zcopy_start, future reads for the given blocks won't see the data until the buffer has been committed. We don't have any bdev modules that actually work like this today, but it's how I imagine a more robust caching bdev module would do it.

On the network side we also have to support four types of network interfaces:

1) TCP/IP via sockets (iSCSI, NVMe-oF). For a read (which sends data over the network on the target side), we want to be able to ask the bdev module to provide a buffer that holds the data, then we use that to do a send() on the socket. That's the minimum number of copies. For a write (which recvs data over the network) we don't use zero copy semantics, because the data has to be gathered into a buffer in user space anyway.
2) RDMA via ibverbs (NVMe-oF). For all paths, we want to ask the bdev to provide us a buffer describing the region on the disk, and then we do an RDMA transfer into or out of it.
3) FC. For both paths, we want to ask the bdev to provide a buffer describing the region on the disk, and then we do a DMA to/from the FC HBA.
4) virtio (vhost). The data is already in DMA-safe memory, so we don't benefit from zero copy at all.

So if you take all of these constraints and work out an API, you basically end up with what I proposed.

> Thanks
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Wednesday, July 11, 2018 2:53 PM
> To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
> Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
> Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
>
> I've done a bunch of work here to formalize the zero copy API in the bdev layer.
> The patch series begins here:
>
> https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
>
> There has been some good discussion on the patches themselves, but I
> wanted to circle back to the mailing list for a bit. The basic idea is
> to add two new functions to the bdev layer:
>
> int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
>                           struct spdk_io_channel *ch,
>                           uint64_t offset_blocks, uint64_t num_blocks,
>                           bool populate,
>                           spdk_bdev_io_completion_cb cb, void *cb_arg);
> int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
>                         spdk_bdev_io_completion_cb cb, void *cb_arg);
>
> The zcopy start call makes a request to the backing block device to
> provide a region of memory suitable for performing a zero copy update
> to the region described by the offset_blocks and num_blocks
> parameters. The populate flag determines whether the "live" data is
> present in the given buffer. When this request completes, it returns a
> bdev_io in the completion callback. That bdev_io will have a scatter
> gather list filled out describing the memory region.
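>
> To make the sequence concrete, here is a minimal sketch of a read-side
> caller (illustrative only - the callback and function names here are
> made up, error handling is simplified, the send/RDMA step is just a
> comment, and spdk_bdev_zcopy_end is described below):
>
> static void
> zcopy_release_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
> {
>         /* Buffer ownership has returned to the bdev module. */
>         spdk_bdev_free_io(bdev_io);
> }
>
> static void
> zcopy_read_done(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
> {
>         if (!success) {
>                 /* Fail the request (elided). */
>         }
>         /* The bdev module filled out the scatter gather list on the
>          * bdev_io. Because populate was true, those buffers hold live
>          * data - hand them to send() or an RDMA transfer here (elided).
>          * Afterwards, release the buffer. We pass commit=false since
>          * we only read from it. */
>         spdk_bdev_zcopy_end(bdev_io, false, zcopy_release_done, NULL);
> }
>
> static int
> start_zcopy_read(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
>                  uint64_t offset_blocks, uint64_t num_blocks)
> {
>         /* populate=true: we want the live data for these blocks. */
>         return spdk_bdev_zcopy_start(desc, ch, offset_blocks, num_blocks,
>                                      true, zcopy_read_done, NULL);
> }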
>
> When the user has finished transferring data into or out of the data
> buffer, the user calls spdk_bdev_zcopy_end with the originally
> provided bdev_io. The commit flag indicates whether the data in the
> buffer needs to be made "live".
>
> The NVMe-oF target can take advantage of these new APIs by altering
> the state machine in the RDMA layer to call spdk_bdev_zcopy_start to
> obtain data buffers after parsing the command, then performing RDMA
> transfers into or out of the provided region, and finally completing
> the command with spdk_bdev_zcopy_end. The work to do those
> modifications has not been done yet, and I expect it to be a
> significant amount of effort.
>
> As many bdev modules as possible should attempt to support zero copy
> semantics, but for those bdev modules that aren't able to, one of the
> patches in my series will automatically emulate the zcopy commands
> using regular reads and writes. This allows the higher level users of
> the bdev API to always make the zcopy calls for simplicity. The
> emulation path should be no worse, in terms of performance, than doing
> a regular read or write.
>
> I believe I've addressed all known use cases with this design, and
> while there will certainly be some future patches to fix problems that
> come up, I'm confident that we're on the right track here. If there is
> any feedback in this area at all, now is the time to provide it.
>
> I think this also leads into some really awesome future work that I'll
> suggest in the hope that someone in the community is inspired. One
> idea is that the bdev_nvme module could support zero copy semantics by
> returning a pointer to a portion of a Controller Memory Buffer or
> Persistent Memory Buffer suitable for data. There aren't many devices
> for sale today that can do this, but it's in the spec and I imagine
> they're coming. This would allow the NVMe-oF target, for example, to
> perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD.
> Similarly, for the bdev_pmem module, direct pointers into the
> persistent memory could be returned. The NVMe-oF target could then
> perform RDMA transfers into or out of the persistent memory directly,
> and the zcopy end call with commit set to true would perform the
> required flush instructions.
>
> Thanks,
> Ben
>
>
> On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> > I would like to jot down what requirements we are looking at.
> > Wherever there is "====wait" for something, there is a potential
> > context switch.
> >
> > For a write request, the following is the workflow for zero copy:
> > 1) Host submits Write request
> > 2) Protocol validation of request
> > 3) Get iovs
> > 4) R2T/RDMA_READ
> > ====wait for host====
> > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 6) Perform Write (spdk_bdev_writev_blocks) to bdev_module
> > ====wait for bdev module to respond====
> > 7) On bdev_module response, send rsp/status to host
> > 8) Free req resources
> >
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validation of request
> > 3) Submit read to bdev_module
> > ====wait for data blocks from bdev_module====
> > 4) Respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) Host ack/RDMA_WRITE_COMPLETE
> > 6) Free req resources
> >
> > In the future we will have Persistent Reservations, where there can
> > be read/modify/update cases depending on how the bdev module handles
> > them. So the zero copy APIs need to address future requirements as
> > well.
> >
> > The above workflows suggest we require a state to track what step the
> > request is at. A state will help to clean up resources properly when
> > handling aborts, disconnects, namespace destroy workflows, etc. So
> > the requirement is to figure out a way to clean up resources
> > according to the state of the bdev_io in its life cycle, and stitch
> > the zero copy workflows into existing spdk code, which assumes
> > buffers are pre-allocated.
> >
> > I think Srikanth's proposed changes are made in the target layer to
> > populate and commit to the bdev module. However, his suggestion
> > changes the bdev_io life cycle, which I understand from John is not
> > agreeable to others.
> >
> > With the new zcopy APIs there is also no clarity on the bdev_io life
> > cycle. A bdev_io is acquired during zcopy_start, and it's not clear
> > how to deal with the bdev_io allocated in the read/write paths.
> >
> > The orchestration of zero copy can be done either at the transport
> > layer, the target layer or the bdev module layer.
> >
> > We don't like the orchestration done from the transport layer because:
> > 1) There is no clear way to deal with the bdev_io acquired from zcopy
> > and the bdev_io acquired from the read/write codepaths.
> > 2) If I understand correctly, by the time a read/write request is
> > submitted to the target, the data_out would have happened. So the
> > protocol validation happens after the data_out phase. For example, an
> > LBA out of range request or invalid SGL length is returned after a
> > data_out is performed. We want to run through the protocol validation
> > before the data_out phase.
> >
> > I feel the bdev module is the ideal place to deal with zero copy
> > buffers. The bdev module knows how to get buffers and when to claim
> > them. All it needs is a proper cleanup, and by the time a request
> > hits the bdev module, a bdev_io is acquired, so there is no need to
> > change the bdev_io life cycle or introduce a new IO type.
> >
> > I propose the following:
> > 1) Host submits Write request
> > 2) Protocol validation of request (spdk_nvmf_request_exec)
> > 3) Submit request to bdev module (spdk_bdev_writev_blocks(); a
> > bdev_io is acquired)
> > 4) bdev module determines the buffers need to be allocated.
> > Will perform something like what is proposed in the patch:
> > spdk_bdev_zcopy_start(populate=false)
> > 5) New spdk_nvmf_request_exec_status
> > (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> > 6) Transport does R2T/RDMA_READ
> > ====wait for host====
> > 7) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > 8) Perform Write again (spdk_nvmf_request_exec/spdk_bdev_writev_blocks)
> > to bdev_module.
> > ====wait for bdev module to respond====
> > 9) On bdev_module response, send rsp/status to host
> > 10) Free req resources; free the bdev_io with spdk_bdev_free_io.
> > Here we need a hook to free resources allocated by the bdev module:
> > spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> >
> > Read path:
> > 1) Host submits Read request
> > 2) Protocol validation of request
> > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
> > ====wait for data blocks from bdev_module====
> > 4) Respond iov (data_in/RDMA_WRITE)
> > ====wait for host====
> > 5) Host ack/RDMA_WRITE_COMPLETE
> > 6) Free req resources; free the bdev_io with spdk_bdev_free_io.
> > Here we need a hook to free resources allocated by the bdev module:
> > spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> >
> > If it makes sense we can work on a patch.
> >
> > Regards,
> > ~Sriram
> >
> > -----Original Message-----
> > From: Walker, Benjamin
> > Sent: Monday, July 9, 2018 11:22 PM
> > To: Meneghini, John; Popuri, Sriram; Kaligotla, Srikanth
> > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin;
> > spdk(a)lists.01.org; Pai, Madhu; NGC-john.barnard-broadcom.com
> > Subject: Re: BDEV-IO Lifecycle - Need your input.
> >
> > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > > Hi Jim.
> > >
> > > This is the patch series I believe Ben is proposing to replace
> > > Srikanth's: https://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > >
> > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer
> > > pick 09586e54 bdev: Add a zero copy I/O path
> > > pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
> > > pick 40a1f62b bdevperf: Use new zcopy API for reads
> > > pick eba6f35a bdev: Emulate zero copy support when necessary
> > >
> > > Is this correct?
> >
> > My patch series doesn't entirely replace Srikanth's set of patches.
> > There are two separate things necessary to implement zero copy. First
> > is the infrastructure within the bdev layer to make requests of a
> > bdev module to obtain or release a suitable buffer for a zero copy
> > operation. Second is actually making the NVMe-oF target use that
> > mechanism, which includes extending the bdev_io lifetime. My patch
> > series only addresses the first part.
> >
> > Thanks,
> > Ben
> >
> > >
> > > /John
> > >
> > > On 7/9/18, 11:31 AM, "Harris, James R" wrote:
> > >
> > >     Hi Sriram:
> > >
> > >     Ben has some later patches in this series with some examples:
> > >
> > >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements
> > >     this zero copy API in the simple 'malloc' bdev module
> > >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the
> > >     zero copy API in the SPDK bdevperf utility
> > >
> > >     -Jim
> > >
> > >
> > >     On 7/9/18, 3:34 AM, "Popuri, Sriram" wrote:
> > >
> > >         Sorry, was not focusing on this change. Just give me a day
> > >         or two to get back.
> > >
> > >         From a quick glance I didn't understand how zcopy start/end
> > >         fits into a read/write life cycle. Is there an example of
> > >         how zcopy start/end is consumed, or can someone give me a
> > >         quick dump on how it's envisioned to be used?
> > >
> > >         Regards,
> > >         ~Sriram
> > >
> > >         -----Original Message-----
> > >         From: Meneghini, John
> > >         Sent: Friday, July 6, 2018 10:16 PM
> > >         To: Walker, Benjamin; Verkamp, Daniel; Harris, James R;
> > >         Kaligotla, Srikanth
> > >         Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org;
> > >         Rodriguez, Edwin; Pai, Madhu;
> > >         NGC-john.barnard-broadcom.com; Popuri, Sriram
> > >         Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >
> > >         I'm adding Sriram to this thread. Sriram is another NetApp
> > >         engineer who is working on this stuff internally.
> > >
> > >         /John
> > >
> > >         On 7/6/18, 12:41 PM, "Walker, Benjamin" wrote:
> > >
> > >             Hi Srikanth,
> > >
> > >             I wanted to check in to see if you had any additional
> > >             feedback on the zero copy operations in the bdev layer
> > >             here:
> > >
> > >             https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > >
> > >             This does not address extending the lifetime of the
> > >             bdev_io in the NVMe-oF target, but I think this is
> > >             looking like the right mechanism for the bdev layer.
> > >
> > >             Thanks,
> > >             Ben
> > >
> > >             On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> > > > Thanks Srikanth. Sounds like the spdk_bdev_io pool sizing along
> > > > with spdk_bdev_queue_io_wait() meets your needs then regarding the
> > > > spdk_bdev_io memory.
> > > >
> > > > Regarding zcopy_start/zcopy_end - it looks like you've already
> > > > added a bunch of comments to Ben's patch on GerritHub. For now I'd
> > > > say let's continue our discussion there. I've responded to a
> > > > couple of similar questions there and I'm sure Ben will have more
> > > > replies tomorrow.
> > > >
> > > > -Jim
> > > >
> > > >
> > > > From: "Kaligotla, Srikanth"
> > > > Date: Thursday, June 21, 2018 at 5:01 PM
> > > > To: James Harris; Walker, Benjamin; Daniel Verkamp
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hi Jim,
> > > >
> > > > I wish I had joined the community meeting; I also missed the
> > > > spdk_bdev_queue_io_wait().
> > > >
> > > > So, there are 2 issues I am intending to solve:
> > > >
> > > > 1. We want to ensure that an instance of bdev-io is acquired prior
> > > > to or along with the I/O data buffer. This allows for better error
> > > > handling when the bdev_io pool is exhausted. Yes, we can solve it
> > > > via sizing the pool right, and the queue-io-wait changes also
> > > > address the problem.
> > > > 2. Most importantly, by extending the life of bdev-io (same life
> > > > span as the nvmf_request structure), the abort use case and other
> > > > use cases that involve cleaning up the I/O data buffer are
> > > > accomplished effectively.
> > > > Let me elaborate: bdev-io is the fabric that connects the nvme
> > > > command and operation with the backend. The driver context and
> > > > data buffer context are stored in bdev-io. The freeing of the
> > > > bdev_io resource is pushed to the end so the I/O cleanup can
> > > > happen after the controller has transmitted the data to the host.
> > > > In the absence of bdev_io, we would end up adding more and more
> > > > void contexts in the request structure. Hence the push to extend
> > > > the life of bdev_io.
> > > >
> > > > I see Ben's recent patch "zcopy_start" and "zcopy_end"; we can
> > > > make that work as long as the bdev-io allocated/acquired stays
> > > > till the end. One of the challenges I see with that patch is
> > > > defining a callback for the I/O submission. For instance,
> > > > zcopy_start will allocate a bdev_io and submit the I/O to the bdev
> > > > device. The underlying implementation can be synchronous or
> > > > asynchronous. The callback for this submission should check to see
> > > > if it is BUFFER-READY or PENDING and accordingly relay it back to
> > > > the transport. The next phase is the actual I/O submission. Let's
> > > > say it reuses the bdev-io obtained in ZCOPY-START; now the
> > > > callback should determine if it is a success or failure. Finally,
> > > > when ZCOPY-END is invoked, the supplied bdev_io will have all the
> > > > necessary data to release the WRITE buffer or unlock the read
> > > > buffer based on the operation performed.
> > > >
> > > > I hope I'm making sense. I guess my effort is to extend the
> > > > lifespan of bdev-io and let it continue to host the driver context
> > > > and buffer context populated during the BEGIN phase.
> > > >
> > > > Regards,
> > > > Srikanth
> > > >
> > > > From: Harris, James R
> > > > Sent: Thursday, June 21, 2018 2:21 PM
> > > > To: Kaligotla, Srikanth; Walker, Benjamin; Verkamp, Daniel
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hi Srikanth,
> > > >
> > > > Following up on this thread and the discussion in yesterday's
> > > > community meeting.
> > > >
> > > > The recent spdk_bdev_queue_io_wait() changes allow an SPDK
> > > > application to work better in general when the spdk_bdev_io buffer
> > > > pool is exhausted. We still need to make changes to the NVMe-oF
> > > > target to use this new API, but that should be done soon-ish.
> > > >
> > > > With that in place, the spdk_bdev_io buffer pool itself can be
> > > > configured up or down when the application starts. Currently the
> > > > default is 64K spdk_bdev_io buffers. sizeof(struct spdk_bdev_io)
> > > > == 216, plus the per-IO context size allocated for the bdev
> > > > module. This can be up to 192 bytes (virtio bdev module) but is
> > > > likely much smaller for you, depending on the context size for
> > > > your ontap bdev module.
> > > >
> > > > Let's assume your per-IO context size is 64 bytes. 64K x (192 +
> > > > 64) = 16MB.
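> > > > (Worked out: 65,536 buffers x 256 bytes per buffer = 16 MiB. The
> > > > 256 pairs the 64 bytes of context with 192 rather than the
> > > > 216-byte struct size quoted above; with the full 216 + 64 = 280
> > > > bytes per IO it would be closer to 17.5 MiB.)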
> > > >
> > > > I'm not sure how many spdk_bdev_io you need in flight at any given
> > > > time. 64K seems like a lot, but I'd be curious to hear your
> > > > thoughts on this. If this is the right number, then worst case
> > > > there would be about 16MB of DRAM that would sit unused if the
> > > > NVMe-oF target in your system was not active. Is that too
> > > > burdensome for your application?
> > > >
> > > > Thanks,
> > > >
> > > > -Jim
> > > >
> > > >
> > > > From: "Kaligotla, Srikanth"
> > > > Date: Wednesday, June 20, 2018 at 12:47 PM
> > > > To: Walker, Benjamin; James Harris; Daniel Verkamp
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hello,
> > > >
> > > > The first revision of the changes to extend the lifecycle of
> > > > bdev_io is available for review. I would like to solicit your
> > > > input on the proposed API/code flow changes.
> > > >
> > > > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > > >
> > > > Thanks,
> > > > Srikanth
> > > >
> > > >
> > > > From: "Kaligotla, Srikanth"
> > > > Date: Friday, May 11, 2018 at 2:27 PM
> > > > To: Walker, Benjamin; Harris, James R
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > CC: List
> > > >
> > > > Hi Ben,
> > > >
> > > > Your proposal to interface with the backend to acquire and release
> > > > buffers is good. You have accurately stated that the challenge is
> > > > in developing intuitive semantics. And that has been my struggle.
> > > > To me, there are two problem statements:
> > > >
> > > > 1. It is expected that the bdev_io pool is sized correctly so the
> > > > call to get a bdev_io succeeds. Failure to acquire a bdev_io will
> > > > result in DEVICE-ERROR. The transport is already capable of
> > > > handling temporary memory failures by moving the request to the
> > > > PENDING queue. Hence the proposal to change the bdev_io lifecycle
> > > > and perhaps connect it with the spdk_nvmf_request object, so that
> > > > all buffer needs are addressed at the beginning of the I/O request.
> > > > 2. I/O data buffers are sourced and managed by the backend. One of
> > > > the challenges I see with your proposed interface is the lack of
> > > > details like whether the resource is being acquired for a READ
> > > > operation or a WRITE operation; the handling is quite different in
> > > > each case. Since bdev_io->type is overloaded, the type of I/O
> > > > operation is lost. I suppose one can cast the cb_arg
> > > > (nvmf_request) and then proceed. Also, the bdev_io should be
> > > > present to RELEASE the buffer. Zero copy semantics warrant that
> > > > the data buffer stays until controller-to-host transfer has
> > > > occurred. In other words, the bdev_io lives till the REQUEST comes
> > > > to the COMPLETE state.
> > > >
> > > > What are your thoughts on introducing spdk_bdev_init() and
> > > > spdk_bdev_fini() as an alternative approach to extend the
> > > > lifecycle of bdev_io and allow data buffer management via the bdev
> > > > fn_table?
> > > >
> > > > I hope I'm making sense...
> > > >
> > > > Thanks,
> > > > Srikanth
> > > >
> > > > From: Walker, Benjamin
> > > > Sent: Friday, May 11, 2018 12:28 PM
> > > > To: Harris, James R; Kaligotla, Srikanth
> > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com
> > > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > >
> > > > Hi Srikanth,
> > > >
> > > > Yes - we'll need to introduce some way to acquire and release
> > > > buffers from the bdev layer earlier on in the state machine that
> > > > processes an NVMe-oF request. I've had this patch out for review
> > > > for several months as a proposal for this scenario:
> > > >
> > > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > > >
> > > > It doesn't pass the tests - it's just a proposal for the
> > > > interface. 90% of the challenge here is in developing intuitive
> > > > semantics.
> > > >
> > > > Thanks,
> > > > Ben
> > > >
> > > > P.S. This is the kind of discussion that would fit perfectly on
> > > > the mailing list.
> > > >
> > > > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> > > > > Hi Ben, Hi James,
> > > > >
> > > > > I would like to solicit opinions on the lifecycle for the
> > > > > bdev-io resource object. Attached is an image of the RDMA state
> > > > > machine in its current implementation. When the REQUEST enters
> > > > > the NEED-BUFFER state, the buffers necessary for carrying out
> > > > > the I/O operation are allocated/acquired from the memory pool.
> > > > > An instance of BDEV-IO comes into existence after the REQUEST
> > > > > reaches the READY-TO-EXECUTE state. The BDEV-IO is torn down as
> > > > > soon as the backend returns. From a BDEV perspective, BDEV-IO is
> > > > > simply a translation unit that facilitates I/O buffers from the
> > > > > backend. The driver context embedded within bdev_io holds a
> > > > > great deal of information pertaining to the I/O under execution.
> > > > > It assists in error handling, dereferencing the buffers upon I/O
> > > > > completion, and in abort handling. In summary, the bdev_io stays
> > > > > alive until the request has come to the COMPLETE state. I'd like
> > > > > to hear people's thoughts on introducing the plumbing to acquire
> > > > > the BDEV-IO resource in the REQUEST-NEED-BUFFER state and
> > > > > release it in the REQUEST-COMPLETE state. I will shortly have a
> > > > > patch available for review that introduces spdk_bdev_init and
> > > > > spdk_bdev_fini, which in turn invoke the corresponding bdev
> > > > > fn_table to initialize/cleanup.
> > > > >
> > > > > I wanted to use this email to communicate our intent and solicit
> > > > > your feedback. We have a working implementation of the above
> > > > > proposal and, prior to pushing it upstream for review, would
> > > > > like to hear your thoughts.
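> > > > >
> > > > > To sketch the shape of it (hypothetical names and signatures -
> > > > > the actual patch may differ), the fn_table additions would look
> > > > > something like:
> > > > >
> > > > >     /* Invoked when the request enters REQUEST-NEED-BUFFER:
> > > > >      * acquire the driver context and I/O data buffers. */
> > > > >     int (*io_init)(struct spdk_bdev_io *bdev_io);
> > > > >     /* Invoked when the request reaches REQUEST-COMPLETE:
> > > > >      * release them. */
> > > > >     void (*io_fini)(struct spdk_bdev_io *bdev_io);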
> > > > > These proposed changes to upstream are a result of FC transport
> > > > > work in collaboration with the Broadcom team, who are also
> > > > > copied on this mail. John and I will be at the SPDK Dev
> > > > > conference, and if required we can elaborate further on this
> > > > > proposal.
> > > > >
> > > > > Thanks,
> > > > > Srikanth

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk