From: Walker, Benjamin
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
Date: Thu, 12 Jul 2018 21:16:12 +0000
Message-ID: <120d47e820a90e9c37db850bce14001e35d3ea70.camel@intel.com>
In-Reply-To: <82C9F782B054C94B9FC04A331649C77AAD4C4A72@fmsmsx104.amr.corp.intel.com>
To: spdk@lists.01.org

On Thu, 2018-07-12 at 21:01 +0000, Luse, Paul E wrote:
> Awesome Ben, thanks. This is quite a bit more sophisticated and flexible than I think what generally comes to mind with "zero copy". In terms of naming this thing, I guess I can't think of a better one - "data direct" or something, I dunno. Killer feature though, no matter what it's called...
>
> I was thinking some more about an earlier suggestion from John about using bytes instead of blocks in the API - have you thought much about that?

Sure - the length of the buffer could be specified in either bytes or blocks, as long as when using bytes the length is a block multiple. We have a similar scenario with the regular bdev I/O calls (spdk_bdev_read is in bytes, spdk_bdev_read_blocks is in blocks). We could do the same thing for the zcopy API, where there is a byte and a block variant.

If we're talking about performing I/O in non-block sizes, then no - that breaks about a million different things. This is a block device abstraction layer, with emphasis on the word block. Certainly, for some classes of devices part of the device may be accessible at byte granularity, and we could discuss ways to expose that to the user in a generic way at some point down the road. Maybe a strategy like what NVMe devices do with CMB/PMR would work. However, I think it's much more likely that accessing byte-addressable memory should just use a different access method (like libpmem) rather than a block stack. I consider this discussion very separate from the zcopy API (because it would equally affect regular read and write), so I've left it out here.

> Thx
> Paul
>
> -----Original Message-----
> From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> Sent: Thursday, July 12, 2018 1:44 PM
> To: John.Meneghini(a)netapp.com; spdk(a)lists.01.org; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
> Cc: raju.gottumukkala(a)broadcom.com
> Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
>
> On Thu, 2018-07-12 at 01:16 +0000, Luse, Paul E wrote:
> > This is a great explanation Ben, thanks. I'll start looking through the patches as well. A few quick questions for additional context and to make sure I understand your terminology:
> >
> > * Can you expand on "The populate flag determines whether the 'live' data is present in the given buffer", maybe with an example or something? (What do you mean by "live"? You use it a few times.)
>
> I struggled to find the words to explain this. When you do a zcopy_start, you are returned some region(s) of memory. If populate is false, that buffer doesn't contain any valid data. If populate is true, the buffer contains the data on the disk for the described region of blocks. Similarly with commit - if you memcpy into a buffer provided from zcopy_start, that doesn't necessarily become the data on the block device (i.e. visible from a separate read operation) until the data is committed. Obviously, block device modules may have wildly varying internal implementations. For example, the malloc bdev module just allocates a single region of memory representing the block device and returns pointers to it, so the populate and commit flags don't do anything at all. But the malloc bdev is a special case - the norm is that some portion of the disk will be cached in DRAM while the real data is stored on some persistent device. Obtaining a zero copy buffer may or may not have to page data into DRAM, and commit may or may not have to page data out.
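To make the two flags concrete, here is a minimal sketch of a zero copy write built only on the two spdk_bdev_zcopy_* prototypes quoted further down in this thread. The request context and handler names are hypothetical, and error handling is pared down to the bare minimum:

    #include "spdk/bdev.h"

    /* Hypothetical per-request bookkeeping - not part of the proposed API. */
    struct zcopy_write_req {
        struct spdk_bdev_desc   *desc;
        struct spdk_io_channel  *ch;
        uint64_t                offset_blocks;
        uint64_t                num_blocks;
    };

    static void
    zcopy_write_committed(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
    {
        /* After a successful commit the data is "live": a separate read of
         * these blocks now observes what was written into the buffer. */
        spdk_bdev_free_io(bdev_io);
    }

    static void
    zcopy_write_buf_ready(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
    {
        struct zcopy_write_req *req = cb_arg;

        if (!success) {
            spdk_bdev_free_io(bdev_io);
            return;
        }

        /* populate was false, so the scatter gather list in bdev_io points at
         * buffers that hold no valid data yet. Fill them here (recv(), RDMA
         * READ, memcpy, ...), then commit so the update becomes visible. */
        spdk_bdev_zcopy_end(bdev_io, true /* commit */, zcopy_write_committed, req);
    }

    static int
    zcopy_write_start(struct zcopy_write_req *req)
    {
        /* populate = false: we only need a buffer to fill, not the old data. */
        return spdk_bdev_zcopy_start(req->desc, req->ch, req->offset_blocks,
                                     req->num_blocks, false /* populate */,
                                     zcopy_write_buf_ready, req);
    }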
> > * You mention "I believe I've addressed all known use cases with this design", but in this email chain, at least, you only cover the NVMe-oF use case. Can you list the known cases you mention and maybe just say a few words about them?
>
> I'm glad you asked because it needs to be explained, but I'm also not glad you asked because now I have to type out the giant explanation!
>
> There's a matrix of different things that need to be supported. On the bdev module side, we want to support a number of different implementations:
>
> 1) All data exists in addressable/DMA-able memory (equivalent to the malloc bdev module).
> 2) Some data exists in addressable/DMA-able memory (a caching bdev, or the nvme bdev module with a controller memory buffer).
>
> Further, within those two choices, there are two other options:
>
> 1) When updating the data in the buffer returned by zcopy_start, concurrent reads will immediately see the updates. This is what I was trying to say by "live" elsewhere. This is how the malloc bdev module works, and probably isn't a great bdev module design because it can't make any atomicity guarantees.
> 2) When updating data in the buffer returned by zcopy_start, future reads for the given blocks won't see the data until the buffer has been committed. We don't have any bdev modules that actually work like this today, but it's how I imagine a more robust caching bdev module would do it.
>
> On the network side we also have to support four types of network interfaces:
>
> 1) TCP/IP via sockets (iSCSI, NVMe-oF). For a read (which sends data over the network on the target side), we want to be able to ask the bdev module to provide a buffer that holds the data, then we use that to do a send() on the socket. That's the minimum number of copies (a sketch of this path follows below). For a write (which recvs data over the network), we don't use zero copy semantics because the data has to be gathered into a buffer in user space anyway.
> 2) RDMA via ibverbs (NVMe-oF). For all paths, we want to ask the bdev to provide us a buffer describing the region on the disk and then we do an RDMA transfer into or out of it.
> 3) FC. For both paths, we want to ask the bdev to provide a buffer describing the region on the disk and then we do a DMA to/from the FC HBA.
> 4) virtio (vhost). The data is already in DMA-safe memory, so we don't benefit from zero copy at all.
>
> So if you take all of these constraints and you work out an API, you basically end up with what I proposed.
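As a rough illustration of the first case above (the socket read path on the target side), here is a sketch that assumes the completed bdev_io exposes its scatter gather list through u.bdev.iovs/u.bdev.iovcnt the way other bdev I/O does; the socket descriptor is hypothetical and partial-send/error handling is omitted:

    #include <sys/socket.h>
    #include "spdk/bdev.h"

    static void
    zcopy_read_released(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
    {
        spdk_bdev_free_io(bdev_io);
    }

    static void
    zcopy_read_buf_ready(struct spdk_bdev_io *bdev_io, bool success, void *cb_arg)
    {
        int fd = *(int *)cb_arg;    /* hypothetical: socket connected to the host */
        struct msghdr msg = { 0 };

        if (success) {
            /* populate was true, so the buffer already holds the on-disk data
             * and can be handed straight to the socket - no extra copy. */
            msg.msg_iov = bdev_io->u.bdev.iovs;
            msg.msg_iovlen = bdev_io->u.bdev.iovcnt;
            sendmsg(fd, &msg, 0);
        }

        /* Nothing was modified, so there is nothing to commit. */
        spdk_bdev_zcopy_end(bdev_io, false /* commit */, zcopy_read_released, NULL);
    }

    static int
    zcopy_read_start(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                     uint64_t offset_blocks, uint64_t num_blocks, int *fd)
    {
        /* populate = true: ask the bdev module for a buffer that contains the
         * current data for these blocks. */
        return spdk_bdev_zcopy_start(desc, ch, offset_blocks, num_blocks,
                                     true /* populate */, zcopy_read_buf_ready, fd);
    }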
> >
> > Thanks
> > Paul
> >
> > -----Original Message-----
> > From: SPDK [mailto:spdk-bounces(a)lists.01.org] On Behalf Of Walker, Benjamin
> > Sent: Wednesday, July 11, 2018 2:53 PM
> > To: John.Meneghini(a)netapp.com; Sriram.Popuri(a)netapp.com; Srikanth.Kaligotla(a)netapp.com
> > Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org
> > Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
> >
> > I've done a bunch of work here to formalize the zero copy API in the bdev layer. The patch series begins here:
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> >
> > There has been some good discussion on the patches themselves, but I wanted to circle back to the mailing list for a bit. The basic idea is to add two new functions to the bdev layer:
> >
> > int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
> >                           struct spdk_io_channel *ch,
> >                           uint64_t offset_blocks, uint64_t num_blocks,
> >                           bool populate,
> >                           spdk_bdev_io_completion_cb cb, void *cb_arg);
> > int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
> >                         spdk_bdev_io_completion_cb cb, void *cb_arg);
> >
> > The zcopy start call makes a request to the backing block device to provide a region of memory suitable for performing a zero copy update to the region described by the offset_blocks and num_blocks parameters. The populate flag determines whether the "live" data is present in the given buffer. When this request completes, it returns a bdev_io in the completion callback. That bdev_io will have a scatter gather list filled out describing the memory region.
> >
> > When the user has finished transferring data into or out of the data buffer, the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The commit flag indicates whether the data in the buffer needs to be made "live".
> >
> > The NVMe-oF target can take advantage of these new APIs by altering the state machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers after parsing the command, then performing RDMA transfers into or out of the provided region, and finally completing the command with spdk_bdev_zcopy_end. The work to do those modifications has not been done yet, and I expect it to be a significant amount of effort.
> >
> > As many bdev modules as possible should attempt to support zero copy semantics, but for those bdev modules that aren't able to, one of the patches in my series will automatically emulate the zcopy commands using regular reads and writes. This allows the higher level users of the bdev API to always make the zcopy calls for simplicity. The emulation path should be no worse, in terms of performance, than doing a regular read or write.
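The actual emulation lives inside the generic bdev layer, but the idea can be sketched with public calls: when a module has no native zcopy support, a start with populate set degenerates into an ordinary read into a privately allocated buffer, and an end with commit set degenerates into an ordinary write from it. The context struct and helper names below are hypothetical, not the code in the patch:

    #include <errno.h>
    #include "spdk/bdev.h"
    #include "spdk/env.h"

    /* Hypothetical bookkeeping kept between the emulated start and end calls. */
    struct zcopy_emul {
        void     *buf;
        uint64_t offset_blocks;
        uint64_t num_blocks;
    };

    static int
    zcopy_start_emulated(struct spdk_bdev *bdev, struct spdk_bdev_desc *desc,
                         struct spdk_io_channel *ch, struct zcopy_emul *emul,
                         uint64_t offset_blocks, uint64_t num_blocks, bool populate,
                         spdk_bdev_io_completion_cb cb, void *cb_arg)
    {
        uint32_t block_size = spdk_bdev_get_block_size(bdev);

        emul->buf = spdk_dma_malloc(num_blocks * block_size, block_size, NULL);
        if (emul->buf == NULL) {
            return -ENOMEM;
        }
        emul->offset_blocks = offset_blocks;
        emul->num_blocks = num_blocks;

        if (populate) {
            /* Fill the private buffer with the current on-disk data; the
             * completion hands back a bdev_io that describes emul->buf. */
            return spdk_bdev_read_blocks(desc, ch, emul->buf, offset_blocks,
                                         num_blocks, cb, cb_arg);
        }

        /* populate == false: the buffer could be handed out immediately, but
         * completing without issuing a device I/O needs bdev-internal hooks
         * that are left out of this sketch. */
        return 0;
    }

    static int
    zcopy_end_emulated(struct spdk_bdev_desc *desc, struct spdk_io_channel *ch,
                       struct zcopy_emul *emul, bool commit,
                       spdk_bdev_io_completion_cb cb, void *cb_arg)
    {
        if (commit) {
            /* Write the (possibly modified) private buffer back to the device;
             * emul->buf is freed in the write completion (not shown). */
            return spdk_bdev_write_blocks(desc, ch, emul->buf, emul->offset_blocks,
                                          emul->num_blocks, cb, cb_arg);
        }

        spdk_dma_free(emul->buf);
        return 0;
    }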
> > I believe I've addressed all known use cases with this design, and while there will certainly be some future patches to fix problems that come up, I'm confident that we're on the right track here. If there is any feedback in this area at all, now is the time to provide it.
> >
> > I think this also leads into some really awesome future work that I'll suggest in the hope that someone in the community is inspired. One idea is that the bdev_nvme module could support zero copy semantics by returning a pointer to a portion of a Controller Memory Buffer or Persistent Memory Region suitable for data. There aren't many devices for sale today that can do this, but it's in the spec and I imagine they're coming. This would allow the NVMe-oF target, for example, to perform RDMA transfers peer-to-peer between the NIC and the NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the persistent memory could be returned. The NVMe-oF target could then perform RDMA transfers into or out of the persistent memory directly, and the zcopy end call with commit set to true would perform the required flush instructions.
> >
> > Thanks,
> > Ben
> >
> > On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> > > I would like to jot down what requirements we are looking at. Wherever there is "====wait====" for something, there is a potential context switch.
> > >
> > > For a write request, the following is the workflow for zero copy:
> > > 1) Host submits Write request
> > > 2) Protocol validation of request
> > > 3) Get iovs
> > > 4) R2T/RDMA_READ
> > > ====wait for host====
> > > 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > > 6) Perform Write (spdk_bdev_writev_blocks) to bdev_module
> > > ====wait for bdev module to respond====
> > > 7) On bdev_module response, send rsp/status to host
> > > 8) Free req resources
> > >
> > > Read path:
> > > 1) Host submits Read request
> > > 2) Protocol validation of request
> > > 3) Submit read to bdev_module
> > > ====wait for data blocks from bdev_module====
> > > 4) Respond iov (data_in/RDMA_WRITE)
> > > ====wait for host====
> > > 5) Host ack/RDMA_WRITE_COMPLETE
> > > 6) Free req resources
> > >
> > > In future we will have Persistent Reservations, where there can be read/modify/update cases depending on how the bdev module handles them. So the zero copy APIs need to address future requirements as well.
> > >
> > > The above workflows suggest we require a state to track what step the request is at. A state will help to clean up resources properly when handling aborts, disconnects, namespace destroy workflows, etc. So the requirement is to figure out a way to clean up resources according to the state of the bdev_io in its life cycle, and stitch the zero copy workflows into the existing SPDK code, which assumes buffers are pre-allocated.
> > >
> > > I think Srikanth's proposed changes are made in the target layer to populate and commit to the bdev module. However, his suggestion changes the bdev_io life cycle, which I understand from John is not agreeable to others.
> > >
> > > With the new zcopy APIs there is also no clarity on the bdev_io life cycle. A bdev_io is acquired during zcopy_start, and it's not clear how to deal with the bdev_io allocated in the read/write paths.
> > >
> > > The orchestration of zero copy can be done either at the transport layer, the target layer or the bdev module layer.
> > >
> > > We don't like the orchestration done from the transport layer because:
> > > 1) There is no clear way to deal with the bdev_io acquired from zcopy versus the bdev_io acquired from the read/write codepaths.
> > > 2) If I understand correctly, by the time a read/write request is submitted to the target, the data_out would have happened. So the protocol validation happens after the data_out phase.
> > > For example, an LBA out of range or invalid SGL length error is returned only after the data_out is performed. We want to run through the protocol validation before the data_out phase.
> > >
> > > I feel the bdev module is the ideal place to deal with zero copy buffers. The bdev module knows how to get buffers and when to claim them. All it needs is a proper cleanup, and by the time the request hits the bdev module a bdev_io is acquired, so there is no need to change the bdev_io life cycle or introduce a new I/O type.
> > >
> > > I propose the following:
> > > 1) Host submits Write request
> > > 2) Protocol validation of request (spdk_nvmf_request_exec)
> > > 3) Submit request to bdev module via spdk_bdev_writev_blocks(); bdev_io is acquired
> > > 4) Bdev module determines that buffers need to be allocated and performs something like what is proposed in the patch: spdk_bdev_zcopy_start(populate=false)
> > > 5) New spdk_nvmf_request_exec_status (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> > > 6) Transport does R2T/RDMA_READ
> > > ====wait for host====
> > > 7) Host transfers data (data_out/RDMA_READ_COMPLETE)
> > > 8) Perform Write again (spdk_nvmf_request_exec/spdk_bdev_writev_blocks) to bdev_module
> > > ====wait for bdev module to respond====
> > > 9) On bdev_module response, send rsp/status to host
> > > 10) Free req resources and free the bdev_io with spdk_bdev_free_io. Here we need a hook to free resources allocated by the bdev module: spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > >
> > > Read path:
> > > 1) Host submits Read request
> > > 2) Protocol validation of request
> > > 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
> > > ====wait for data blocks from bdev_module====
> > > 4) Respond iov (data_in/RDMA_WRITE)
> > > ====wait for host====
> > > 5) Host ack/RDMA_WRITE_COMPLETE
> > > 6) Free req resources and free the bdev_io with spdk_bdev_free_io. Here we need a hook to free resources allocated by the bdev module: spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> > >
> > > If it makes sense we can work on a patch.
> > >
> > > Regards,
> > > ~Sriram
> > >
> > > -----Original Message-----
> > > From: Walker, Benjamin
> > > Sent: Monday, July 9, 2018 11:22 PM
> > > To: Meneghini, John; Popuri, Sriram; Kaligotla, Srikanth
> > > Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin; spdk(a)lists.01.org; Pai, Madhu; NGC-john.barnard-broadcom.com
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > >
> > > On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > > > Hi Jim.
> > > >
> > > > This is the patch series I believe Ben is proposing to replace Srikanth's: https://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > > >
> > > > pick a21bbcf2 nvmf: Move data buffer pool to generic layer
> > > > pick 09586e54 bdev: Add a zero copy I/O path
> > > > pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
> > > > pick 40a1f62b bdevperf: Use new zcopy API for reads
> > > > pick eba6f35a bdev: Emulate zero copy support when necessary
> > > >
> > > > Is this correct?
> > >
> > > My patch series doesn't entirely replace Srikanth's set of patches. There are two separate things necessary to implement zero copy.
= > > > First is the infrastructure within the bdev layer to make requests = > > > of a bdev module to obtain or release a suitable buffer for a zero = > > > copy operation. Second is actually making the NVMe-oF target use = > > > that mechanism, which includes extending the bdev_io lifetime. My = > > > patch series only addresses the first part. > > > = > > > Thanks, > > > Ben > > > = > > > = > > > = > > > > = > > > > /John > > > > = > > > > =EF=BB=BFOn 7/9/18, 11:31 AM, "Harris, James R" > > > > wrote: > > > > = > > > > Hi Sriram: > > > > = > > > > Ben has some later patches in this series with some examples: > > > > = > > > > https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - = > > > > implements this zero copy API in the simple =E2=80=98malloc=E2=80= =99 bdev module > > > > https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the = > > > > zero copy API in the SPDK bdevperf utility > > > > = > > > > -Jim > > > > = > > > > = > > > > On 7/9/18, 3:34 AM, "Popuri, Sriram" > > > > wrote: > > > > = > > > > Sorry was not focusing on this change. Just give me a day = > > > > or two to get back. > > > > From a quick glance I didn't understand how zcopy = > > > > start/end fits into a read/write life cycle. Is there an example = > > > > on how zcopy start/end is consumed or someone can do give me a = > > > > quick dump on how its envisioned to be used. > > > > = > > > > Regards, > > > > ~Sriram > > > > = > > > > -----Original Message----- > > > > From: Meneghini, John = > > > > Sent: Friday, July 6, 2018 10:16 PM > > > > To: Walker, Benjamin ; Verkamp= , = > > > > Daniel ; Harris, James R = > > > > ; Kaligotla, Srikanth = > > > > > > > om> > > > > Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; = > > > > Rodriguez, Edwin ; Pai, Madhu = > > > > ; NGC- john.barnard-broadcom.com = > > > > ; Popuri, Sriram > > > Popuri(a)netapp.com> > > > > Subject: Re: BDEV-IO Lifecycle - Need your input. > > > > = > > > > I'm adding Sriram to this thread. Sriram is another = > > > > NetApp engineer who is working on this stuff internally. > > > > = > > > > /John > > > > = > > > > On 7/6/18, 12:41 PM, "Walker, Benjamin" = > > > > > > > > wrote: > > > > = > > > > Hi Srikanth, > > > > = > > > > I wanted to check in to see if you had any additional = > > > > feedback on the zero copy > > > > operations in the bdev layer here: > > > > = > > > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/ > > > > = > > > > This does not address extending the lifetime of the = > > > > bdev_io in the NVMe-oF > > > > target, but I think this is looking like the right = > > > > mechanism for the bdev layer. > > > > = > > > > Thanks, > > > > Ben > > > > = > > > > On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrot= e: > > > > > Thanks Srikanth. Sounds like the spdk_bdev_io pool = > > > > sizing along with > > > > > spdk_bdev_queue_io_wait() meets your needs then = > > > > regarding the spdk_bdev_io > > > > > memory. > > > > > = > > > > > Regarding zcopy_start/zcopy_end =E2=80=93 it looks li= ke = > > > > you=E2=80=99ve already added a bunch > > > > > of comments to Ben=E2=80=99s patch on GerritHub. For= now = > > > > I=E2=80=99d say let=E2=80=99s continue our > > > > > discussion there. I=E2=80=99ve responded to a couple= of = > > > > similar questions there and > > > > > I=E2=80=99m sure Ben will have more replies tomorrow. 
> > > > > = > > > > > -Jim > > > > > = > > > > > = > > > > > From: "Kaligotla, Srikanth" > > > > > > > > > Date: Thursday, June 21, 2018 at 5:01 PM > > > > > To: James Harris , = > > > > "Walker, Benjamin" > > > > > > > > er(a)intel.com>, Daniel Verkamp > > > > > Cc: "raju.gottumukkala(a)broadcom.com" = > > > > , > > > > > "Meneghini, John" , = > > > > "Rodriguez, Edwin" > > > > z(a)netapp.com>, "Pai, Madhu" = > > > > , > > > > "NGC- > > > > john.barnard- > > > > > broadcom.com" , = > > > > "spdk(a)lists.01.org" < spdk(a)lists.01. > > > > > org> > > > > > Subject: RE: BDEV-IO Lifecycle - Need your input. > > > > > = > > > > > Hi Jim, > > > > > = > > > > > I wish I joined the community meeting, I also missed = the > > > > > spdk_bdev_queue_io_wait(). > > > > > = > > > > > So, there are 2 issues I am intending to solve, > > > > > = > > > > > 1. We want to ensure that an instance of bdev-i= o is > > > > acquired prior to or > > > > > along with I/O data buffer. This allows for better = > > > > error handling when bdev_io > > > > > pool is exhausted. Yes, we can solve it via sizing = > > > > the pool right. The changes > > > > > queue-io-wait also addresses the problem. > > > > > 2. Most importantly, by extending the life of b= dev- > > > > io > > > > (Same life span as > > > > > nvmf_request structure), abort use case and other = > > > > use cases that involve > > > > > cleaning up of IO data buffer are accomplished = > > > > effectively. Let me elaborate; > > > > > bdev-io is the fabric that connects the nvme command = > > > > and operation with the > > > > > backend. The driver context and data buffer context = > > > > is stored in bdev-io. The > > > > > freeing up of bdev_io resource is pushed to the end = > > > > so the I/O cleanup can > > > > > happen after controller has transmitted the data to = > > > > the host. In absence of > > > > > bdev_io, we will have end up adding more and more = > > > > void context in request > > > > > structure. Hence the push to extend the life of bdev_= io. > > > > > = > > > > > I see Ben=E2=80=99s recent patch =E2=80=9Czcopy_start= =E2=80=9D and = > > > > =E2=80=9Czcopy_end=E2=80=9D; We can make that work > > > > > as long as the bdev-io allocated/acquired stays till = > > > > the end. > > > > One of the > > > > > challenges I see with that patch is defining a = > > > > callback for the I/O > > > > > submission. For instance, zopy_start will allocate = > > > > bdev_io and submits the I/O > > > > > to the bdev device. Underlying implementation can be = > > > > synchronous or can be > > > > > asynchronous. The callback for this submission = > > > > should check to see if it is a > > > > > BUFFER-READY or PENDING and accordingly relay it = > > > > back to transport. Next phase > > > > > is actual I/O submission. Let=E2=80=99s say, it reuse= s the = > > > > bdev-io obtained in ZCOPY- > > > > > START, now the callback should determine if it is a = > > > > success or failure. > > > > > Finally when ZCOPY-END is invoked the supplied = > > > > bdev_io will have all necessary > > > > > data to release the WRITE buffer or unlock the read = > > > > buffer based on the > > > > > operation performed. > > > > > = > > > > > I hope I=E2=80=99m making sense. I guess, my effort i= s to = > > > > extend the lifespan of bdev- > > > > > io and let it continue to host the driver context = > > > > and buffer context populated > > > > > during the BEGIN phase. 
> > > > > = > > > > > Regards, > > > > > Srikanth > > > > > = > > > > > From: Harris, James R = > > > > > Sent: Thursday, June 21, 2018 2:21 PM > > > > > To: Kaligotla, Srikanth = > > > > ; Walker, Benjamin > > > > jamin.walker(a)intel.com>; Verkamp, Daniel = > > > > > > > om> > > > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John = > > > > > > > > >; Rodriguez, Edwin ; Pai, = > > > > Madhu > > > > pp.com>; NGC-john.barnard-broadcom.com = > > > > > > > m>; spdk(a)lists > > > > > .01.org > > > > > Subject: Re: BDEV-IO Lifecycle - Need your input. > > > > > = > > > > > Hi Srikanth, > > > > > = > > > > > Following up on this thread and the discussion in = > > > > yesterday=E2=80=99s community > > > > > meeting. > > > > > = > > > > > The recent spdk_bdev_queue_io_wait() changes allow = > > > > an SPDK application to work > > > > > better in general when the spdk_bdev_io buffer pool = > > > > is exhausted. We still > > > > > need to make changes to the NVMe-oF target to use = > > > > this new API but that should > > > > > be done soon-ish. > > > > > = > > > > > With that in place, the spdk_bdev_io buffer pool = > > > > itself can be configured up > > > > > or down when the application starts. Currently the = > > > > default is 64K > > > > > spdk_bdev_io buffers. sizeof(struct spdk_bdev_io) = > > > > =3D=3D > > > > 216 plus the per-IO > > > > > context size allocated for the bdev module. This = > > > > can be up to > > > > 192 bytes > > > > > (virtio bdev module) but is likely much smaller for = > > > > you depending on the > > > > > context size for your ontap bdev module. > > > > > = > > > > > Let=E2=80=99s assume your per-IO context size is 64 b= ytes. = > > > > 64K x (192 + > > > > 64) =3D 16MB. > > > > > = > > > > > I=E2=80=99m not sure how many spdk_bdev_io you need i= n = > > > > flight at any given time. 64K > > > > > seems like a lot but I=E2=80=99d be curious to hear y= our = > > > > thoughts on this. If this is > > > > > the right number, then worst case, there would be = > > > > about 16MB of DRAM that > > > > > would sit unused if the NVMe-oF target in your = > > > > system was not active. Is that > > > > > too burdensome for your application? > > > > > = > > > > > Thanks, > > > > > = > > > > > -Jim > > > > > = > > > > > = > > > > > = > > > > > From: "Kaligotla, Srikanth" > > > > > > > > > Date: Wednesday, June 20, 2018 at 12:47 PM > > > > > To: "Walker, Benjamin" , = > > > > James Harris > > > > is(a)intel.com>, Daniel Verkamp > > > > > Cc: "raju.gottumukkala(a)broadcom.com" = > > > > , > > > > > "Meneghini, John" , = > > > > "Rodriguez, Edwin" > > > > z(a)netapp.com>, "Pai, Madhu" = > > > > , > > > > "NGC- > > > > john.barnard- > > > > > broadcom.com" , = > > > > "spdk(a)lists.01.org" < spdk(a)lists.01. > > > > > org> > > > > > Subject: RE: BDEV-IO Lifecycle - Need your input. > > > > > = > > > > > Hello, > > > > > = > > > > > First revision of changes to extend the lifecycle of = > > > > bdev_io are available for > > > > > review. I would like to solicit your input on the = > > > > proposed API/Code flow > > > > > changes. 
> > > > > = > > > > > https://review.gerrithub.io/c/spdk/spdk/+/415860 > > > > > = > > > > > Thanks, > > > > > Srikanth > > > > > = > > > > > = > > > > > From: "Kaligotla, Srikanth" > > > > > > > > > Date: Friday, May 11, 2018 at 2:27 PM > > > > > To: "Walker, Benjamin" , = > > > > "Harris, James R" > > > > .harris(a)intel.com> > > > > > Cc: "raju.gottumukkala(a)broadcom.com" = > > > > , > > > > > "Meneghini, John" , = > > > > "Rodriguez, Edwin" > > > > z(a)netapp.com>, "Pai, Madhu" = > > > > , > > > > "NGC- > > > > john.barnard- > > > > > broadcom.com" , = > > > > "spdk(a)lists.01.org" < spdk(a)lists.01. > > > > > org> > > > > > Subject: RE: BDEV-IO Lifecycle - Need your input. > > > > > = > > > > > CC: List > > > > > = > > > > > Hi Ben, > > > > > = > > > > > Your proposal to interface with the backend to = > > > > acquire and release buffers is > > > > > good. You have accurately stated that the challenge = > > > > is in developing intuitive > > > > > semantics. And that has been my struggle. To me, = > > > > there are two problem > > > > > statements; > > > > > = > > > > > 1. It is expected that bdev_io pool is sized > > > > correctly > > > > so > > > > the call to > > > > > get bdev_io succeeds. Failure to acquire bdev_io = > > > > will result in DEVICE-ERROR. > > > > > The transport is already capable of handling = > > > > temporary memory failures by > > > > > moving the request to PENDING-Queue. Hence the = > > > > proposal to change the bdev_io > > > > > lifecycle and perhaps connect it with the = > > > > spdk_nvmf_request object. Thus all > > > > > buffer needs are addressed at the beginning of I/O > > > > request. > > > > > 2. I/O Data buffers are sourced and managed by = the > > > > backend. One of the > > > > > challenges I see with your proposed interface is the = > > > > lack of details like if > > > > > resource is being acquired for a READ operation or = > > > > WRITE operation. The > > > > > handling is quite different in each case. Since = > > > > bdev_io->type is overloaded > > > > > the type of I/O operation is lost. I suppose one can = > > > > cast the cb_arg > > > > > (nvmf_request) and then proceed. Also, the bdev_io = > > > > should be present to > > > > > RELEASE the buffer. Zero copy semantics warrants = > > > > that data buffer stays until > > > > > controller-to-host has occurred. In other words, = > > > > bdev_io lives till REQUEST > > > > > come to COMPLETE state. > > > > > = > > > > > What are your thoughts in introducing = > > > > spdk_bdev_init() and > > > > spdk_bdev_fini() as > > > > > an alternative approach to extend the lifecyle of = > > > > bdev_io and allow data > > > > > buffer management via bdev fn_table ? > > > > > = > > > > > I hope I=E2=80=99m making sense=E2=80=A6 > > > > > = > > > > > Thanks, > > > > > Srikanth > > > > > = > > > > > From: Walker, Benjamin = > > > > > Sent: Friday, May 11, 2018 12:28 PM > > > > > To: Harris, James R ; = > > > > Kaligotla, Srikanth > > > > Kaligotla(a)netapp.com> > > > > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John = > > > > > > > > >; Rodriguez, Edwin ; Pai, = > > > > Madhu > > > > pp.com>; NGC-john.barnard-broadcom.com = > > > > > > > m> > > > > > Subject: Re: BDEV-IO Lifecycle - Need your input. > > > > > = > > > > > Hi Srikanth, > > > > > = > > > > > Yes - we'll need to introduce some way to acquire = > > > > and release buffers from the > > > > > bdev layer earlier on in the state machine that = > > > > processes an NVMe-oF request. 
> > > > > I've had this patch out for review for several = > > > > months as a proposal for this > > > > > scenario: > > > > > = > > > > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/ > > > > > = > > > > > It doesn't pass the tests - it's just a proposal for = > > > > the interface. 90% of the > > > > > challenge here is in developing intuitive semantics. > > > > > = > > > > > Thanks, > > > > > Ben > > > > > = > > > > > P.S. This is the kind of discussion that would fit = > > > > perfectly on the mailing > > > > > list. > > > > > = > > > > > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, = > > > > Srikanth > > > > wrote: > > > > > > Hi Ben, Hi James, > > > > > > = > > > > > > I would like to solicit opinions on the lifecycle = > > > > for bdev-io resource > > > > > > object. Attached is the image of RDMA state = > > > > machine in its current > > > > > > implementation. When the REQUEST enters the = > > > > NEED-BUFFER state, the buffers > > > > > > necessary for carrying out I/O operation are = > > > > allocated/acquired from the > > > > > > memory pool. An instance of BDEV-IO comes into = > > > > existence after REQUEST > > > > > > reaches READY-TO-EXECUTE state. The BDEV-IO is = > > > > teared down as soon the > > > > > > backend returns. From a BDEV perspective, BDEV-IO = > > > > is simply a translation > > > > > > unit that facilitates I/O buffers from the backend. = > > > > The driver context > > > > > > embedded within bdev_io holds great deal of = > > > > information pertaining to the > > > > > > I/O under execution. It assists in error handling, = > > > > dereferencing the buffers > > > > > > upon I/O completion and in abort handling. In = > > > > summary, the bdev_io stays > > > > > > alive until request has come to COMPLETE state. = > > > > I=E2=80=99d like to hear peoples > > > > > > thoughts in introducing the plumbing to acquire = > > > > BDEV-IO resource in REQUEST- > > > > > > NEED-BUFFER state and release it in = > > > > REQUEST-COMPLETE state. I will shortly > > > > > > have patch available for review that introduces = > > > > spdk_bdev_init and > > > > > > spdk_bdev_fini which in turn invokes the = > > > > corresponding bdev fn_table to > > > > > > initialize/cleanup. = > > > > > > = > > > > > > I wanted to use this email to communicate our = > > > > intent and solicit your > > > > > > feedback. We have a working implementation of the = > > > > above proposal and prior > > > > > > to pushing it upstream for review would like to = > > > > hear your thoughts. These > > > > > > proposed changes to upstream are a result of FC = > > > > transport work in > > > > > > collaboration with the Broadcom team who are also = > > > > copied to this mail. > > > > > > Myself and John will be at SPDK Dev conference and = > > > > if required we can > > > > > > elaborate further on this proposal. 
> > > > > > Thanks,
> > > > > > Srikanth

_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk