From: Walker, Benjamin
Subject: Re: [SPDK] Zero Copy (was BDEV-IO Lifecycle - Need your input.)
Date: Wed, 11 Jul 2018 21:52:30 +0000
To: spdk@lists.01.org

I've done a bunch of work here to formalize the zero copy API in the bdev
layer. The patch series begins here:

https://review.gerrithub.io/#/c/spdk/spdk/+/386166/

There has been some good discussion on the patches themselves, but I wanted to
circle back to the mailing list for a bit. The basic idea is to add two new
functions to the bdev layer:

int spdk_bdev_zcopy_start(struct spdk_bdev_desc *desc,
			  struct spdk_io_channel *ch,
			  uint64_t offset_blocks, uint64_t num_blocks,
			  bool populate,
			  spdk_bdev_io_completion_cb cb, void *cb_arg);
int spdk_bdev_zcopy_end(struct spdk_bdev_io *bdev_io, bool commit,
			spdk_bdev_io_completion_cb cb, void *cb_arg);

The zcopy start call makes a request to the backing block device to provide a
region of memory suitable for performing a zero copy update to the region
described by the offset_blocks and num_blocks parameters. The populate flag
determines whether the "live" data is present in the given buffer. When this
request completes, it returns a bdev_io in the completion callback. That
bdev_io will have a scatter gather list filled out describing the memory
region.

When the user has finished transferring data into or out of the data buffer,
the user calls spdk_bdev_zcopy_end with the originally provided bdev_io. The
commit flag indicates whether the data in the buffer needs to be made "live".
The NVMe-oF target can take advantage of these new APIs by altering the state
machine in the RDMA layer to call spdk_bdev_zcopy_start to obtain data buffers
after parsing the command, then performing RDMA transfers into or out of the
provided region, and finally completing the command with spdk_bdev_zcopy_end.
The work to do those modifications has not been done yet, and I expect them to
be a significant amount of effort.

As many bdev modules as possible should attempt to support zero copy
semantics, but for those bdev modules that aren't able to, one of the patches
in my series will automatically emulate the zcopy commands using regular reads
and writes. This allows the higher level users of the bdev API to always make
the zcopy calls for simplicity. The emulation path should be no worse, in
terms of performance, than doing a regular read or write.

I believe I've addressed all known use cases with this design, and while there
will certainly be some future patches to fix problems that come up, I'm
confident that we're on the right track here. If there is any feedback in this
area at all, now is the time to provide it.

I think this also leads into some really awesome future work that I'll suggest
in the hope that someone in the community is inspired. One idea is that the
bdev_nvme module could support zero copy semantics by returning a pointer to a
portion of a Controller Memory Buffer or Persistent Memory Buffer suitable for
data. There aren't many devices for sale today that can do this, but it's in
the spec and I imagine they're coming. This would allow the NVMe-oF target,
for example, to perform RDMA transfers peer-to-peer between the NIC and the
NVMe SSD. Similarly, for the bdev_pmem module, direct pointers into the
persistent memory could be returned.
The NVMe-oF target could then perform RDMA transfers into or out of the
persistent memory directly, and the zcopy end call with commit set to true
would perform the required flush instructions.

Thanks,
Ben

On Mon, 2018-07-09 at 19:19 +0000, Popuri, Sriram wrote:
> I would like to jot down what requirements we are looking at. Wherever there
> is "====wait" for something, there is a potential context switch.
> 
> For a write request, the following is the workflow for zero copy:
> 1) Host submits Write request
> 2) Protocol validation of request
> 3) Get iovs
> 4) R2T/RDMA_READ
> ====wait for host====
> 5) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 6) Perform Write (spdk_bdev_writev_blocks) to bdev_module
> ====wait for bdev module to respond====
> 7) On bdev_module response, send rsp/status to host
> 8) Free req resources
> 
> Read path:
> 1) Host submits Read request
> 2) Protocol validation of request
> 3) Submit read to bdev_module
> ====wait for data blocks from bdev_module====
> 4) Respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) Host ack/RDMA_WRITE_COMPLETE
> 6) Free req resources
> 
> In the future we will have Persistent Reservations, where there can be
> read/modify/update cases depending on how the bdev module handles them. So
> the zero copy APIs need to address future requirements as well.
> 
> The above workflows suggest we require a state to track what step the
> request is at. A state will help to clean up resources properly when
> handling aborts, disconnects, namespace destroy workflows, etc.
> So the requirement is to figure out a way to clean up resources according to
> the state of the bdev_io in its life cycle, and stitch the zero copy
> workflows into the existing SPDK code, which assumes buffers are
> pre-allocated.
> 
> I think Srikanth's proposed changes are made in the target layer to populate
> and commit to the bdev module.
> However his suggestion changes the bdev_io life cycle, which I understand
> from John is not agreeable to others.
> 
> With the new zcopy APIs also there is no clarity on the bdev_io life cycle.
> A bdev_io is acquired during zcopy_start, and it's not clear how to deal
> with the bdev_io allocated in the read/write paths.
> 
> The orchestration of zero copy can be done either at the transport layer,
> the target layer, or the bdev module layer.
> 
> We don't like the orchestration done from the transport layer because:
> 1) There is no clear way to deal with a bdev_io acquired from zcopy versus a
> bdev_io acquired from the read/write code paths.
> 2) If I understand correctly, by the time a read/write request is submitted
> to the target, the data_out would have happened. So the protocol validation
> happens after the data_out phase. For example, an LBA out of range request
> or an invalid SGL length is returned after a data_out is performed. We want
> to run through the protocol validation before the data_out phase.
> 
> I feel the bdev module is the ideal place to deal with zero copy buffers.
> The bdev module knows how to get buffers and when to claim them. All it
> needs is proper cleanup, and by the time a request hits the bdev module, a
> bdev_io is acquired, so there is no need to change the bdev_io life cycle or
> introduce a new I/O type.
> 
> I propose the following:
> 1) Host submits Write request
> 2) Protocol validation of request
>    spdk_nvmf_request_exec
> 3) Submit request to bdev module
>    spdk_bdev_writev_blocks(); bdev_io is acquired
> 4) bdev module determines the buffers need to be allocated.
>    It will perform something like what is proposed in the patch:
>    spdk_bdev_zcopy_start(populate=false)
> 5) New spdk_nvmf_request_exec_status
>    (SPDK_NVMF_REQUEST_EXEC_STATUS_BUFF_READY)
> 6) Transport does R2T/RDMA_READ
> ====wait for host====
> 7) Host transfers data (data_out/RDMA_READ_COMPLETE)
> 8) Perform Write again (spdk_nvmf_request_exec/spdk_bdev_writev_blocks) to
> bdev_module
> ====wait for bdev module to respond====
> 9) On bdev_module response, send rsp/status to host
> 10) Free req resources
>     Free bdev_io with spdk_bdev_free_io.
>     Here we need a hook to free resources allocated by the bdev module:
>     spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> 
> Read path:
> 1) Host submits Read request
> 2) Protocol validation of request
> 3) Submit the request (spdk_bdev_readv_blocks) to bdev_module
> ====wait for data blocks from bdev_module====
> 4) Respond iov (data_in/RDMA_WRITE)
> ====wait for host====
> 5) Host ack/RDMA_WRITE_COMPLETE
> 6) Free req resources
>    Free bdev_io with spdk_bdev_free_io.
>    Here we need a hook to free resources allocated by the bdev module:
>    spdk_bdev_free_io_driver_ctx(bdev_io->driver_ctx)
> 
> If it makes sense we can work on a patch.
> 
> Regards,
> ~Sriram
> 
> -----Original Message-----
> From: Walker, Benjamin
> Sent: Monday, July 9, 2018 11:22 PM
> To: Meneghini, John; Popuri, Sriram; Kaligotla, Srikanth
> Cc: raju.gottumukkala(a)broadcom.com; Rodriguez, Edwin; spdk(a)lists.01.org;
> Pai, Madhu; NGC-john.barnard-broadcom.com
> Subject: Re: BDEV-IO Lifecycle - Need your input.
> 
> On Mon, 2018-07-09 at 16:18 +0000, Meneghini, John wrote:
> > Hi Jim.
> > 
> > This is the patch series I believe Ben is proposing to replace Srikanth's:
> > https://review.gerrithub.io/#/c/spdk/spdk/+/415860/
> > 
> > pick a21bbcf2 nvmf: Move data buffer pool to generic layer
> > pick 09586e54 bdev: Add a zero copy I/O path
> > pick 1cc9fe95 bdev: Make malloc bdev use the new zero copy mechanism
> > pick 40a1f62b bdevperf: Use new zcopy API for reads
> > pick eba6f35a bdev: Emulate zero copy support when necessary
> > 
> > Is this correct?
> 
> My patch series doesn't entirely replace Srikanth's set of patches. There
> are two separate things necessary to implement zero copy.
> First is the infrastructure within the bdev layer to make requests of a
> bdev module to obtain or release a suitable buffer for a zero copy
> operation. Second is actually making the NVMe-oF target use that mechanism,
> which includes extending the bdev_io lifetime. My patch series only
> addresses the first part.
> 
> Thanks,
> Ben
> 
> > 
> > /John
> > 
> > On 7/9/18, 11:31 AM, "Harris, James R" wrote:
> > 
> >     Hi Sriram:
> > 
> >     Ben has some later patches in this series with some examples:
> > 
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/386167/ - implements
> >     this zero copy API in the simple 'malloc' bdev module
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/416579/ - uses the
> >     zero copy API in the SPDK bdevperf utility
> > 
> >     -Jim
> > 
> >     On 7/9/18, 3:34 AM, "Popuri, Sriram" wrote:
> > 
> >     Sorry, I was not focusing on this change. Just give me a day or
> >     two to get back.
> >     From a quick glance I didn't understand how zcopy start/end fits
> >     into a read/write life cycle. Is there an example of how zcopy
> >     start/end is consumed, or can someone give me a quick dump on how
> >     it's envisioned to be used?
> > 
> >     Regards,
> >     ~Sriram
> > 
> >     -----Original Message-----
> >     From: Meneghini, John
> >     Sent: Friday, July 6, 2018 10:16 PM
> >     To: Walker, Benjamin; Verkamp, Daniel; Harris, James R; Kaligotla,
> >     Srikanth
> >     Cc: raju.gottumukkala(a)broadcom.com; spdk(a)lists.01.org; Rodriguez,
> >     Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; Popuri, Sriram
> >     Subject: Re: BDEV-IO Lifecycle - Need your input.
> > 
> >     I'm adding Sriram to this thread. Sriram is another NetApp
> >     engineer who is working on this stuff internally.
> > 
> >     /John
> > 
> > On 7/6/18, 12:41 PM, "Walker, Benjamin" wrote:
> > 
> >     Hi Srikanth,
> > 
> >     I wanted to check in to see if you had any additional feedback on the
> >     zero copy operations in the bdev layer here:
> > 
> >     https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > 
> >     This does not address extending the lifetime of the bdev_io in the
> >     NVMe-oF target, but I think this is looking like the right mechanism
> >     for the bdev layer.
> > 
> >     Thanks,
> >     Ben
> > 
> >     On Thu, 2018-06-21 at 18:11 -0700, Harris, James R wrote:
> > > Thanks Srikanth. Sounds like the spdk_bdev_io pool sizing along with
> > > spdk_bdev_queue_io_wait() meets your needs then regarding the
> > > spdk_bdev_io memory.
> > > 
> > > Regarding zcopy_start/zcopy_end - it looks like you've already added a
> > > bunch of comments to Ben's patch on GerritHub. For now I'd say let's
> > > continue our discussion there. I've responded to a couple of similar
> > > questions there and I'm sure Ben will have more replies tomorrow.
> > > 
> > > -Jim
> > > 
> > > From: "Kaligotla, Srikanth"
> > > Date: Thursday, June 21, 2018 at 5:01 PM
> > > To: James Harris; Walker, Benjamin; Daniel Verkamp
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > 
> > > Hi Jim,
> > > 
> > > I wish I had joined the community meeting; I also missed the
> > > spdk_bdev_queue_io_wait().
> > > 
> > > So, there are 2 issues I am intending to solve:
> > > 
> > > 1. We want to ensure that an instance of bdev-io is acquired prior to
> > > or along with the I/O data buffer.
> > > This allows for better error handling when the bdev_io pool is
> > > exhausted. Yes, we can solve it via sizing the pool right. The
> > > queue-io-wait changes also address the problem.
> > > 2. Most importantly, by extending the life of bdev-io (same life span
> > > as the nvmf_request structure), the abort use case and other use cases
> > > that involve cleaning up the I/O data buffer are handled effectively.
> > > Let me elaborate: bdev-io is the fabric that connects the nvme command
> > > and operation with the backend. The driver context and data buffer
> > > context are stored in bdev-io. The freeing of the bdev_io resource is
> > > pushed to the end so the I/O cleanup can happen after the controller
> > > has transmitted the data to the host. In the absence of bdev_io, we
> > > will end up adding more and more void context in the request structure.
> > > Hence the push to extend the life of bdev_io.
> > > 
> > > I see Ben's recent patch "zcopy_start" and "zcopy_end"; we can make
> > > that work as long as the bdev-io allocated/acquired stays till the end.
> > > One of the challenges I see with that patch is defining a callback for
> > > the I/O submission. For instance, zcopy_start will allocate a bdev_io
> > > and submit the I/O to the bdev device. The underlying implementation
> > > can be synchronous or asynchronous. The callback for this submission
> > > should check to see if it is BUFFER-READY or PENDING and accordingly
> > > relay it back to the transport. The next phase is the actual I/O
> > > submission. Let's say it reuses the bdev-io obtained in ZCOPY-START;
> > > now the callback should determine if it is a success or failure.
> > > Finally, when ZCOPY-END is invoked, the supplied bdev_io will have all
> > > the necessary data to release the WRITE buffer or unlock the read
> > > buffer based on the operation performed.
> > > 
> > > I hope I'm making sense. I guess my effort is to extend the lifespan of
> > > bdev-io and let it continue to host the driver context and buffer
> > > context populated during the BEGIN phase.
> > > 
> > > Regards,
> > > Srikanth
> > > 
> > > From: Harris, James R
> > > Sent: Thursday, June 21, 2018 2:21 PM
> > > To: Kaligotla, Srikanth; Walker, Benjamin; Verkamp, Daniel
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > 
> > > Hi Srikanth,
> > > 
> > > Following up on this thread and the discussion in yesterday's community
> > > meeting.
> > > 
> > > The recent spdk_bdev_queue_io_wait() changes allow an SPDK application
> > > to work better in general when the spdk_bdev_io buffer pool is
> > > exhausted. We still need to make changes to the NVMe-oF target to use
> > > this new API, but that should be done soon-ish.
> > > 
> > > With that in place, the spdk_bdev_io buffer pool itself can be
> > > configured up or down when the application starts. Currently the
> > > default is 64K spdk_bdev_io buffers. sizeof(struct spdk_bdev_io) ==
> > > 216, plus the per-IO context size allocated for the bdev module. This
> > > can be up to 192 bytes (virtio bdev module) but is likely much smaller
> > > for you, depending on the context size for your ontap bdev module.
> > > 
> > > Let's assume your per-IO context size is 64 bytes. 64K x (192 + 64) =
> > > 16MB.
> > > 
> > > I'm not sure how many spdk_bdev_io you need in flight at any given
> > > time. 64K seems like a lot, but I'd be curious to hear your thoughts on
> > > this. If this is the right number, then worst case there would be about
> > > 16MB of DRAM that would sit unused if the NVMe-oF target in your system
> > > was not active. Is that too burdensome for your application?
> > > 
> > > Thanks,
> > > 
> > > -Jim
> > > 
> > > From: "Kaligotla, Srikanth"
> > > Date: Wednesday, June 20, 2018 at 12:47 PM
> > > To: Walker, Benjamin; James Harris; Daniel Verkamp
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > 
> > > Hello,
> > > 
> > > The first revision of changes to extend the lifecycle of bdev_io is
> > > available for review. I would like to solicit your input on the
> > > proposed API/code flow changes.
> > > 
> > > https://review.gerrithub.io/c/spdk/spdk/+/415860
> > > 
> > > Thanks,
> > > Srikanth
> > > 
> > > From: "Kaligotla, Srikanth"
> > > Date: Friday, May 11, 2018 at 2:27 PM
> > > To: Walker, Benjamin; Harris, James R
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com; spdk(a)lists.01.org
> > > Subject: RE: BDEV-IO Lifecycle - Need your input.
> > > 
> > > CC: List
> > > 
> > > Hi Ben,
> > > 
> > > Your proposal to interface with the backend to acquire and release
> > > buffers is good.
> > > You have accurately stated that the challenge is in developing
> > > intuitive semantics. And that has been my struggle. To me, there are
> > > two problem statements:
> > > 
> > > 1. It is expected that the bdev_io pool is sized correctly so the call
> > > to get a bdev_io succeeds. Failure to acquire a bdev_io will result in
> > > DEVICE-ERROR. The transport is already capable of handling temporary
> > > memory failures by moving the request to the PENDING queue. Hence the
> > > proposal to change the bdev_io lifecycle and perhaps connect it with
> > > the spdk_nvmf_request object. Thus all buffer needs are addressed at
> > > the beginning of the I/O request.
> > > 2. I/O data buffers are sourced and managed by the backend. One of the
> > > challenges I see with your proposed interface is the lack of details,
> > > like whether a resource is being acquired for a READ operation or a
> > > WRITE operation. The handling is quite different in each case. Since
> > > bdev_io->type is overloaded, the type of I/O operation is lost. I
> > > suppose one can cast the cb_arg (nvmf_request) and then proceed. Also,
> > > the bdev_io should be present to RELEASE the buffer. Zero copy
> > > semantics warrant that the data buffer stays until controller-to-host
> > > has occurred. In other words, bdev_io lives till the REQUEST comes to
> > > the COMPLETE state.
> > > 
> > > What are your thoughts on introducing spdk_bdev_init() and
> > > spdk_bdev_fini() as an alternative approach to extend the lifecycle of
> > > bdev_io and allow data buffer management via the bdev fn_table?
> > > 
> > > I hope I'm making sense…
> > > 
> > > Thanks,
> > > Srikanth
> > > 
> > > From: Walker, Benjamin
> > > Sent: Friday, May 11, 2018 12:28 PM
> > > To: Harris, James R; Kaligotla, Srikanth
> > > Cc: raju.gottumukkala(a)broadcom.com; Meneghini, John; Rodriguez,
> > > Edwin; Pai, Madhu; NGC-john.barnard-broadcom.com
> > > Subject: Re: BDEV-IO Lifecycle - Need your input.
> > > 
> > > Hi Srikanth,
> > > 
> > > Yes - we'll need to introduce some way to acquire and release buffers
> > > from the bdev layer earlier on in the state machine that processes an
> > > NVMe-oF request. I've had this patch out for review for several months
> > > as a proposal for this scenario:
> > > 
> > > https://review.gerrithub.io/#/c/spdk/spdk/+/386166/
> > > 
> > > It doesn't pass the tests - it's just a proposal for the interface. 90%
> > > of the challenge here is in developing intuitive semantics.
> > > 
> > > Thanks,
> > > Ben
> > > 
> > > P.S. This is the kind of discussion that would fit perfectly on the
> > > mailing list.
> > > 
> > > On Wed, 2018-05-09 at 20:35 +0000, Kaligotla, Srikanth wrote:
> > > > Hi Ben, Hi James,
> > > > 
> > > > I would like to solicit opinions on the lifecycle of the bdev-io
> > > > resource object. Attached is an image of the RDMA state machine in
> > > > its current implementation. When the REQUEST enters the NEED-BUFFER
> > > > state, the buffers necessary for carrying out the I/O operation are
> > > > allocated/acquired from the memory pool. An instance of BDEV-IO
> > > > comes into existence after the REQUEST reaches the READY-TO-EXECUTE
> > > > state. The BDEV-IO is torn down as soon as the backend returns. From
> > > > a BDEV perspective, BDEV-IO is simply a translation unit that
> > > > facilitates I/O buffers from the backend.
> > > > The driver context embedded within bdev_io holds a great deal of
> > > > information pertaining to the I/O under execution. It assists in
> > > > error handling, dereferencing the buffers upon I/O completion, and
> > > > in abort handling. In summary, the bdev_io stays alive until the
> > > > request has come to the COMPLETE state. I'd like to hear people's
> > > > thoughts on introducing the plumbing to acquire the BDEV-IO resource
> > > > in the REQUEST-NEED-BUFFER state and release it in the
> > > > REQUEST-COMPLETE state. I will shortly have a patch available for
> > > > review that introduces spdk_bdev_init and spdk_bdev_fini, which in
> > > > turn invoke the corresponding bdev fn_table to initialize/cleanup.
> > > > 
> > > > I wanted to use this email to communicate our intent and solicit
> > > > your feedback. We have a working implementation of the above
> > > > proposal, and prior to pushing it upstream for review we would like
> > > > to hear your thoughts. These proposed changes to upstream are a
> > > > result of FC transport work in collaboration with the Broadcom team,
> > > > who are also copied on this mail. John and I will be at the SPDK Dev
> > > > conference, and if required we can elaborate further on this
> > > > proposal.
> > > > 
> > > > Thanks,
> > > > Srikanth