Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library

From: Bruce Richardson <bruce.richardson@intel.com>
To: fengchengwen <fengchengwen@huawei.com>
Cc: thomas@monjalon.net, ferruh.yigit@intel.com, dev@dpdk.org,
	nipun.gupta@nxp.com, hemant.agrawal@nxp.com,
	maxime.coquelin@redhat.com, honnappa.nagarahalli@arm.com,
	jerinj@marvell.com, david.marchand@redhat.com,
	jerinjacobk@gmail.com
Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
Date: Fri, 18 Jun 2021 10:30:55 +0100	[thread overview]
Message-ID: <YMxnz5EwXL3x07Yd@bricha3-MOBL.ger.corp.intel.com> (raw)
In-Reply-To: <25d29598-c26d-8497-2867-9b650c79df49@huawei.com>

On Fri, Jun 18, 2021 at 04:52:00PM +0800, fengchengwen wrote:
> On 2021/6/17 22:18, Bruce Richardson wrote:
> > On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>> device.
> >>>>>>>
> >>>>>>> The APIs of dmadev library exposes some generic operations which
> >>>>>>> can enable configuration and I/O with the DMA devices.
> >>>>>>>
> >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> ---
> >>>>>> Thanks for sending this.
> >>>>>>
> >>>>>> Of most interest to me right now are the key data-plane APIs.
> >>>>>> While we are still in the prototyping phase, below is a draft of
> >>>>>> what we are thinking for the key enqueue/perform_ops/completed_ops
> >>>>>> APIs.
> >>>>>>
> >>>>>> Some key differences I note in below vs your original RFC: * Use
> >>>>>> of void pointers rather than iova addresses. While using iova's
> >>>>>> makes sense in the general case when using hardware, in that it
> >>>>>> can work with both physical addresses and virtual addresses, if we
> >>>>>> change the APIs to use void pointers instead it will still work
> >>>>>> for DPDK in VA mode, while at the same time allow use of software
> >>>>>> fallbacks in error cases, and also a stub driver than uses memcpy
> >>>>>> in the background. Finally, using iova's makes the APIs a lot more
> >>>>>> awkward to use with anything but mbufs or similar buffers where we
> >>>>>> already have a pre-computed physical address.
> >>>>>
> >>>>> The iova is an hint to application, and widely used in DPDK.  If
> >>>>> switch to void, how to pass the address (iova or just va ?) this
> >>>>> may introduce implementation dependencies here.
> >>>>>
> >>>>> Or always pass the va, and the driver performs address translation,
> >>>>> and this translation may cost too much cpu I think.
> >>>>>
> >>>>
> >>>> On the latter point, about driver doing address translation I would
> >>>> agree.  However, we probably need more discussion about the use of
> >>>> iova vs just virtual addresses. My thinking on this is that if we
> >>>> specify the API using iovas it will severely hurt usability of the
> >>>> API, since it forces the user to take more inefficient codepaths in
> >>>> a large number of cases. Given a pointer to the middle of an mbuf,
> >>>> one cannot just pass that straight as an iova but must instead do a
> >>>> translation into offset from mbuf pointer and then readd the offset
> >>>> to the mbuf base address.
> >>>>
> >>>> My preference therefore is to require the use of an IOMMU when using
> >>>> a dmadev, so that it can be a much closer analog of memcpy. Once an
> >>>> iommu is present, DPDK will run in VA mode, allowing virtual
> >>>> addresses to our hugepage memory to be sent directly to hardware.
> >>>> Also, when using dmadevs on top of an in-kernel driver, that kernel
> >>>> driver may do all iommu management for the app, removing further the
> >>>> restrictions on what memory can be addressed by hardware.
> >>>
> >>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so
> >>> driver may should call rte_mem_virt2phy() do the address translate,
> >>> but the rte_mem_virt2phy() cost too many CPU cycles.
> >>>
> >>> If the API defined as iova, it will work fine in: 1) If DMA don't
> >>> support IOMMU or IOMMU bypass, then start application with
> >>> --iova-mode=pa 2) If DMA support IOMMU, --iova-mode=pa/va work both
> >>> fine
> >>>
> >>
> >> I suppose if we keep the iova as the datatype, we can just cast "void
> >> *" pointers to that in the case that virtual addresses can be used
> >> directly. I believe your RFC included a capability query API - "uses
> >> void * as iova" should probably be one of those capabilities, and that
> >> would resolve this.  If DPDK is in iova=va mode because of the
> >> presence of an iommu, all drivers could report this capability too.
> >>
> >>>>
> >>>>>> * Use of id values rather than user-provided handles. Allowing the
> >>>>>> user/app to manage the amount of data stored per operation is a
> >>>>>> better solution, I feel than proscribing a certain about of
> >>>>>> in-driver tracking. Some apps may not care about anything other
> >>>>>> than a job being completed, while other apps may have significant
> >>>>>> metadata to be tracked. Taking the user-context handles out of the
> >>>>>> API also makes the driver code simpler.
> >>>>>
> >>>>> The user-provided handle was mainly used to simply application
> >>>>> implementation, It provides the ability to quickly locate contexts.
> >>>>>
> >>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine
> >>>>> framework, user will get a unique dma_cookie after calling
> >>>>> dmaengine_submit(), and then could use it to call
> >>>>> dma_async_is_tx_complete() to get completion status.
> >>>>>
> >>>>
> >>>> Yes, the idea of the id is the same - to locate contexts. The main
> >>>> difference is that if we have the driver manage contexts or pointer
> >>>> to contexts, as well as giving more work to the driver, it
> >>>> complicates the APIs for measuring completions. If we use an
> >>>> ID-based approach, where the app maintains its own ring of contexts
> >>>> (if any), it avoids the need to have an "out" parameter array for
> >>>> returning those contexts, which needs to be appropriately sized.
> >>>> Instead we can just report that all ids up to N are completed. [This
> >>>> would be similar to your suggestion that N jobs be reported as done,
> >>>> in that no contexts are provided, it's just that knowing the ID of
> >>>> what is completed is generally more useful than the number (which
> >>>> can be obviously got by subtracting the old value)]
> >>>>
> >>>> We are still working on prototyping all this, but would hope to have
> >>>> a functional example of all this soon.
> >>>>
> >>>>> How about define the copy prototype as following: dma_cookie_t
> >>>>> rte_dmadev_copy(uint16_t dev_id, xxx) while the dma_cookie_t is
> >>>>> int32 and is monotonically increasing, when >=0 mean enqueue
> >>>>> successful else fail.  when complete the dmadev will return latest
> >>>>> completed dma_cookie, and the application could use the dma_cookie
> >>>>> to quick locate contexts.
> >>>>>
> >>>>
> >>>> If I understand this correctly, I believe this is largely what I was
> >>>> suggesting - just with the typedef for the type? In which case it
> >>>> obviously looks good to me.
> >>>>
> >>>>>> * I've kept a single combined API for completions, which differs
> >>>>>> from the separate error handling completion API you propose. I
> >>>>>> need to give the two function approach a bit of thought, but
> >>>>>> likely both could work. If we (likely) never expect failed ops,
> >>>>>> then the specifics of error handling should not matter that much.
> >>>>>
> >>>>> The rte_ioat_completed_ops API is too complex, and consider some
> >>>>> applications may never copy fail, so split them as two API.  It's
> >>>>> indeed not friendly to other scenarios that always require error
> >>>>> handling.
> >>>>>
> >>>>> I prefer use completed operations number as return value other than
> >>>>> the ID so that application could simple judge whether have new
> >>>>> completed operations, and the new prototype: uint16_t
> >>>>> rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie,
> >>>>> uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>
> >>>>> 1) for normal case which never expect failed ops: just call: ret =
> >>>>> rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL); 2) for other
> >>>>> case: ret = rte_dmadev_completed(dev_id, &cookie, &status,
> >>>>> max_status, &fails); at this point the fails <= ret <= max_status
> >>>>>
> >>>> Completely agree that we need to plan for the happy-day case where
> >>>> all is passing. Looking at the prototypes you have above, I am ok
> >>>> with returning number of completed ops as the return value with the
> >>>> final completed cookie as an "out" parameter.  For handling errors,
> >>>> I'm ok with what you propose above, just with one small adjustment -
> >>>> I would remove the restriction that ret <= max_status.
> >>>>
> >>>> In case of zero-failures, we can report as many ops succeeding as we
> >>>> like, and even in case of failure, we can still report as many
> >>>> successful ops as we like before we start filling in the status
> >>>> field. For example, if 32 ops are completed, and the last one fails,
> >>>> we can just fill in one entry into status, and return 32.
> >>>> Alternatively if the 4th last one fails we fill in 4 entries and
> >>>> return 32. The only requirements would be: * fails <= max_status *
> >>>> fails <= ret * cookie holds the id of the last entry in status.
> >>>
> >>> I think we understand the same:
> >>>
> >>> The fails <= ret <= max_status include following situation: 1) If
> >>> max_status is 32, and there are 32 completed ops, then the ret will
> >>> be 32 no matter which ops is failed 2) If max_status is 33, and there
> >>> are 32 completed ops, then the ret will be 32 3) If max_status is 16,
> >>> and there are 32 completed ops, then the ret will be 16
> >>>
> >>> and the cookie always hold the id of the last returned completed ops,
> >>> no matter it's completed successful or failed
> >>>
> >>
> >> I actually disagree on the #3. If max_status is 16, there are 32
> >> completed ops, and *no failures* the ret will be 32, not 16, because
> >> we are not returning any status entries so max_status need not apply.
> >> Keeping that same scenario #3, depending on the number of failures and
> >> the point of them, the return value may similarly vary, for example: *
> >> if job #28 fails, then ret could still be 32, cookie would be the
> >> cookie for that job, "fails" parameter would return as 4, with status
> >> holding the failure of 28 plus the succeeded status of jobs 29-31,
> >> i.e. 4 elements.  * if job #5 fails, then we can't fit the status list
> >> from 5 though 31 in an array of 16, so "fails" == 16(max_status) and
> >> status contains the 16 statuses starting from #5, which means that
> >> cookie contains the value for job #20 and ret is 21.
> >>
> >> In other words, ignore max_status and status parameters *unless we
> >> have an error to return*, meaning the fast-path/happy-day case works
> >> as fast as possible. You don't need to worry about sizing your status
> >> array to be big, and you always get back a large number of completions
> >> when available. Your fastpath code only need check the "fails"
> >> parameter to see if status needs to ever be consulted, and in normal
> >> case it doesn't.
> >>
> >> If this is too complicated, maybe we can simplify a little by
> >> returning just one failure at a time, though at the cost of making
> >> error handling slower?
> >>
> >> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>
> >> In this case, we always return the number of completed ops on success,
> >> while on failure, we return the first error code. For a single error,
> >> this works fine, but if we get a burst of errors together, things will
> >> work slower - which may be acceptable if errors are very rare.
> >> However, for idxd at least if a fence occurs after a failure all jobs
> >> in the batch after the fence would be skipped, which would lead to the
> >> "burst of errors" case.  Therefore, I'd prefer to have the original
> >> suggestion allowing multiple errors to be reported at a time.
> >>
> >> /Bruce
> > 
> > Apologies for self-reply, but thinking about it more, a combination of
> > normal-case and error-case APIs may be just simpler:
> > 
> > int rte_dmadev_completed(dev_id, &cookie)
> > 
> > returns number of items completed and cookie of last item. If there is
> > an error, returns all successfull values up to the error entry and
> > returns -1 on subsequent call.
> > 
> > int rte_dmadev_completed_status(dev_id, &cookie, max_status,
> > status_array, &error_count)
> > 
> > this is a slower completion API which behaves like you originally said
> > above, returning number of completions x, 0 <= x <= max_status, with x
> > status values filled into array, and the number of unsuccessful values
> > in the error_count value.
> > 
> > This would allow code to be written in the application to use
> > rte_dmadev_completed() in the normal case, and on getting a "-1" value,
> > use rte_dmadev_completed_status() to get the error details. If strings
> > of errors might be expected, the app can continually use the
> > completed_status() function until error_count returns 0, and then
> > switch back to the faster/simpler version.
> 
> This two-function simplify the status_array's maintenance because we
> don't need init it to zero.  I think it's a good trade-off between
> performance and rich error info (status code).
> 
> Here I'd like to discuss the 'burst size', which is widely used in DPDK
> application (e.g.  nic polling or ring en/dequeue).  Currently we don't
> define a max completed ops in rte_dmadev_completed() API, the return
> value may greater than 'burst size' of application, this may result in
> the application need to maintain (or remember) the return value of the
> function and special handling at the next poll.
> 
> Also consider there may multiple calls rte_dmadev_completed to check
> fail, it may make it difficult for the application to use.
> 
> So I prefer following prototype: uint16_t rte_dmadev_completed(uint16_t
> dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error) --
> nb_cpls: indicate max process operations number -- has_error: indicate if
> there is an error -- return value: the number of successful completed
> operations.  -- example: 1) If there are already 32 completed ops, and
> 4th is error, and nb_cpls is 32, then the ret will be 3(because 1/2/3th
> is OK), and has_error will be true.  2) If there are already 32 completed
> ops, and all successful completed, then the ret will be min(32, nb_cpls),
> and has_error will be false.  3) If there are already 32 completed ops,
> and all failed completed, then the ret will be 0, and has_error will be
> true.  uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t
> *cookie, uint16_t nb_status, uint32_t *status) -- return value: the
> number of failed completed operations.
> 
> The application use the following invocation order when polling:
> has_error = false; // could be init to false by dmadev API, we need
> discuss ret = rte_dmadev_completed(dev_id, &cookie, bust_size,
> &has_error); // process successful completed case: for (int i = 0; i <
> ret; i++) { } if (unlikely(has_error)) { // process failed completed case
> ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret,
> status_array); for (int i = 0; i < ret; i++) { // ...  } }
>
Seems reasonable. Let's go with this as an option for now - I just want to
check for all these the perf impacts to the offload cost. Once I get our
prototype working with some hardware (hopefully very soon) I can check this
out directly.