All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bruce Richardson <bruce.richardson@intel.com>
To: fengchengwen <fengchengwen@huawei.com>
Cc: thomas@monjalon.net, ferruh.yigit@intel.com, dev@dpdk.org,
	nipun.gupta@nxp.com, hemant.agrawal@nxp.com,
	maxime.coquelin@redhat.com, honnappa.nagarahalli@arm.com,
	jerinj@marvell.com, david.marchand@redhat.com,
	jerinjacobk@gmail.com
Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
Date: Fri, 18 Jun 2021 10:30:55 +0100	[thread overview]
Message-ID: <YMxnz5EwXL3x07Yd@bricha3-MOBL.ger.corp.intel.com> (raw)
In-Reply-To: <25d29598-c26d-8497-2867-9b650c79df49@huawei.com>

On Fri, Jun 18, 2021 at 04:52:00PM +0800, fengchengwen wrote:
> On 2021/6/17 22:18, Bruce Richardson wrote:
> > On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA
> >>>>>>> device.
> >>>>>>>
> >>>>>>> The APIs of dmadev library exposes some generic operations which
> >>>>>>> can enable configuration and I/O with the DMA devices.
> >>>>>>>
> >>>>>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com> ---
> >>>>>> Thanks for sending this.
> >>>>>>
> >>>>>> Of most interest to me right now are the key data-plane APIs.
> >>>>>> While we are still in the prototyping phase, below is a draft of
> >>>>>> what we are thinking for the key enqueue/perform_ops/completed_ops
> >>>>>> APIs.
> >>>>>>
> >>>>>> Some key differences I note in below vs your original RFC: * Use
> >>>>>> of void pointers rather than iova addresses. While using iova's
> >>>>>> makes sense in the general case when using hardware, in that it
> >>>>>> can work with both physical addresses and virtual addresses, if we
> >>>>>> change the APIs to use void pointers instead it will still work
> >>>>>> for DPDK in VA mode, while at the same time allow use of software
> >>>>>> fallbacks in error cases, and also a stub driver than uses memcpy
> >>>>>> in the background. Finally, using iova's makes the APIs a lot more
> >>>>>> awkward to use with anything but mbufs or similar buffers where we
> >>>>>> already have a pre-computed physical address.
> >>>>>
> >>>>> The iova is an hint to application, and widely used in DPDK.  If
> >>>>> switch to void, how to pass the address (iova or just va ?) this
> >>>>> may introduce implementation dependencies here.
> >>>>>
> >>>>> Or always pass the va, and the driver performs address translation,
> >>>>> and this translation may cost too much cpu I think.
> >>>>>
> >>>>
> >>>> On the latter point, about driver doing address translation I would
> >>>> agree.  However, we probably need more discussion about the use of
> >>>> iova vs just virtual addresses. My thinking on this is that if we
> >>>> specify the API using iovas it will severely hurt usability of the
> >>>> API, since it forces the user to take more inefficient codepaths in
> >>>> a large number of cases. Given a pointer to the middle of an mbuf,
> >>>> one cannot just pass that straight as an iova but must instead do a
> >>>> translation into offset from mbuf pointer and then readd the offset
> >>>> to the mbuf base address.
> >>>>
> >>>> My preference therefore is to require the use of an IOMMU when using
> >>>> a dmadev, so that it can be a much closer analog of memcpy. Once an
> >>>> iommu is present, DPDK will run in VA mode, allowing virtual
> >>>> addresses to our hugepage memory to be sent directly to hardware.
> >>>> Also, when using dmadevs on top of an in-kernel driver, that kernel
> >>>> driver may do all iommu management for the app, removing further the
> >>>> restrictions on what memory can be addressed by hardware.
> >>>
> >>> Some DMA devices many don't support IOMMU or IOMMU bypass default, so
> >>> driver may should call rte_mem_virt2phy() do the address translate,
> >>> but the rte_mem_virt2phy() cost too many CPU cycles.
> >>>
> >>> If the API defined as iova, it will work fine in: 1) If DMA don't
> >>> support IOMMU or IOMMU bypass, then start application with
> >>> --iova-mode=pa 2) If DMA support IOMMU, --iova-mode=pa/va work both
> >>> fine
> >>>
> >>
> >> I suppose if we keep the iova as the datatype, we can just cast "void
> >> *" pointers to that in the case that virtual addresses can be used
> >> directly. I believe your RFC included a capability query API - "uses
> >> void * as iova" should probably be one of those capabilities, and that
> >> would resolve this.  If DPDK is in iova=va mode because of the
> >> presence of an iommu, all drivers could report this capability too.
> >>
> >>>>
> >>>>>> * Use of id values rather than user-provided handles. Allowing the
> >>>>>> user/app to manage the amount of data stored per operation is a
> >>>>>> better solution, I feel than proscribing a certain about of
> >>>>>> in-driver tracking. Some apps may not care about anything other
> >>>>>> than a job being completed, while other apps may have significant
> >>>>>> metadata to be tracked. Taking the user-context handles out of the
> >>>>>> API also makes the driver code simpler.
> >>>>>
> >>>>> The user-provided handle was mainly used to simply application
> >>>>> implementation, It provides the ability to quickly locate contexts.
> >>>>>
> >>>>> The "use of id values" seem like the dma_cookie of Linux DMA engine
> >>>>> framework, user will get a unique dma_cookie after calling
> >>>>> dmaengine_submit(), and then could use it to call
> >>>>> dma_async_is_tx_complete() to get completion status.
> >>>>>
> >>>>
> >>>> Yes, the idea of the id is the same - to locate contexts. The main
> >>>> difference is that if we have the driver manage contexts or pointer
> >>>> to contexts, as well as giving more work to the driver, it
> >>>> complicates the APIs for measuring completions. If we use an
> >>>> ID-based approach, where the app maintains its own ring of contexts
> >>>> (if any), it avoids the need to have an "out" parameter array for
> >>>> returning those contexts, which needs to be appropriately sized.
> >>>> Instead we can just report that all ids up to N are completed. [This
> >>>> would be similar to your suggestion that N jobs be reported as done,
> >>>> in that no contexts are provided, it's just that knowing the ID of
> >>>> what is completed is generally more useful than the number (which
> >>>> can be obviously got by subtracting the old value)]
> >>>>
> >>>> We are still working on prototyping all this, but would hope to have
> >>>> a functional example of all this soon.
> >>>>
> >>>>> How about define the copy prototype as following: dma_cookie_t
> >>>>> rte_dmadev_copy(uint16_t dev_id, xxx) while the dma_cookie_t is
> >>>>> int32 and is monotonically increasing, when >=0 mean enqueue
> >>>>> successful else fail.  when complete the dmadev will return latest
> >>>>> completed dma_cookie, and the application could use the dma_cookie
> >>>>> to quick locate contexts.
> >>>>>
> >>>>
> >>>> If I understand this correctly, I believe this is largely what I was
> >>>> suggesting - just with the typedef for the type? In which case it
> >>>> obviously looks good to me.
> >>>>
> >>>>>> * I've kept a single combined API for completions, which differs
> >>>>>> from the separate error handling completion API you propose. I
> >>>>>> need to give the two function approach a bit of thought, but
> >>>>>> likely both could work. If we (likely) never expect failed ops,
> >>>>>> then the specifics of error handling should not matter that much.
> >>>>>
> >>>>> The rte_ioat_completed_ops API is too complex, and consider some
> >>>>> applications may never copy fail, so split them as two API.  It's
> >>>>> indeed not friendly to other scenarios that always require error
> >>>>> handling.
> >>>>>
> >>>>> I prefer use completed operations number as return value other than
> >>>>> the ID so that application could simple judge whether have new
> >>>>> completed operations, and the new prototype: uint16_t
> >>>>> rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie,
> >>>>> uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>
> >>>>> 1) for normal case which never expect failed ops: just call: ret =
> >>>>> rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL); 2) for other
> >>>>> case: ret = rte_dmadev_completed(dev_id, &cookie, &status,
> >>>>> max_status, &fails); at this point the fails <= ret <= max_status
> >>>>>
> >>>> Completely agree that we need to plan for the happy-day case where
> >>>> all is passing. Looking at the prototypes you have above, I am ok
> >>>> with returning number of completed ops as the return value with the
> >>>> final completed cookie as an "out" parameter.  For handling errors,
> >>>> I'm ok with what you propose above, just with one small adjustment -
> >>>> I would remove the restriction that ret <= max_status.
> >>>>
> >>>> In case of zero-failures, we can report as many ops succeeding as we
> >>>> like, and even in case of failure, we can still report as many
> >>>> successful ops as we like before we start filling in the status
> >>>> field. For example, if 32 ops are completed, and the last one fails,
> >>>> we can just fill in one entry into status, and return 32.
> >>>> Alternatively if the 4th last one fails we fill in 4 entries and
> >>>> return 32. The only requirements would be: * fails <= max_status *
> >>>> fails <= ret * cookie holds the id of the last entry in status.
> >>>
> >>> I think we understand the same:
> >>>
> >>> The fails <= ret <= max_status include following situation: 1) If
> >>> max_status is 32, and there are 32 completed ops, then the ret will
> >>> be 32 no matter which ops is failed 2) If max_status is 33, and there
> >>> are 32 completed ops, then the ret will be 32 3) If max_status is 16,
> >>> and there are 32 completed ops, then the ret will be 16
> >>>
> >>> and the cookie always hold the id of the last returned completed ops,
> >>> no matter it's completed successful or failed
> >>>
> >>
> >> I actually disagree on the #3. If max_status is 16, there are 32
> >> completed ops, and *no failures* the ret will be 32, not 16, because
> >> we are not returning any status entries so max_status need not apply.
> >> Keeping that same scenario #3, depending on the number of failures and
> >> the point of them, the return value may similarly vary, for example: *
> >> if job #28 fails, then ret could still be 32, cookie would be the
> >> cookie for that job, "fails" parameter would return as 4, with status
> >> holding the failure of 28 plus the succeeded status of jobs 29-31,
> >> i.e. 4 elements.  * if job #5 fails, then we can't fit the status list
> >> from 5 though 31 in an array of 16, so "fails" == 16(max_status) and
> >> status contains the 16 statuses starting from #5, which means that
> >> cookie contains the value for job #20 and ret is 21.
> >>
> >> In other words, ignore max_status and status parameters *unless we
> >> have an error to return*, meaning the fast-path/happy-day case works
> >> as fast as possible. You don't need to worry about sizing your status
> >> array to be big, and you always get back a large number of completions
> >> when available. Your fastpath code only need check the "fails"
> >> parameter to see if status needs to ever be consulted, and in normal
> >> case it doesn't.
> >>
> >> If this is too complicated, maybe we can simplify a little by
> >> returning just one failure at a time, though at the cost of making
> >> error handling slower?
> >>
> >> rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>
> >> In this case, we always return the number of completed ops on success,
> >> while on failure, we return the first error code. For a single error,
> >> this works fine, but if we get a burst of errors together, things will
> >> work slower - which may be acceptable if errors are very rare.
> >> However, for idxd at least if a fence occurs after a failure all jobs
> >> in the batch after the fence would be skipped, which would lead to the
> >> "burst of errors" case.  Therefore, I'd prefer to have the original
> >> suggestion allowing multiple errors to be reported at a time.
> >>
> >> /Bruce
> > 
> > Apologies for self-reply, but thinking about it more, a combination of
> > normal-case and error-case APIs may be just simpler:
> > 
> > int rte_dmadev_completed(dev_id, &cookie)
> > 
> > returns number of items completed and cookie of last item. If there is
> > an error, returns all successfull values up to the error entry and
> > returns -1 on subsequent call.
> > 
> > int rte_dmadev_completed_status(dev_id, &cookie, max_status,
> > status_array, &error_count)
> > 
> > this is a slower completion API which behaves like you originally said
> > above, returning number of completions x, 0 <= x <= max_status, with x
> > status values filled into array, and the number of unsuccessful values
> > in the error_count value.
> > 
> > This would allow code to be written in the application to use
> > rte_dmadev_completed() in the normal case, and on getting a "-1" value,
> > use rte_dmadev_completed_status() to get the error details. If strings
> > of errors might be expected, the app can continually use the
> > completed_status() function until error_count returns 0, and then
> > switch back to the faster/simpler version.
> 
> This two-function simplify the status_array's maintenance because we
> don't need init it to zero.  I think it's a good trade-off between
> performance and rich error info (status code).
> 
> Here I'd like to discuss the 'burst size', which is widely used in DPDK
> application (e.g.  nic polling or ring en/dequeue).  Currently we don't
> define a max completed ops in rte_dmadev_completed() API, the return
> value may greater than 'burst size' of application, this may result in
> the application need to maintain (or remember) the return value of the
> function and special handling at the next poll.
> 
> Also consider there may multiple calls rte_dmadev_completed to check
> fail, it may make it difficult for the application to use.
> 
> So I prefer following prototype: uint16_t rte_dmadev_completed(uint16_t
> dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error) --
> nb_cpls: indicate max process operations number -- has_error: indicate if
> there is an error -- return value: the number of successful completed
> operations.  -- example: 1) If there are already 32 completed ops, and
> 4th is error, and nb_cpls is 32, then the ret will be 3(because 1/2/3th
> is OK), and has_error will be true.  2) If there are already 32 completed
> ops, and all successful completed, then the ret will be min(32, nb_cpls),
> and has_error will be false.  3) If there are already 32 completed ops,
> and all failed completed, then the ret will be 0, and has_error will be
> true.  uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t
> *cookie, uint16_t nb_status, uint32_t *status) -- return value: the
> number of failed completed operations.
> 
> The application use the following invocation order when polling:
> has_error = false; // could be init to false by dmadev API, we need
> discuss ret = rte_dmadev_completed(dev_id, &cookie, bust_size,
> &has_error); // process successful completed case: for (int i = 0; i <
> ret; i++) { } if (unlikely(has_error)) { // process failed completed case
> ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret,
> status_array); for (int i = 0; i < ret; i++) { // ...  } }
>
Seems reasonable. Let's go with this as an option for now - I just want to
check for all these the perf impacts to the offload cost. Once I get our
prototype working with some hardware (hopefully very soon) I can check this
out directly.

  reply	other threads:[~2021-06-18  9:31 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-15 13:22 [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library Chengwen Feng
2021-06-15 16:38 ` Bruce Richardson
2021-06-16  7:09   ` Morten Brørup
2021-06-16 10:17     ` fengchengwen
2021-06-16 12:09       ` Morten Brørup
2021-06-16 13:06       ` Bruce Richardson
2021-06-16 14:37       ` Jerin Jacob
2021-06-17  9:15         ` Bruce Richardson
2021-06-18  5:52           ` Jerin Jacob
2021-06-18  9:41             ` fengchengwen
2021-06-22 17:25               ` Jerin Jacob
2021-06-23  3:30                 ` fengchengwen
2021-06-23  7:21                   ` Jerin Jacob
2021-06-23  9:37                     ` Bruce Richardson
2021-06-23 11:40                       ` Jerin Jacob
2021-06-23 14:19                         ` Bruce Richardson
2021-06-24  6:49                           ` Jerin Jacob
2021-06-23  9:41                 ` Bruce Richardson
2021-06-23 10:10                   ` Morten Brørup
2021-06-23 11:46                   ` Jerin Jacob
2021-06-23 14:22                     ` Bruce Richardson
2021-06-18  9:55             ` Bruce Richardson
2021-06-22 17:31               ` Jerin Jacob
2021-06-22 19:17                 ` Bruce Richardson
2021-06-23  7:00                   ` Jerin Jacob
2021-06-16  9:41   ` fengchengwen
2021-06-16 17:31     ` Bruce Richardson
2021-06-16 18:08       ` Jerin Jacob
2021-06-16 19:13         ` Bruce Richardson
2021-06-17  7:42           ` Jerin Jacob
2021-06-17  8:00             ` Bruce Richardson
2021-06-18  5:16               ` Jerin Jacob
2021-06-18 10:03                 ` Bruce Richardson
2021-06-22 17:36                   ` Jerin Jacob
2021-06-17  9:48       ` fengchengwen
2021-06-17 11:02         ` Bruce Richardson
2021-06-17 14:18           ` Bruce Richardson
2021-06-18  8:52             ` fengchengwen
2021-06-18  9:30               ` Bruce Richardson [this message]
2021-06-22 17:51               ` Jerin Jacob
2021-06-23  3:50                 ` fengchengwen
2021-06-23 11:00                   ` Jerin Jacob
2021-06-23 14:56                   ` Bruce Richardson
2021-06-24 12:19                     ` fengchengwen
2021-06-26  3:59                       ` [dpdk-dev] dmadev discussion summary fengchengwen
2021-06-28 10:00                         ` Bruce Richardson
2021-06-28 11:14                           ` Ananyev, Konstantin
2021-06-28 12:53                             ` Bruce Richardson
2021-07-02 13:31                           ` fengchengwen
2021-07-01 15:01                         ` Jerin Jacob
2021-07-01 16:33                           ` Bruce Richardson
2021-07-02  7:39                             ` Morten Brørup
2021-07-02 10:05                               ` Bruce Richardson
2021-07-02 13:45                           ` fengchengwen
2021-07-02 14:57                             ` Morten Brørup
2021-07-03  0:32                               ` fengchengwen
2021-07-03  8:53                                 ` Morten Brørup
2021-07-03  9:08                                   ` Jerin Jacob
2021-07-03 12:24                                     ` Morten Brørup
2021-07-04  7:43                                       ` Jerin Jacob
2021-07-05 10:28                                         ` Morten Brørup
2021-07-06  7:11                                           ` fengchengwen
2021-07-03  9:45                                   ` fengchengwen
2021-07-03 12:00                                     ` Morten Brørup
2021-07-04  7:34                                       ` Jerin Jacob
2021-07-02  7:07                         ` Liang Ma
2021-07-02 13:59                           ` fengchengwen
2021-06-24  7:03                   ` [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library Jerin Jacob
2021-06-24  7:59                     ` Morten Brørup
2021-06-24  8:05                       ` Jerin Jacob
2021-06-23  5:34       ` Hu, Jiayu
2021-06-23 11:07         ` Jerin Jacob
2021-06-16  2:17 ` Wang, Haiyue
2021-06-16  8:04   ` Bruce Richardson
2021-06-16  8:16     ` Wang, Haiyue
2021-06-16 12:14 ` David Marchand
2021-06-16 13:11   ` Bruce Richardson
2021-06-16 16:48     ` Honnappa Nagarahalli
2021-06-16 19:10       ` Bruce Richardson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YMxnz5EwXL3x07Yd@bricha3-MOBL.ger.corp.intel.com \
    --to=bruce.richardson@intel.com \
    --cc=david.marchand@redhat.com \
    --cc=dev@dpdk.org \
    --cc=fengchengwen@huawei.com \
    --cc=ferruh.yigit@intel.com \
    --cc=hemant.agrawal@nxp.com \
    --cc=honnappa.nagarahalli@arm.com \
    --cc=jerinj@marvell.com \
    --cc=jerinjacobk@gmail.com \
    --cc=maxime.coquelin@redhat.com \
    --cc=nipun.gupta@nxp.com \
    --cc=thomas@monjalon.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.