Re: [dpdk-dev] [PATCH v2] doc/compress: clarify error handling on data-plane

From: "Trahe, Fiona" <fiona.trahe@intel.com>
To: Shally Verma <shallyv@marvell.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "akhil.goyal@nxp.com" <akhil.goyal@nxp.com>,
	Ashish Gupta <ashishg@marvell.com>,
	"Daly, Lee" <lee.daly@intel.com>, Sunila Sahu <ssahu@marvell.com>,
	"stable@dpdk.org" <stable@dpdk.org>,
	"Trahe, Fiona" <fiona.trahe@intel.com>
Subject: Re: [dpdk-dev] [PATCH v2] doc/compress: clarify error handling on data-plane
Date: Tue, 30 Apr 2019 16:33:30 +0000	[thread overview]
Message-ID: <348A99DA5F5B7549AA880327E580B4358974F5BC@IRSMSX101.ger.corp.intel.com> (raw)
In-Reply-To: <BN6PR1801MB20521B48C169DA254FF78019AD260@BN6PR1801MB2052.namprd18.prod.outlook.com>

Hi Shally,

> -----Original Message-----
> From: Shally Verma [mailto:shallyv@marvell.com]
> Sent: Thursday, April 18, 2019 1:12 PM
> To: Trahe, Fiona <fiona.trahe@intel.com>; dev@dpdk.org
> Cc: akhil.goyal@nxp.com; Ashish Gupta <ashishg@marvell.com>; Daly, Lee <lee.daly@intel.com>; Sunila
> Sahu <ssahu@marvell.com>; stable@dpdk.org
> Subject: RE: [PATCH v2] doc/compress: clarify error handling on data-plane
> 
> Hi Fiona,
> 
> 
> > -----Original Message-----
> > From: Fiona Trahe <fiona.trahe@intel.com>
> > Sent: Tuesday, April 9, 2019 8:26 PM
> > To: dev@dpdk.org
> > Cc: akhil.goyal@nxp.com; Ashish Gupta <ashishg@marvell.com>;
> > lee.daly@intel.com; Sunila Sahu <ssahu@marvell.com>; Shally Verma
> > <shallyv@marvell.com>; Fiona Trahe <fiona.trahe@intel.com>;
> > stable@dpdk.org
> > Subject: [PATCH v2] doc/compress: clarify error handling on data-plane
> >
> > Fixed some typos and clarified which errors should be returned when and
> > why on the enqueue and dequeue APIs.
> >
> > Fixes: a584d3bea902 ("doc: add compressdev library guide")
> > cc: stable@dpdk.org
> >
> > Signed-off-by: Fiona Trahe <fiona.trahe@intel.com>
> > ---
> > v2 changes:
> >  - changed "0 or undefined" to just "undefined" as 0 is superfluous.
> >
> >
> >  doc/guides/prog_guide/compressdev.rst | 46
> > ++++++++++++++++++++++++++++++++---
> >  1 file changed, 43 insertions(+), 3 deletions(-)
> >
> > diff --git a/doc/guides/prog_guide/compressdev.rst
> > b/doc/guides/prog_guide/compressdev.rst
> > index ad9703753..c700dd103 100644
> > --- a/doc/guides/prog_guide/compressdev.rst
> > +++ b/doc/guides/prog_guide/compressdev.rst
> > @@ -201,7 +201,7 @@ for stateful processing of ops.
> >  Operation Status
> >  ~~~~~~~~~~~~~~~~
> >  Each operation carries a status information updated by PMD after it is
> > processed.
> > -following are currently supported status:
> > +Following are currently supported:
> >
> >  - RTE_COMP_OP_STATUS_SUCCESS,
> >      Operation is successfully completed @@ -227,14 +227,54 @@ following are
> > currently supported status:
> >      is not an error case. Output data up to op.produced can be used and
> >      next op in the stream should continue on from op.consumed+1.
> >
> > +Operation status after enqueue / dequeue
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +Some of the above values will only arise in the op after an
> > +``rte_compressdev_enqueue_burst()``, some only after an
> > +``rte_compressdev_dequeue_burst()``. For optimal performance on the
> > +data-plane an application is not expected to check the ``op.status`` of
> > +all ops after both enqueue and dequeue, it should be sufficient to only
> > +check after dequeue. To facilitate this optimisation, most errors which
> > +may reasonably be expected to occur in a production environment will be
> > returned by the PMD on the ``dequeue``.
> > +op.status may hold the following values after dequeue:
> > +
> > +- RTE_COMP_OP_STATUS_SUCCESS
> > +- RTE_COMP_OP_STATUS_ERROR
> > +- RTE_COMP_OP_STATUS_OUT_OF_SPACE_TERMINATED
> > +- RTE_COMP_OP_STATUS_OUT_OF_SPACE_RECOVERABLE
> > +
> > +There are some exceptions whereby errors can occur on the ``enqueue``.
> > +For any error which can occur in a production environment and can be
> > +successful after a retry with the same op the PMD may return the error
> > +on the enqueue.
> This statement looks bit confusing.
> Seems like we are trying to add a description regarding op status check even after the enqueue call unlike
> current scenario, where app only check for it
> after dequeue?
[Fiona] The line following this explains that there is no need to check op.status in this case.
Maybe it's not obvious that the application SHOULD check that all ops are enqueued?
I can reword as:
The application should always check the value returned by the enqueue.
If less than the full burst is enqueued there's no
need for the application to check op.status of any or every op - it can
simply retry from the return value+1 in a later enqueue and expect success.

If this isn't clear, can you suggest other wording - or expand on what's not unclear.

> So if less than the full burst is enqueued there's no
> > +need for the application to check op.status - the application can
> > +simply retry in a later enqueue and expect success. Though the application
> > is not expected to check for these, the values are as follows:
> > +
> > +- RTE_COMP_OP_STATUS_NOT_PROCESSED  - could occur if a hardware
> > device's queue is full, after a dequeue a retry of the enqueue can be
> > successful.
> > +
> > +- RTE_COMP_OP_STATUS_ERROR - could occur due to out-of-memory or
> > other transient condition which could clear after a time.
> > +
> > +Other errors may also occur on an ``enqueue``, but they are only
> > +expected to arise during development. As a retry with the same op won't
> > +be successful, if a performant application wants to avoid checking
> > +op.status on the enqueue it should ensure these never arise in a
> > +production environment, e.g. by checking device capabilities and validating
> > input parameters before sending operations. Examples are:
> > +
> > +- RTE_COMP_OP_STATUS_INVALID_ARGS
> > +- RTE_COMP_OP_STATUS_ERROR (if due to a condition which is not
> > +transient)
> How does app identify error is transient or permanent?
[Fiona] The app doesn't - it's up to the PMD to decide what the appropriate
return is using the logic described above. If the PMD encounters an error that's 
not transient, and can be reasonably expected in a production environment,
then it should forward the error to the dequeue - where it expects the application
to check for it.
Should I add this note at the end of this section?

> 
> > +- RTE_COMP_OP_STATUS_INVALID_STATE
> > +
> > +If an application doesn't safeguard against these AND doesn't check the
> > +op.status of the next op which was not enqueued, but just retries, it could
> > result in an infinite loop.
> > +
> May be , you are trying to insist whenever the number_enqueued ops < number actually enqueued , then
> caller should check for status of num_enqueued + 1th op to know exact reason, and
> take corrective measure before re-enqueuing it for following status condition:
> 1. INVALID_ARGS
> 2. INVALID_STATE
> 3 ERROR
> 
> Is that correct?
[Fiona] Only if the application has a slow-path for debug in non-production code.
In production code, for high-performance, I'm saying it's reasonable not to do this check.

The reason for calling this out is we did have cases where errors were being returned on the enqueue,
But the performance tool was not checking op.status of enqueued+1 and so getting into an infinite loop.
Seems reasonable that a performant application would also not check.
We've removed those cases from QAT PMD, and behave according to above logic.
But  it's not described explicitly in this detail on the API - so think it's worth calling out this expectation.
Does it make sense?

> 
> 
> >  Produced, Consumed And Operation Status
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >  - If status is RTE_COMP_OP_STATUS_SUCCESS,
> >      consumed = amount of data read from input buffer, and
> >      produced = amount of data written in destination buffer
> > -- If status is RTE_COMP_OP_STATUS_FAILURE,
> > -    consumed = produced = 0 or undefined
> > +- If status is RTE_COMP_OP_STATUS_ERROR,
> > +    consumed = produced = undefined
> >  - If status is RTE_COMP_OP_STATUS_OUT_OF_SPACE_TERMINATED,
> >      consumed = 0 and
> >      produced = usually 0, but in decompression cases a PMD may return > 0
> > --
> Acked.
> 
> > 2.13.6