From mboxrd@z Thu Jan  1 00:00:00 1970
From: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Subject: Re: [RFC] Generic flow director/filtering/classification
	API
Date: Tue, 19 Jul 2016 15:12:19 +0200
Message-ID: <20160719131219.GK7621@6wind.com>
References: <20160705181646.GO7621@6wind.com>
 <6A0DE07E22DDAD4C9103DF62FEBC09090348E1A7@shsmsx102.ccr.corp.intel.com>
 <20160707102650.GU7621@6wind.com>
 <6A0DE07E22DDAD4C9103DF62FEBC090903492563@shsmsx102.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: "dev@dpdk.org" <dev@dpdk.org>, Thomas Monjalon <thomas.monjalon@6wind.com>,
 "Zhang, Helin" <helin.zhang@intel.com>,
 "Wu, Jingjing" <jingjing.wu@intel.com>,
 Rasesh Mody <rasesh.mody@qlogic.com>,
 Ajit Khaparde <ajit.khaparde@broadcom.com>,
 Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>,
 Jan Medala <jan@semihalf.com>, John Daley <johndale@cisco.com>,
 "Chen, Jing D" <jing.d.chen@intel.com>,
 "Ananyev, Konstantin" <konstantin.ananyev@intel.com>,
 Matej Vido <matejvido@gmail.com>,
 Alejandro Lucero <alejandro.lucero@netronome.com>,
 Sony Chacko <sony.chacko@qlogic.com>,
 Jerin Jacob <jerin.jacob@caviumnetworks.com>,
 "De Lara Guarch, Pablo" <pablo.de.lara.guarch@intel.com>,
 Olga Shern <olgas@mellanox.com>
To: "Lu, Wenzhuo" <wenzhuo.lu@intel.com>
Return-path: <dev-bounces@dpdk.org>
Received: from mail-wm0-f51.google.com (mail-wm0-f51.google.com [74.125.82.51])
 by dpdk.org (Postfix) with ESMTP id 22384F72
 for <dev@dpdk.org>; Tue, 19 Jul 2016 15:12:25 +0200 (CEST)
Received: by mail-wm0-f51.google.com with SMTP id o80so25917241wme.1
 for <dev@dpdk.org>; Tue, 19 Jul 2016 06:12:25 -0700 (PDT)
Content-Disposition: inline
In-Reply-To: <6A0DE07E22DDAD4C9103DF62FEBC090903492563@shsmsx102.ccr.corp.intel.com>
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

On Tue, Jul 19, 2016 at 08:11:48AM +0000, Lu, Wenzhuo wrote:
> Hi Adrien,
> Thanks for your clarification.  Most of my questions are clear, but sti=
ll something may need to be discussed, comment below.

Hi Wenzhuo,

Please see below.

[...]
> > > > Requirements for a new API:
> > > >
> > > > - Flexible and extensible without causing API/ABI problems for ex=
isting
> > > >   applications.
> > > > - Should be unambiguous and easy to use.
> > > > - Support existing filtering features and actions listed in `Filt=
er types`_.
> > > > - Support packet alteration.
> > > > - In case of overlapping filters, their priority should be well d=
ocumented.
> > > Does that mean we don't guarantee the consistent of priority? The p=
riority can
> > be different on different NICs. So the behavior of the actions  can b=
e different.
> > Right?
> >=20
> > No, the intent is precisely to define what happens in order to get a =
consistent
> > result across different devices, and document cases with undefined be=
havior.
> > There must be no room left for interpretation.
> >=20
> > For example, the API must describe what happens when two overlapping =
filters
> > (e.g. one matching an Ethernet header, another one matching an IP hea=
der)
> > match a given packet at a given priority level.
> >=20
> > It is documented in section 4.1.1 (priorities) as "undefined behavior=
".
> > Applications remain free to do it and deal with consequences, at leas=
t they know
> > they cannot expect a consistent outcome, unless they use different pr=
iority
> > levels for both rules, see also 4.4.5 (flow rules priority).
> >=20
> > > Seems the users still need to aware the some details of the HW? Do =
we need
> > to add the negotiation for the priority?
> >=20
> > Priorities as defined in this document may not be directly mappable t=
o HW
> > capabilities (e.g. HW does not support enough priorities, or that som=
e corner
> > case make them not work as described), in which case the PMD may choo=
se to
> > simulate priorities (again 4.4.5), as long as the end result follows =
the
> > specification.
> >=20
> > So users must not be aware of some HW details, the PMD does and must
> > perform the needed workarounds to suit their expectations. Users may =
only be
> > impacted by errors while attempting to create rules that are either u=
nsupported
> > or would cause them (or existing rules) to diverge from the spec.
> The problem is sometime the priority of the filters is fixed according =
to
> > HW's implementation. For example, on ixgbe, n-tuple has a higher
> > priority than flow director.

As a side note I did not know that N-tuple had a higher priority than flo=
w
director on ixgbe, priorities among filter types do not seem to be
documented at all in DPDK. This is one of the reasons I think we need a
generic API to handle flow configuration.

So, today an application cannot combine N-tuple and FDIR flow rules and g=
et
a reliable outcome, unless it is designed for specific devices with a kno=
wn
behavior.

> What's the right behavior of PMD if APP want to create a flow director =
rule which has a higher or even equal priority than an existing n-tuple r=
ule? Should PMD return fail?=20

First remember applications only deal with the generic API, PMDs are
responsible for choosing the most appropriate HW implementation to use
according to the requested flow rules (FDIR, N-tuple or anything else).

For the specific case of FDIR vs N-tuple, if the underlying HW supports b=
oth
I do not see why the PMD would create a N-tuple rule. Doesn't FDIR suppor=
t
everything N-tuple can do and much more?

Assuming such a thing happened anyway, that the PMD had to create a rule
using a high priority filter type and that the application requests the
creation of a rule that can only be done using a lower priority filter ty=
pe,
but also requested a higher priority for that rule, then yes, it should
obviously fail.

That is, unless the PMD can perform some kind of workaround to have both.

> If so, do we need more fail reasons? According to this RFC, I think we =
need return " EEXIST: collision with an existing rule. ", but it's not ve=
ry clear, APP doesn't know the problem is priority, maybe more detailed r=
eason is helpful.

Possibly, I've defined a basic set of errors, there are quite a number of
errno values to choose from. However I think we should not define too man=
y
values. In my opinion the basic set covers every possible failure:

- EINVAL: invalid format, rule is broken or cannot be understood by the P=
MD
  anyhow.

- ENOTSUP: pattern/actions look fine but something in the requested rule =
is
  not supported and thus cannot be applied.

- EEXIST: pattern/actions are fine and could have been applied if only so=
me
  other rule did not prevent the PMD to do it (I see it as the closest th=
ing
  to "ETOOBAD" which unfortunately does not exist).

- ENOMEM: like EEXIST, except it is due to the lack of resources not beca=
use
  of another rule. I wasn't sure which of ENOMEM or ENOSPC was better but
  settled on ENOMEM as it is well known. Still open to debate.

Errno values are only useful to get a rough idea of the reason, and anoth=
er
mechanism is needed to pinpoint the exact problem for debugging/reporting
purposes, something like:

 enum rte_flow_error_type {
     RTE_FLOW_ERROR_TYPE_NONE,
     RTE_FLOW_ERROR_TYPE_UNKNOWN,
     RTE_FLOW_ERROR_TYPE_PRIORITY,
     RTE_FLOW_ERROR_TYPE_PATTERN,
     RTE_FLOW_ERROR_TYPE_ACTION,
 };

 struct rte_flow_error {
     enum rte_flow_error_type type;
     void *offset; /* Points to the exact pattern item or action. */
     const char *message;
 };

Then either provide an optional struct rte_flow_error pointer to
rte_flow_validate(), or a separate function (rte_flow_analyze()?), since
processing this may be quite expensive and applications may not care abou=
t
the exact reason.

What do you suggest?

> > > > Behavior
> > > > --------
> > > >
> > > > - API operations are synchronous and blocking (``EAGAIN`` cannot =
be
> > > >   returned).
> > > >
> > > > - There is no provision for reentrancy/multi-thread safety, altho=
ugh nothing
> > > >   should prevent different devices from being configured at the s=
ame
> > > >   time. PMDs may protect their control path functions accordingly=
.
> > > >
> > > > - Stopping the data path (TX/RX) should not be necessary when man=
aging
> > flow
> > > >   rules. If this cannot be achieved naturally or with workarounds=
 (such as
> > > >   temporarily replacing the burst function pointers), an appropri=
ate error
> > > >   code must be returned (``EBUSY``).
> > > PMD cannot stop the data path without adding lock. So I think if so=
me rules
> > cannot be applied without stopping rx/tx, PMD has to return fail.
> > > Or let the APP to stop the data path.
> >=20
> > Agreed, that is the intent. If the PMD cannot touch flow rules for so=
me reason
> > even after trying really hard, then it just returns EBUSY.
> >=20
> > Perhaps we should write down that applications may get a different ou=
tcome
> > after stopping the data path if they get EBUSY?
> Agree, it's better to describe more about the APP. BTW, I checked the b=
ehavior of ixgbe/igb, I think we can add/delete filters during runtime. H=
opefully we'll not hit too many EBUSY problems on other NICs :)

OK, I will add it.

> > > > - PMDs, not applications, are responsible for maintaining flow ru=
les
> > > >   configuration when stopping and restarting a port or performing=
 other
> > > >   actions which may affect them. They can only be destroyed expli=
citly.
> > > Don=E2=80=99t understand " They can only be destroyed explicitly."
> >=20
> > This part says that as long as an application has not called
> > rte_flow_destroy() on a flow rule, it never disappears, whatever happ=
ens to the
> > port (stopped, restarted). The application is not responsible for re-=
creating rules
> > after that.
> >=20
> > Note that according to the specification, this may translate to not b=
eing able to
> > stop a port as long as a flow rule is present, depending on how nice =
the PMD
> > intends to be with applications. Implementation can be done in small =
steps with
> > minimal amount of code on the PMD side.
> Does it mean PMD should store and maintain all the rules? Why not let r=
te do that? I think if PMD maintain all the rules, it means every kind of=
 NIC should have a copy of code for the rules. But if rte do that, only o=
ne copy of code need to be maintained, right?

I've considered having rules stored in a common format understood at the =
RTE
level and not specific to each PMD and decided that the opaque rte_flow
pointer was a better choice for the following reasons:=20

- Even though flow rules management is done in the control path, processi=
ng
  must be as fast as possible. Letting PMDs store flow rules using their =
own
  internal representation gives them the chance to achieve better
  performance.

- An opaque context managed by PMDs would probably have to be stored
  somewhere as well anyway.

- PMDs may not need to allocate/store anything at all if they exclusively
  rely on HW state for everything. In my opinion, the generic API has eno=
ugh
  constraints for this to work and maintain consistency between flow
  rules. Note this is currently how most PMDs implement FDIR and other
  filter types.

- RTE can (and will) provide helpers to avoid most of the code redundancy=
,
  PMDs are free to use them or manage everything by themselves.

- Given that the opaque rte_flow pointer associated with a flow rule is t=
o
  be stored by the application, PMDs do not even have to keep references =
to
  them.

- The flow rules format described in this specification (pattern / action=
s)
  will be used by applications directly, and will be free to arrange them=
 in
  lists, trees or in any other way if they need to keep flow specificatio=
ns
  around for further processing.

> When the port is stopped and restarted, rte can reconfigure the rules. =
Is the concern that PMD may adjust the sequence of the rules according to=
 the priority, so every NIC has a different list of rules? But PMD can ad=
just them again when rte reconfiguring the rules.

What about PMDs able to stop and restart ports without destroying their o=
wn
flow rules? If we assume flow rules must be destroyed when stopping a por=
t,
these PMDs are needlessly penalized with slower stop/start cycles. Think
about it assuming thousands of flow rules.

Thus from an application point of view, whatever happens when stopping an=
d
restarting a port should not matter. If a flow rule was present before, i=
t
must still be present afterwards. If the PMD had to destroy flow rules an=
d
re-create them, it does not actually matter if they differ slightly at th=
e
HW level, as long as:

- Existing opaque flow rule pointers (rte_flow) are still valid to the PM=
D
  and refer to the same rules.

- The overall behavior of all rules is the same.

The list of rules you think of (patterns / actions) is maintained by
applications (not RTE), and only if they need them. RTE would needlessly
duplicate this.

--=20
Adrien Mazarguil
6WIND