From: John Fastabend
Subject: Re: [RFC] Generic flow director/filtering/classification API
Date: Wed, 10 Aug 2016 09:46:27 -0700
To: Rahul Lakkireddy, dev@dpdk.org, Thomas Monjalon, Helin Zhang,
 Jingjing Wu, Rasesh Mody, Ajit Khaparde, Wenzhuo Lu, Jan Medala,
 John Daley, Jing Chen, Konstantin Ananyev, Matej Vido,
 Alejandro Lucero, Sony Chacko, Jerin Jacob, Pablo de Lara,
 Olga Shern, Kumar A S, Nirranjan Kirubaharan, Indranil Choudhury

On 16-08-10 06:37 AM, Adrien Mazarguil wrote:
> On Tue, Aug 09, 2016 at 02:47:44PM -0700, John Fastabend wrote:
>> On 16-08-04 06:24 AM, Adrien Mazarguil wrote:
>>> On Wed, Aug 03, 2016 at 12:11:56PM -0700, John Fastabend wrote:
> [...]
>>>> The problem is keeping priorities in order and/or possibly breaking
>>>> rules apart (e.g. you have an L2 table and an L3 table) becomes very
>>>> complex to manage at the driver level. I think it's easier for the
>>>> application, which has some context, to do this. The application
>>>> "knows" if it's a router, for example, and will likely be able to
>>>> pack rules better than a PMD will.
>>>
>>> I don't think most applications know they are L2 or L3 routers. They
>>> may not know more than the pattern provided to the PMD, which may
>>> indeed end at an L2 or L3 protocol. If the application simply chooses
>>> a table based on this information, then the PMD could have easily done
>>> the same.
>>
>> But when we start thinking about encap/decap then it's natural to start
>> using this interface to implement various forwarding dataplanes. And one
>> common way to organize a switch is into TEP, router, switch (mac/vlan),
>> ACL tables, etc. In fact we see this topology starting to show up in the
>> NICs now.
>>
>> Further, each table may be "managed" by a different entity, in which
>> case the software will want to manage the physical and virtual networks
>> separately.
>>
>> It doesn't make sense to me to require a software aggregator object to
>> marshal the rules into a flat table only for a PMD to split them apart
>> again.
>
> OK, my point was mostly about handling basic cases easily and making sure
> applications do not have to bother with petty HW details when they do not
> want to, yet still get maximum performance by having the PMD make the
> most appropriate choices automatically.
>
> You've convinced me that in many cases PMDs won't be able to optimize
> efficiently and that conscious applications will know better. The API has
> to provide the ability to do so.
> I think it's fine as long as it is not mandatory.

Great. I also agree making the table feature _not_ mandatory for many use
cases will be helpful. I'm just making sure we get all the use cases I
know of covered.

>>> I understand the issue is what happens when applications really want to
>>> define e.g. L2/L3/L2 rules in this specific order (or any ordering that
>>> cannot be satisfied by HW due to table constraints).
>>>
>>> By exposing tables, in such a case applications should move all rules
>>> from an L2 to an L3 table themselves (assuming this is even supported)
>>> to guarantee ordering between rules, or fail to add them. This is
>>> basically what the PMD could have done, possibly in a more efficient
>>> manner in my opinion.
>>
>> I disagree with the more efficient comment :)
>>
>> If the software layer is working on L2/TEP/ACL/router layers, merging
>> them just to pull them back apart is not going to be more efficient.
>
> Moving flow rules around cannot be efficient by definition. However, I
> think that attempting to describe table capabilities may be as
> complicated as describing HW bit-masking features. Applications may get
> it wrong as a result while a PMD would not make any mistake.
>
> Your use case is valid though: if the application already groups rules,
> then sharing this information with the PMD would make sense from a
> performance standpoint.
>
>>> Let's assume two opposite scenarios for this discussion:
>>>
>>> - App #1 is a command-line interface directly mapped to flow rules,
>>>   which basically gets slow random input from users depending on how
>>>   they want to configure their traffic. All rules differ considerably
>>>   (L2, L3, L4, some with incomplete bit-masks, etc). All in all, few
>>>   but complex rules with specific priorities.
>>
>> Agree with this, and in this case the application should be behind any
>> physical/virtual network and not giving rules like encap/decap/etc. This
>> application either sits on the physical function and "owns" the hardware
>> resource or sits behind a virtual switch.
>>
>>> - App #2 is something like OVS, creating and deleting a large number of
>>>   very specific (without incomplete bit-masks) and mostly identical
>>>   single-priority rules automatically and very frequently.
>>
>> Maybe for OVS, but not all virtual switches are built with flat tables
>> at the bottom like this. Nor is it necessarily optimal.
>>
>> Another application (the one I'm concerned about :) would be built as
>> a pipeline, something like
>>
>>   ACL -> TEP -> ACL -> VEB -> ACL
>>
>> If I have hardware that supports a TEP hardware block, an ACL hardware
>> block, and a VEB block, for example, I don't want to merge my control
>> plane into a single table. The merging in this case is just pure
>> overhead/complexity for no gain.
>
> It could be done by dedicating priority ranges to each item in the
> pipeline, but then it would be clunky. OK then, let's discuss the best
> approach to implement this.
>
> [...]
>>>> It's not about mask vs no mask. The devices with multiple tables that
>>>> I have don't have this mask limitation. It's about how to optimally
>>>> pack the rules and who implements that logic. I think it's best done
>>>> in the application where I have the context.
>>>>
>>>> Is there a way to omit the table field if the PMD is expected to do
>>>> a best effort, and add the table field if the user wants explicit
>>>> control over table mgmt? This would support both models.
>>>> I at least would like to have explicit control over rule population
>>>> in my pipeline for use cases where I'm building a pipeline on top of
>>>> the hardware.
>>>
>>> Yes, that's a possibility. Perhaps the table ID to use could be
>>> specified as a meta pattern item? We'd still need methods to report
>>> how many tables exist and perhaps some way to report their
>>> limitations; these could be added later through a separate set of
>>> functions.
>>
>> Sure, I think a meta pattern item would be fine, or put it in the API
>> call directly, something like
>>
>>   rte_flow_create(port_id, pattern, actions);
>>   rte_flow_create_table(port_id, table_id, pattern, actions);
>
> I suggest using a common method for both cases; either seems fine to me,
> as long as a default table value can be provided (zero) when
> applications do not care.

Works for me, just use zero as the default when the application has no
preference and expects the PMD to do the table mapping.

> Now about table management, I think there is no need to expose table
> capabilities (in case they have different capabilities) but instead
> provide guidelines as part of the specification to encourage application
> writers to group similar rules in tables. As previously discussed, flow
> rule priorities would be specific to the table they are assigned to.

This seems sufficient to me.

> Like flow rules, table priorities could be handled through their index,
> with index 0 having the highest priority. Like flow rule priorities,
> table indices wouldn't have to be contiguous.
>
> If this works for you, how about renaming "tables" to "groups"?

Works for me. And actually I like renaming them "groups" as this seems
more neutral to how the hardware actually implements a group. For
example, I've worked on hardware with multiple Tunnel Endpoint engines
but we exposed it as a single "group" to simplify the user interface.

.John
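
P.S. To make the two models concrete, here is a rough sketch of how I
picture an application using them. Illustrative only: the signatures
follow the two variants discussed above and are not final, the group
names are made up, and the patterns/actions are elided.

  /* Groups pinned to pipeline stages; group 0 stays the "don't care"
   * default per the discussion above.
   */
  enum app_group {
      APP_GROUP_DEFAULT = 0,  /* PMD places the rule wherever it wants */
      APP_GROUP_TEP     = 1,  /* tunnel endpoint engine */
      APP_GROUP_VEB     = 2,  /* MAC/VLAN switching */
  };

  /* Model 1: application has no preference, PMD does the mapping. */
  flow = rte_flow_create(port_id, pattern, actions);

  /* Model 2: explicit placement keeps the ACL -> TEP -> VEB ordering
   * under application control instead of PMD heuristics.
   */
  tep_flow = rte_flow_create_table(port_id, APP_GROUP_TEP,
                                   tep_pattern, tep_actions);
  veb_flow = rte_flow_create_table(port_id, APP_GROUP_VEB,
                                   veb_pattern, veb_actions);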