All of lore.kernel.org
 help / color / mirror / Atom feed
* Flows! Offload them.
@ 2015-02-26  7:42 Jiri Pirko
  2015-02-26  8:38 ` Simon Horman
                   ` (3 more replies)
  0 siblings, 4 replies; 53+ messages in thread
From: Jiri Pirko @ 2015-02-26  7:42 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, jpettit,
	joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli, roopa,
	linville, simon.horman, shrijeet, gospo, bcrl

Hello everyone.

I would like to discuss big next step for switch offloading. Probably
the most complicated one we have so far. That is to be able to offload flows.
Leaving nftables aside for a moment, I see 2 big usecases:
- TC filters and actions offload.
- OVS key match and actions offload.

I think it might sense to ignore OVS for now. The reason is ongoing efford
to replace OVS kernel datapath with TC subsystem. After that, OVS offload
will not longer be needed and we'll get it for free with TC offload
implementation. So we can focus on TC now.

Here is my list of actions to achieve some results in near future:
1) finish cls_openflow classifier and iproute part of it
2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
3) use rocker to provide offload for cls_openflow and couple of selected actions
4) improve cls_openflow performance (hashtables etc)
5) improve TC subsystem performance in both slow and fast path
    -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
6) implement "named sockets" (working name) and implement TC support for that
    -ingress qdisc attach, act_mirred target
7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
8) implement TC act_mpls
9) suggest to switch OVS userspace from OVS genl to TC API

This is my personal action list, but you are *very welcome* to step in to help.
Point 2) haunts me at night....
I believe that John is already working on 2) and part of 3).

What do you think?

Thanks!

Jiri

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  7:42 Flows! Offload them Jiri Pirko
@ 2015-02-26  8:38 ` Simon Horman
  2015-02-26  9:16   ` Jiri Pirko
  2015-02-26 11:22 ` Sowmini Varadhan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 53+ messages in thread
From: Simon Horman @ 2015-02-26  8:38 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	jpettit, joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli,
	roopa, linville, shrijeet, gospo, bcrl

Hi Jiri,

On Thu, Feb 26, 2015 at 08:42:14AM +0100, Jiri Pirko wrote:
> Hello everyone.
> 
> I would like to discuss big next step for switch offloading. Probably
> the most complicated one we have so far. That is to be able to offload flows.
> Leaving nftables aside for a moment, I see 2 big usecases:
> - TC filters and actions offload.
> - OVS key match and actions offload.
> 
> I think it might sense to ignore OVS for now. The reason is ongoing efford
> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
> will not longer be needed and we'll get it for free with TC offload
> implementation. So we can focus on TC now.
> 
> Here is my list of actions to achieve some results in near future:
> 1) finish cls_openflow classifier and iproute part of it
> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
> 3) use rocker to provide offload for cls_openflow and couple of selected actions
> 4) improve cls_openflow performance (hashtables etc)
> 5) improve TC subsystem performance in both slow and fast path
>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
> 6) implement "named sockets" (working name) and implement TC support for that
>     -ingress qdisc attach, act_mirred target
> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
> 8) implement TC act_mpls
> 9) suggest to switch OVS userspace from OVS genl to TC API
> 
> This is my personal action list, but you are *very welcome* to step in to help.
> Point 2) haunts me at night....
> I believe that John is already working on 2) and part of 3).
> 
> What do you think?

>From my point of view the question of replacing the kernel datapath with TC
is orthogonal to the question of flow offloads. This is because I believe
there is some consensus around the idea that, at least in the case of Open
vSwitch, the decision to offload flows should made in user-space where
flows are already managed. And in that case datapath will not be
transparently offloading of flows.  And thus flow offload may be performed
independently of the kernel datapath, weather that be via flow manipulation
portions of John's Flow API, TC, or some other means.

Regardless of the above, I have three question relating to the scheme you
outline above:

1. Open vSwitch flows are independent of a device. My recollection
   is that while they typically match in the in_port (ingress port)
   this is not a requirement. Conversely my understanding is that
   TC classifiers attach to a netdev. I'm wondering how this
   difference can be reconciled.

   I asked this question at your presentation at Netdev 0.1 and Jamal
   indicated a possibility was to attach to the bridge netdev. But unless I
   misunderstand things that would actually have the effect of a flow
   matching in_port=host.

   Of course things could be changed around to give the behaviour that
   Jamal described. Or perhaps it is already the case. But then
   how would one match on in_port=host?

2. In a similar vein, does the named sockets approach allow for the scheme
   that Open vSwitch supports of matching on in_port=tunnel_port.

3. As mentioned above my understanding is that there is some consensus that
   there should be a mechanism to allow decisions about which flows are
   offloaded to be managed by user-space.

   It seems to me that could be achieved within the context of what
   you describe above using a flag or similar denoting weather a flow
   should be added to hardware or software. Or perhaps two flags allowing
   for a flow to be added to both hardware and software. Am I on the
   right track here?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  8:38 ` Simon Horman
@ 2015-02-26  9:16   ` Jiri Pirko
  2015-02-26 13:33     ` Thomas Graf
  2015-02-26 16:04     ` Tom Herbert
  0 siblings, 2 replies; 53+ messages in thread
From: Jiri Pirko @ 2015-02-26  9:16 UTC (permalink / raw)
  To: Simon Horman
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	jpettit, joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli,
	roopa, linville, shrijeet, gospo, bcrl

Thu, Feb 26, 2015 at 09:38:01AM CET, simon.horman@netronome.com wrote:
>Hi Jiri,
>
>On Thu, Feb 26, 2015 at 08:42:14AM +0100, Jiri Pirko wrote:
>> Hello everyone.
>> 
>> I would like to discuss big next step for switch offloading. Probably
>> the most complicated one we have so far. That is to be able to offload flows.
>> Leaving nftables aside for a moment, I see 2 big usecases:
>> - TC filters and actions offload.
>> - OVS key match and actions offload.
>> 
>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>> will not longer be needed and we'll get it for free with TC offload
>> implementation. So we can focus on TC now.
>> 
>> Here is my list of actions to achieve some results in near future:
>> 1) finish cls_openflow classifier and iproute part of it
>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>> 4) improve cls_openflow performance (hashtables etc)
>> 5) improve TC subsystem performance in both slow and fast path
>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>> 6) implement "named sockets" (working name) and implement TC support for that
>>     -ingress qdisc attach, act_mirred target
>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>> 8) implement TC act_mpls
>> 9) suggest to switch OVS userspace from OVS genl to TC API
>> 
>> This is my personal action list, but you are *very welcome* to step in to help.
>> Point 2) haunts me at night....
>> I believe that John is already working on 2) and part of 3).
>> 
>> What do you think?
>
>From my point of view the question of replacing the kernel datapath with TC
>is orthogonal to the question of flow offloads. This is because I believe
>there is some consensus around the idea that, at least in the case of Open
>vSwitch, the decision to offload flows should made in user-space where
>flows are already managed. And in that case datapath will not be
>transparently offloading of flows.  And thus flow offload may be performed
>independently of the kernel datapath, weather that be via flow manipulation
>portions of John's Flow API, TC, or some other means.

Well, on netdev01, I believe that a consensus was reached that for every
switch offloaded functionality there has to be an implementation in
kernel. What John's Flow API originally did was to provide a way to
configure hardware independently of kernel. So the right way is to
configure kernel and, if hw allows it, to offload the configuration to hw.

In this case, seems to me logical to offload from one place, that being
TC. The reason is, as I stated above, the possible conversion from OVS
datapath to TC.

>
>Regardless of the above, I have three question relating to the scheme you
>outline above:
>
>1. Open vSwitch flows are independent of a device. My recollection
>   is that while they typically match in the in_port (ingress port)
>   this is not a requirement. Conversely my understanding is that
>   TC classifiers attach to a netdev. I'm wondering how this
>   difference can be reconciled.

What I plan as well, and forgot to mention it in my list, is to provide
a possibility to bind one ingress qdisc instance to multiple devices.
The main reason is to avoid duplication of cls and act instances.

But even without this change, you can have per-dev ingress qdisc with
same cls and acts. There you do not have to match on in_port.


>
>   I asked this question at your presentation at Netdev 0.1 and Jamal
>   indicated a possibility was to attach to the bridge netdev. But unless I
>   misunderstand things that would actually have the effect of a flow
>   matching in_port=host.

No, bridge is not in the picture. Just select couple of netdevices,
attach ingress qdisc and push cls and acts there.

>
>   Of course things could be changed around to give the behaviour that
>   Jamal described. Or perhaps it is already the case. But then
>   how would one match on in_port=host?
>
>2. In a similar vein, does the named sockets approach allow for the scheme
>   that Open vSwitch supports of matching on in_port=tunnel_port.

That I plan to implement. I have to look at this more deeper, but the
idea is to be able to attach ingress qdisc to this named socket.

>
>3. As mentioned above my understanding is that there is some consensus that
>   there should be a mechanism to allow decisions about which flows are
>   offloaded to be managed by user-space.
>
>   It seems to me that could be achieved within the context of what
>   you describe above using a flag or similar denoting weather a flow
>   should be added to hardware or software. Or perhaps two flags allowing
>   for a flow to be added to both hardware and software. Am I on the
>   right track here?

Yes, I believe that this should be implemented in one way or another. I
have to think about this a bit more. I think that flows should be
inserted in kernel always and optionally to enable/disable insertion to hw.


Thanks!

Jiri

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  7:42 Flows! Offload them Jiri Pirko
  2015-02-26  8:38 ` Simon Horman
@ 2015-02-26 11:22 ` Sowmini Varadhan
  2015-02-26 11:39   ` Jiri Pirko
  2015-02-26 12:51 ` Thomas Graf
  2015-02-26 19:32 ` Florian Fainelli
  3 siblings, 1 reply; 53+ messages in thread
From: Sowmini Varadhan @ 2015-02-26 11:22 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	jpettit, joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli,
	roopa, linville, simon.horman, shrijeet, gospo, bcrl

On (02/26/15 08:42), Jiri Pirko wrote:
> 6) implement "named sockets" (working name) and implement TC support for that
>     -ingress qdisc attach, act_mirred target
> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets

Can you elaborate a bit on the above two?

FWIW I've been looking at the problem of RDS over TCP, which is
an instance of layered sockets that tunnels the application payload
in TCP. 

RDS over IB provides QoS support using the features available in
IB- to supply an analog of that for RDS-TCP, you'd need to plug
into tc's CBQ support, and also provide hooks for packet (.1p, dscp)
marking. 

Perhaps there is some overlap to what you are thinking of in #6 and #7 
above?

--Sowmini

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 11:22 ` Sowmini Varadhan
@ 2015-02-26 11:39   ` Jiri Pirko
  2015-02-26 15:42     ` Sowmini Varadhan
  2015-02-27 13:15     ` Named sockets WAS(Re: " Jamal Hadi Salim
  0 siblings, 2 replies; 53+ messages in thread
From: Jiri Pirko @ 2015-02-26 11:39 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	jpettit, joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli,
	roopa, linville, simon.horman, shrijeet, gospo, bcrl

Thu, Feb 26, 2015 at 12:22:52PM CET, sowmini.varadhan@oracle.com wrote:
>On (02/26/15 08:42), Jiri Pirko wrote:
>> 6) implement "named sockets" (working name) and implement TC support for that
>>     -ingress qdisc attach, act_mirred target
>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>
>Can you elaborate a bit on the above two?

Sure. If you look into net/openvswitch/vport-vxlan.c for example, there
is a socket created by vxlan_sock_add. vxlan_rcv is called on rx and
vxlan_xmit_skb to xmit.

What I have on mind is to allow to create tunnels using "ip" but not as
a device but rather just as a wrapper of these functions (and others alike).

To identify the instance we name it (OVS has it identified and vport).
After that, tc could allow to attach ingress qdisk not only to a device,
but to this named socket as well. Similary with tc action mirred, it would
be possible to forward not only to a device, but to this named socket as
well. All should be very light.


>
>FWIW I've been looking at the problem of RDS over TCP, which is
>an instance of layered sockets that tunnels the application payload
>in TCP. 
>
>RDS over IB provides QoS support using the features available in
>IB- to supply an analog of that for RDS-TCP, you'd need to plug
>into tc's CBQ support, and also provide hooks for packet (.1p, dscp)
>marking. 
>
>Perhaps there is some overlap to what you are thinking of in #6 and #7 
>above?

I'm not talking about QoS at all. See the description above.

Jiri

>
>--Sowmini

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  7:42 Flows! Offload them Jiri Pirko
  2015-02-26  8:38 ` Simon Horman
  2015-02-26 11:22 ` Sowmini Varadhan
@ 2015-02-26 12:51 ` Thomas Graf
  2015-02-26 13:17   ` Jiri Pirko
  2015-02-26 19:32 ` Florian Fainelli
  3 siblings, 1 reply; 53+ messages in thread
From: Thomas Graf @ 2015-02-26 12:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, jpettit,
	joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli, roopa,
	linville, simon.horman, shrijeet, gospo, bcrl

On 02/26/15 at 08:42am, Jiri Pirko wrote:
> Hello everyone.
> 
> I would like to discuss big next step for switch offloading. Probably
> the most complicated one we have so far. That is to be able to offload flows.
> Leaving nftables aside for a moment, I see 2 big usecases:
> - TC filters and actions offload.
> - OVS key match and actions offload.
> 
> I think it might sense to ignore OVS for now. The reason is ongoing efford
> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
> will not longer be needed and we'll get it for free with TC offload
> implementation. So we can focus on TC now.
>
> Here is my list of actions to achieve some results in near future:
> 1) finish cls_openflow classifier and iproute part of it

I still think that you should consider renaming this or merging
it with cls_flow. I don't see any relation to OpenFlow in what
you proposed in the last RFC.

> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
> 3) use rocker to provide offload for cls_openflow and couple of selected actions
> 4) improve cls_openflow performance (hashtables etc)
> 5) improve TC subsystem performance in both slow and fast path
>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
> 6) implement "named sockets" (working name) and implement TC support for that
>     -ingress qdisc attach, act_mirred target
> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
> 8) implement TC act_mpls
> 9) suggest to switch OVS userspace from OVS genl to TC API

I think everybody agrees to unifying code paths and getting rid
of parallism assuming that it does not introduce a performance
regression for flow setup rate, throughput, and scale.

I would also include Linux bridge in this effort as it's also
based on programmable flows and would thus also benefit from
the implemented offload functionality.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 12:51 ` Thomas Graf
@ 2015-02-26 13:17   ` Jiri Pirko
  0 siblings, 0 replies; 53+ messages in thread
From: Jiri Pirko @ 2015-02-26 13:17 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, jpettit,
	joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli, roopa,
	linville, simon.horman, shrijeet, gospo, bcrl

Thu, Feb 26, 2015 at 01:51:43PM CET, tgraf@suug.ch wrote:
>On 02/26/15 at 08:42am, Jiri Pirko wrote:
>> Hello everyone.
>> 
>> I would like to discuss big next step for switch offloading. Probably
>> the most complicated one we have so far. That is to be able to offload flows.
>> Leaving nftables aside for a moment, I see 2 big usecases:
>> - TC filters and actions offload.
>> - OVS key match and actions offload.
>> 
>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>> will not longer be needed and we'll get it for free with TC offload
>> implementation. So we can focus on TC now.
>>
>> Here is my list of actions to achieve some results in near future:
>> 1) finish cls_openflow classifier and iproute part of it
>
>I still think that you should consider renaming this or merging
>it with cls_flow. I don't see any relation to OpenFlow in what
>you proposed in the last RFC.

cls_flow does something different. I believe that it is not a good idea
to merge it with cls_openflow.

Relation to OpenFlow is that the cls follows the specification in what
fields of pks should be used for matching.

But I get your point. cls_openflow is the working name. I have no
problem in changing it so something different.


>
>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>> 4) improve cls_openflow performance (hashtables etc)
>> 5) improve TC subsystem performance in both slow and fast path
>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>> 6) implement "named sockets" (working name) and implement TC support for that
>>     -ingress qdisc attach, act_mirred target
>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>> 8) implement TC act_mpls
>> 9) suggest to switch OVS userspace from OVS genl to TC API
>
>I think everybody agrees to unifying code paths and getting rid
>of parallism assuming that it does not introduce a performance
>regression for flow setup rate, throughput, and scale.

Great. I sure plan to focus on performance (on both fast and slow path).

>
>I would also include Linux bridge in this effort as it's also
>based on programmable flows and would thus also benefit from
>the implemented offload functionality.

Well I would leave the bridge alone for now. We are able to offload fdbs
there now easily. But I'm open to the discussion for the future.

Thanks!

Jiri

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  9:16   ` Jiri Pirko
@ 2015-02-26 13:33     ` Thomas Graf
  2015-02-26 15:23       ` John Fastabend
                         ` (2 more replies)
  2015-02-26 16:04     ` Tom Herbert
  1 sibling, 3 replies; 53+ messages in thread
From: Thomas Graf @ 2015-02-26 13:33 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Simon Horman, netdev, davem, nhorman, andy, dborkman, ogerlitz,
	jesse, jpettit, joestringer, john.r.fastabend, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On 02/26/15 at 10:16am, Jiri Pirko wrote:
> Well, on netdev01, I believe that a consensus was reached that for every
> switch offloaded functionality there has to be an implementation in
> kernel.

Agreed. This should not prevent the policy being driven from user
space though.

> What John's Flow API originally did was to provide a way to
> configure hardware independently of kernel. So the right way is to
> configure kernel and, if hw allows it, to offload the configuration to hw.
> 
> In this case, seems to me logical to offload from one place, that being
> TC. The reason is, as I stated above, the possible conversion from OVS
> datapath to TC.

Offloading of TC definitely makes a lot of sense. I think that even in
that case you will already encounter independent configuration of
hardware and kernel. Example: The hardware provides a fixed, generic
function to push up to n bytes onto a packet. This hardware function
could be used to implement TC actions "push_vlan", "push_vxlan",
"push_mpls". You would you would likely agree that TC should make use
of such a function even if the hardware version is different from the
software version. So I don't think we'll have a 1:1 mapping for all
configurations, regardless of whether the how is decided in kernel or
user space.

My primiary concern of *only* allowing to decide how to program the
hardware in the kernel is the lack of context; A given L3/L4 software
pipeline in the Linux kernel consists of various subsystems: tc
ingress, linux bridge, various iptables chains, routing rules, routing
tables, tc egress, etc. All of them can be stacked in almost unlimited
combinations using virtual software devices and segmented using
net namespaces.

Given this complexity we'll most likely have to solve some of it with
a flag to control offloading (as already introduced for bridging) and
allow the user to shoot himself in the foot (as Jamal and others
pointed out a couple of times). I currently don't see how the kernel
could *always* get it right automatically. We need some additional
input from the user (See also Patrick's comments regarding iptables
offload)

However, for certain datacenter server use cases we actually have the
full user intent in user space as we configure all of the kernel
subsystems from a single central management agent running locally
on the server (OpenStack, Kubernetes, Mesos, ...), i.e. we do know
exactly what the user wants on the system as a whole. This intent is
then split into small configuration pieces to configure iptables, tc,
routes on multiple net namespaces (for example to implement VRF).

E.g. A VRF in software would make use of net namespaces which holds
tenant specific ACLs, routes and QoS settings. A separate action
would fwd packets to the namespace. Easy and straight forward in
software. OTOH, the hardware, capable of implementing the ACLs,
would also need to know about the tc action which selected the
namespace when attempting to offload the ACL as it would otherwise
ACLs to wrong packets.

I would love to have the possibility to make use of that rich intent
avaiable in user space to program the hardware in combination with
configuring the kernel.

Would love to hear your thoughts on this. I think we all share the same
goal which is to have in-kernel drivers for chips which can perform
advanced switching and support it natively with Linux and have it
become the de-facto standard for both hardware switch management and
compute servers.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 13:33     ` Thomas Graf
@ 2015-02-26 15:23       ` John Fastabend
  2015-02-26 20:16         ` Neil Horman
       [not found]       ` <CAGpadYGrjfkZqe0k7D05+cy3pY=1hXZtQqtV0J-8ogU80K7BUQ@mail.gmail.com>
  2015-02-26 17:38       ` David Ahern
  2 siblings, 1 reply; 53+ messages in thread
From: John Fastabend @ 2015-02-26 15:23 UTC (permalink / raw)
  To: Thomas Graf, Jiri Pirko
  Cc: Simon Horman, netdev, davem, nhorman, andy, dborkman, ogerlitz,
	jesse, jpettit, joestringer, jhs, sfeldma, f.fainelli, roopa,
	linville, shrijeet, gospo, bcrl

On 02/26/2015 05:33 AM, Thomas Graf wrote:
> On 02/26/15 at 10:16am, Jiri Pirko wrote:
>> Well, on netdev01, I believe that a consensus was reached that for every
>> switch offloaded functionality there has to be an implementation in
>> kernel.
> 
> Agreed. This should not prevent the policy being driven from user
> space though.
> 
>> What John's Flow API originally did was to provide a way to
>> configure hardware independently of kernel. So the right way is to
>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>
>> In this case, seems to me logical to offload from one place, that being
>> TC. The reason is, as I stated above, the possible conversion from OVS
>> datapath to TC.
> 
> Offloading of TC definitely makes a lot of sense. I think that even in
> that case you will already encounter independent configuration of
> hardware and kernel. Example: The hardware provides a fixed, generic
> function to push up to n bytes onto a packet. This hardware function
> could be used to implement TC actions "push_vlan", "push_vxlan",
> "push_mpls". You would you would likely agree that TC should make use
> of such a function even if the hardware version is different from the
> software version. So I don't think we'll have a 1:1 mapping for all
> configurations, regardless of whether the how is decided in kernel or
> user space.

Just to expand slightly on this. I don't think you can get to a 1:1
mapping here. One reason is hardware typically has a TCAM and limited
size. So you need a _policy_ to determine when to push rules into the
hardware. The kernel doesn't know when to do this and I don't believe
its the kernel's place to start enforcing policy like this. One thing I likely
need to do is get some more "worlds" in rocker so we aren't stuck only
thinking about the infinite size OF_DPA world. The OF_DPA world is only
one world and not a terribly flexible one at that when compared with the
NPU folk. So minimally you need a flag to indicate rules go into hardware
vs software.

That said I think the bigger mismatch between software and hardware is
you program it differently because the data structures are different. Maybe
a u32 example would help. For parsing with u32 you might build a parse
graph with a root and some leaf nodes. In hardware you want to collapse
this down onto the hardware. I argue this is not a kernel task because
there are lots of ways to do this and there are trade-offs made with
respect to space and performance and which table to use when it could be
handled by a set of tables. Another example is a virtual switch possibly
OVS but we have others. The software does some "unmasking" (there term)
before sending the rules into the software dataplane cache. Basically this
means we can ignore priority in the hash lookup. However this is not how you
would optimally use hardware. Maybe I should do another write up with
some more concrete examples.

There are also lots of use cases to _not_ have hardware and software in
sync. A flag allows this.

My only point is I think we need to allow users to optimally use there
hardware either via 'tc' or my previous 'flow' tool. Actually in my
opinion I still think its best to have both interfaces.

I'll go get some coffee now and hopefully that is somewhat clear.

> 
> My primiary concern of *only* allowing to decide how to program the
> hardware in the kernel is the lack of context; A given L3/L4 software
> pipeline in the Linux kernel consists of various subsystems: tc
> ingress, linux bridge, various iptables chains, routing rules, routing
> tables, tc egress, etc. All of them can be stacked in almost unlimited
> combinations using virtual software devices and segmented using
> net namespaces.
> 
> Given this complexity we'll most likely have to solve some of it with
> a flag to control offloading (as already introduced for bridging) and
> allow the user to shoot himself in the foot (as Jamal and others
> pointed out a couple of times). I currently don't see how the kernel
> could *always* get it right automatically. We need some additional
> input from the user (See also Patrick's comments regarding iptables
> offload)
> 
> However, for certain datacenter server use cases we actually have the
> full user intent in user space as we configure all of the kernel
> subsystems from a single central management agent running locally
> on the server (OpenStack, Kubernetes, Mesos, ...), i.e. we do know
> exactly what the user wants on the system as a whole. This intent is
> then split into small configuration pieces to configure iptables, tc,
> routes on multiple net namespaces (for example to implement VRF).
> 
> E.g. A VRF in software would make use of net namespaces which holds
> tenant specific ACLs, routes and QoS settings. A separate action
> would fwd packets to the namespace. Easy and straight forward in
> software. OTOH, the hardware, capable of implementing the ACLs,
> would also need to know about the tc action which selected the
> namespace when attempting to offload the ACL as it would otherwise
> ACLs to wrong packets.
> 
> I would love to have the possibility to make use of that rich intent
> avaiable in user space to program the hardware in combination with
> configuring the kernel.
> 
> Would love to hear your thoughts on this. I think we all share the same
> goal which is to have in-kernel drivers for chips which can perform
> advanced switching and support it natively with Linux and have it
> become the de-facto standard for both hardware switch management and
> compute servers.
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
       [not found]       ` <CAGpadYGrjfkZqe0k7D05+cy3pY=1hXZtQqtV0J-8ogU80K7BUQ@mail.gmail.com>
@ 2015-02-26 15:39         ` John Fastabend
       [not found]           ` <CAGpadYHfNcDR2ojubkCJ8-nJTQkdLkPsAwJu0wOKU82bLDzhww@mail.gmail.com>
  2015-02-27 13:33           ` Jamal Hadi Salim
  0 siblings, 2 replies; 53+ messages in thread
From: John Fastabend @ 2015-02-26 15:39 UTC (permalink / raw)
  To: Shrijeet Mukherjee, Thomas Graf
  Cc: Jiri Pirko, Simon Horman, netdev, davem, nhorman, andy, dborkman,
	ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma, f.fainelli,
	roopa, linville, gospo, bcrl

On 02/26/2015 07:25 AM, Shrijeet Mukherjee wrote:
>     However, for certain datacenter server use cases we actually have the
>     full user intent in user space as we configure all of the kernel
>     subsystems from a single central management agent running locally
>     on the server (OpenStack, Kubernetes, Mesos, ...), i.e. we do know
>     exactly what the user wants on the system as a whole. This intent is
>     then split into small configuration pieces to configure iptables, tc,
>     routes on multiple net namespaces (for example to implement VRF).
> 
>     E.g. A VRF in software would make use of net namespaces which holds
>     tenant specific ACLs, routes and QoS settings. A separate action
>     would fwd packets to the namespace. Easy and straight forward in
>     software. OTOH, the hardware, capable of implementing the ACLs,
>     would also need to know about the tc action which selected the
>     namespace when attempting to offload the ACL as it would otherwise
>     ACLs to wrong packets.
> 
> 
> This is a new angle that I believe we have talked around in the context of user space policy, but not really considered.
> 
> So the issue is what if you have a classifier and forward action which points to a device which the element doing the classification does not have access to right ?
> 
> This problem obliquely showed up in the context of route table entries not in the "external" table but present in the software tables as well.
> 
> Maybe the scheme requires an implicit "send to software" device which then diverts traffic to the right place ? Would creating an implicit, un-offload device address these concerns ?

So I think there is a relatively simple solution for this. Assuming
I read the description correctly namely packet ingress' nic/switch
and you want it to land in a namespace.

Today we support offloaded macvlan's and SR-IOV. What I would expect
is user creates a set of macvlan's that are "offloaded" this just means
they are bound to a set of hardware queues and do not go through the
normal receive path. Then assigning these to a namespace is the same
as any other netdev.

Hardware has an action to forward to "VSI" (virtual station interface)
which matches on a packet and forwards it to either a VF or set of 
queues bound to a macvlan. Or you can do the forwarding using standards
based protocols such as EVB (edge virtual bridging).

So its a simple set of steps with the flow api,

	1. create macvlan with dfwd_offload set
	2. push netdev into namespace
	3. add flow rule to match traffic and send to VSI
		./flow -i ethx set_rule match xyz action fwd_vsi 3

The VSI# is reported by ip link today its a bit clumsy so that interface
could be cleaned up.

Here is a case where trying to map this onto a 'tc' action in software
is a bit awkward and you convoluted what is really a simple operation.
Anyways this is not really an "offload" in the sense that your taking
something that used to run in software and moving it 1:1 into hardware.
Adding SR-IOV/VMDQ support requries new constructs. By the way if you
don't like my "flow" tool and you want to move it onto "tc" that could
be done as well but the steps are the same.

.John

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 11:39   ` Jiri Pirko
@ 2015-02-26 15:42     ` Sowmini Varadhan
  2015-02-27 13:15     ` Named sockets WAS(Re: " Jamal Hadi Salim
  1 sibling, 0 replies; 53+ messages in thread
From: Sowmini Varadhan @ 2015-02-26 15:42 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	jpettit, joestringer, john.r.fastabend, jhs, sfeldma, f.fainelli,
	roopa, linville, simon.horman, shrijeet, gospo, bcrl

> 
> Sure. If you look into net/openvswitch/vport-vxlan.c for example, there
> is a socket created by vxlan_sock_add.
>  vxlan_rcv is called on rx and vxlan_xmit_skb to xmit.
   :
> What I have on mind is to allow to create tunnels using "ip" but not as
> a device but rather just as a wrapper of these functions (and others alike).

Could you elaborate on what the wrapper will look like? will
it be a socket? or something else?

For contextual comparison:
  For RDS, the listen side of the TCP socket is created when the 
  rds_tcp module is initialized. The client side is created when a RDS
  packet is sent out In the case of RDS, something similar is achieved
  by creating a PF_RDS socket, which can then be used as a datagram socket
  (i.e., no need to do connect/accept). In the rds module, what happens is
  that the rds_sock gets plumbed up with the underlying kernel TCP socket.
  
  The the fanout per RDS port on the receive side happens via ->sk_data_ready
  (in rds_tcp_ready). On the send side, rds_sendmsg sets up the client
  socket (if necessary).  
  
  All of this is done such that multiple RDS sockets share a single 
  underlying kernel tcp socket.
  
  But perhaps there is one significant difference for vxlan- vxlan
  is encapsulating L2 frames in UDP, so the socket layering model
  may not fit so well, except when uspace is creating an entire L2 frame
  (which may be fine with ovs/dpdk, I'm not sure what scenarios you
  have in mind).

> To identify the instance we name it (OVS has it identified and vport).

not sure I follow the name-space you have in mind here, how is fanout 
going to be achieved?  (for rds, we determine which endpoint should get
the packet  based on the rds sport/dport)

> After that, tc could allow to attach ingress qdisk not only to a device,
> but to this named socket as well. Similary with tc action mirred, it would
> be possible to forward not only to a device, but to this named socket as
> well. All should be very light.

This is the part that I'm interested in.. in the RDS case, the flows
are going to be specified based on the sport/rport in the rds_header,
but as far as the rest of the tcp/ip stack is concerned, the rds_header
is just opaque payload bytes.  I realize tc and iptables support that
DPI in theory, and that one can use CLI interfaces to set this up
(I dont know if the system calls used by  tc are published as a
stable library to applications?) but I would be interested in 
kernel-socket options to set up the tc hooks so that operations on
the RDS socket can be translated into flows and other config 
on the shared tcp socket. 

> I'm not talking about QoS at all. See the description above.

Understood, but I mentioned qos because tc is typically used to specify
flows for QoS managing algorithms like cbq.

I realize that you are focussed on offloading some of this to h/w,
but you mentioned a "name-based" socket, and tc hooks (for flows in the
inner L2 frame?), and thats the design-detail I'm most interested in..

--Sowmini

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  9:16   ` Jiri Pirko
  2015-02-26 13:33     ` Thomas Graf
@ 2015-02-26 16:04     ` Tom Herbert
  2015-02-26 16:17       ` Jiri Pirko
  2015-02-26 18:16       ` Scott Feldman
  1 sibling, 2 replies; 53+ messages in thread
From: Tom Herbert @ 2015-02-26 16:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Simon Horman, Linux Netdev List, David Miller, Neil Horman,
	Andy Gospodarek, Thomas Graf, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, John Fastabend,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, shrijeet, Andy Gospodarek, bcrl

On Thu, Feb 26, 2015 at 1:16 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Feb 26, 2015 at 09:38:01AM CET, simon.horman@netronome.com wrote:
>>Hi Jiri,
>>
>>On Thu, Feb 26, 2015 at 08:42:14AM +0100, Jiri Pirko wrote:
>>> Hello everyone.
>>>
>>> I would like to discuss big next step for switch offloading. Probably
>>> the most complicated one we have so far. That is to be able to offload flows.
>>> Leaving nftables aside for a moment, I see 2 big usecases:
>>> - TC filters and actions offload.
>>> - OVS key match and actions offload.
>>>
>>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>>> will not longer be needed and we'll get it for free with TC offload
>>> implementation. So we can focus on TC now.
>>>
>>> Here is my list of actions to achieve some results in near future:
>>> 1) finish cls_openflow classifier and iproute part of it
>>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>>> 4) improve cls_openflow performance (hashtables etc)
>>> 5) improve TC subsystem performance in both slow and fast path
>>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>     -ingress qdisc attach, act_mirred target
>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>> 8) implement TC act_mpls
>>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>>
>>> This is my personal action list, but you are *very welcome* to step in to help.
>>> Point 2) haunts me at night....
>>> I believe that John is already working on 2) and part of 3).
>>>
>>> What do you think?
>>
> >From my point of view the question of replacing the kernel datapath with TC
>>is orthogonal to the question of flow offloads. This is because I believe
>>there is some consensus around the idea that, at least in the case of Open
>>vSwitch, the decision to offload flows should made in user-space where
>>flows are already managed. And in that case datapath will not be
>>transparently offloading of flows.  And thus flow offload may be performed
>>independently of the kernel datapath, weather that be via flow manipulation
>>portions of John's Flow API, TC, or some other means.
>
> Well, on netdev01, I believe that a consensus was reached that for every
> switch offloaded functionality there has to be an implementation in
> kernel. What John's Flow API originally did was to provide a way to
> configure hardware independently of kernel. So the right way is to
> configure kernel and, if hw allows it, to offload the configuration to hw.
>
> In this case, seems to me logical to offload from one place, that being
> TC. The reason is, as I stated above, the possible conversion from OVS
> datapath to TC.
>
Sorry if I'm asking dumb questions, but this is about where I usually
start to get lost in these discussions ;-). Is the aim of switch
offload to offload OVS or kernel functions of routing, iptables, tc,
etc.? These are very different I believe. As far as I can tell OVS
model of "flows" (like Openflow) is currently incompatible with the
rest of the kernel. So if the plan is convert OVS datapath to TC does
that mean introducing that model into core kernel?

Tom

>>
>>Regardless of the above, I have three question relating to the scheme you
>>outline above:
>>
>>1. Open vSwitch flows are independent of a device. My recollection
>>   is that while they typically match in the in_port (ingress port)
>>   this is not a requirement. Conversely my understanding is that
>>   TC classifiers attach to a netdev. I'm wondering how this
>>   difference can be reconciled.
>
> What I plan as well, and forgot to mention it in my list, is to provide
> a possibility to bind one ingress qdisc instance to multiple devices.
> The main reason is to avoid duplication of cls and act instances.
>
> But even without this change, you can have per-dev ingress qdisc with
> same cls and acts. There you do not have to match on in_port.
>
>
>>
>>   I asked this question at your presentation at Netdev 0.1 and Jamal
>>   indicated a possibility was to attach to the bridge netdev. But unless I
>>   misunderstand things that would actually have the effect of a flow
>>   matching in_port=host.
>
> No, bridge is not in the picture. Just select couple of netdevices,
> attach ingress qdisc and push cls and acts there.
>
>>
>>   Of course things could be changed around to give the behaviour that
>>   Jamal described. Or perhaps it is already the case. But then
>>   how would one match on in_port=host?
>>
>>2. In a similar vein, does the named sockets approach allow for the scheme
>>   that Open vSwitch supports of matching on in_port=tunnel_port.
>
> That I plan to implement. I have to look at this more deeper, but the
> idea is to be able to attach ingress qdisc to this named socket.
>
>>
>>3. As mentioned above my understanding is that there is some consensus that
>>   there should be a mechanism to allow decisions about which flows are
>>   offloaded to be managed by user-space.
>>
>>   It seems to me that could be achieved within the context of what
>>   you describe above using a flag or similar denoting weather a flow
>>   should be added to hardware or software. Or perhaps two flags allowing
>>   for a flow to be added to both hardware and software. Am I on the
>>   right track here?
>
> Yes, I believe that this should be implemented in one way or another. I
> have to think about this a bit more. I think that flows should be
> inserted in kernel always and optionally to enable/disable insertion to hw.
>
>
> Thanks!
>
> Jiri
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 16:04     ` Tom Herbert
@ 2015-02-26 16:17       ` Jiri Pirko
  2015-02-26 18:15         ` Tom Herbert
  2015-02-26 18:16       ` Scott Feldman
  1 sibling, 1 reply; 53+ messages in thread
From: Jiri Pirko @ 2015-02-26 16:17 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Simon Horman, Linux Netdev List, David Miller, Neil Horman,
	Andy Gospodarek, Thomas Graf, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, John Fastabend,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, shrijeet, Andy Gospodarek, bcrl

Thu, Feb 26, 2015 at 05:04:31PM CET, therbert@google.com wrote:
>On Thu, Feb 26, 2015 at 1:16 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, Feb 26, 2015 at 09:38:01AM CET, simon.horman@netronome.com wrote:
>>>Hi Jiri,
>>>
>>>On Thu, Feb 26, 2015 at 08:42:14AM +0100, Jiri Pirko wrote:
>>>> Hello everyone.
>>>>
>>>> I would like to discuss big next step for switch offloading. Probably
>>>> the most complicated one we have so far. That is to be able to offload flows.
>>>> Leaving nftables aside for a moment, I see 2 big usecases:
>>>> - TC filters and actions offload.
>>>> - OVS key match and actions offload.
>>>>
>>>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>>>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>>>> will not longer be needed and we'll get it for free with TC offload
>>>> implementation. So we can focus on TC now.
>>>>
>>>> Here is my list of actions to achieve some results in near future:
>>>> 1) finish cls_openflow classifier and iproute part of it
>>>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>>>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>>>> 4) improve cls_openflow performance (hashtables etc)
>>>> 5) improve TC subsystem performance in both slow and fast path
>>>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>>     -ingress qdisc attach, act_mirred target
>>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>>> 8) implement TC act_mpls
>>>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>>>
>>>> This is my personal action list, but you are *very welcome* to step in to help.
>>>> Point 2) haunts me at night....
>>>> I believe that John is already working on 2) and part of 3).
>>>>
>>>> What do you think?
>>>
>> >From my point of view the question of replacing the kernel datapath with TC
>>>is orthogonal to the question of flow offloads. This is because I believe
>>>there is some consensus around the idea that, at least in the case of Open
>>>vSwitch, the decision to offload flows should made in user-space where
>>>flows are already managed. And in that case datapath will not be
>>>transparently offloading of flows.  And thus flow offload may be performed
>>>independently of the kernel datapath, weather that be via flow manipulation
>>>portions of John's Flow API, TC, or some other means.
>>
>> Well, on netdev01, I believe that a consensus was reached that for every
>> switch offloaded functionality there has to be an implementation in
>> kernel. What John's Flow API originally did was to provide a way to
>> configure hardware independently of kernel. So the right way is to
>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>
>> In this case, seems to me logical to offload from one place, that being
>> TC. The reason is, as I stated above, the possible conversion from OVS
>> datapath to TC.
>>
>Sorry if I'm asking dumb questions, but this is about where I usually
>start to get lost in these discussions ;-). Is the aim of switch
>offload to offload OVS or kernel functions of routing, iptables, tc,
>etc.? These are very different I believe. As far as I can tell OVS
>model of "flows" (like Openflow) is currently incompatible with the
>rest of the kernel. So if the plan is convert OVS datapath to TC does
>that mean introducing that model into core kernel?

The thing is that you can achieve very similar model as OVS with TC.
OVS uses rx_handler.
TC uses handle_ing hook.
Those are in the same place in the receive path.
After that, ovs processes skb through key matches, and does some actions.
The same is done in TC cls_* and act_*.
Finally skb is forwarded to some netdev by dev_queue_xmit (in both OVS
and TC).

I certainly simplified things. But I do not see the different model you
are talking about.

Jiri

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
       [not found]           ` <CAGpadYHfNcDR2ojubkCJ8-nJTQkdLkPsAwJu0wOKU82bLDzhww@mail.gmail.com>
@ 2015-02-26 16:33             ` Thomas Graf
  2015-02-26 16:53             ` John Fastabend
  1 sibling, 0 replies; 53+ messages in thread
From: Thomas Graf @ 2015-02-26 16:33 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: John Fastabend, Jiri Pirko, Simon Horman, netdev, davem, nhorman,
	andy, dborkman, ogerlitz, jesse, jpettit, joestringer, jhs,
	sfeldma, f.fainelli, roopa, linville, gospo, bcrl

On 02/26/15 at 07:51am, Shrijeet Mukherjee wrote:
> That is the un-offload device I was referencing. If we standardize and
> implicitly make the available .. all packets that are needing to be sent to
> a construct that is not readily availble in hardware goes to this VSI and
> then software fwded. I am saying though that when this path is invoked the
> path after the VSI is not offloaded.

Agreed. I think John was pointing out how this is achievable with the
flow API while integrating into existing kernel concepts.

The point I was making is that if you have iptables ACLs that you want
offloaded you need all dependencies leading up to that iptables rule to
be programmed into hardware as well. A VRF implemented as netns is a
typical example for this but any form of network virtualization will
require this.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
       [not found]           ` <CAGpadYHfNcDR2ojubkCJ8-nJTQkdLkPsAwJu0wOKU82bLDzhww@mail.gmail.com>
  2015-02-26 16:33             ` Thomas Graf
@ 2015-02-26 16:53             ` John Fastabend
  1 sibling, 0 replies; 53+ messages in thread
From: John Fastabend @ 2015-02-26 16:53 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: Thomas Graf, Jiri Pirko, Simon Horman, netdev, davem, nhorman,
	andy, dborkman, ogerlitz, jesse, jpettit, joestringer, jhs,
	sfeldma, f.fainelli, roopa, linville, gospo, bcrl

On 02/26/2015 07:51 AM, Shrijeet Mukherjee wrote:
> 
> 
> On Thursday, February 26, 2015, John Fastabend <john.r.fastabend@intel.com <javascript:_e(%7B%7D,'cvml','john.r.fastabend@intel.com');>> wrote:
> 
>     On 02/26/2015 07:25 AM, Shrijeet Mukherjee wrote:
>     >     However, for certain datacenter server use cases we actually have the
>     >     full user intent in user space as we configure all of the kernel
>     >     subsystems from a single central management agent running locally
>     >     on the server (OpenStack, Kubernetes, Mesos, ...), i.e. we do know
>     >     exactly what the user wants on the system as a whole. This intent is
>     >     then split into small configuration pieces to configure iptables, tc,
>     >     routes on multiple net namespaces (for example to implement VRF).
>     >
>     >     E.g. A VRF in software would make use of net namespaces which holds
>     >     tenant specific ACLs, routes and QoS settings. A separate action
>     >     would fwd packets to the namespace. Easy and straight forward in
>     >     software. OTOH, the hardware, capable of implementing the ACLs,
>     >     would also need to know about the tc action which selected the
>     >     namespace when attempting to offload the ACL as it would otherwise
>     >     ACLs to wrong packets.
>     >
>     >
>     > This is a new angle that I believe we have talked around in the context of user space policy, but not really considered.
>     >
>     > So the issue is what if you have a classifier and forward action which points to a device which the element doing the classification does not have access to right ?
>     >
>     > This problem obliquely showed up in the context of route table entries not in the "external" table but present in the software tables as well.
>     >
>     > Maybe the scheme requires an implicit "send to software" device which then diverts traffic to the right place ? Would creating an implicit, un-offload device address these concerns ?
> 
>     So I think there is a relatively simple solution for this. Assuming
>     I read the description correctly namely packet ingress' nic/switch
>     and you want it to land in a namespace.
> 
>     Today we support offloaded macvlan's and SR-IOV. What I would expect
>     is user creates a set of macvlan's that are "offloaded" this just means
>     they are bound to a set of hardware queues and do not go through the
>     normal receive path. Then assigning these to a namespace is the same
>     as any other netdev.
> 
>     Hardware has an action to forward to "VSI" (virtual station interface)
>     which matches on a packet and forwards it to either a VF or set of
>     queues bound to a macvlan. Or you can do the forwarding using standards
>     based protocols such as EVB (edge virtual bridging).
> 
>     So its a simple set of steps with the flow api,
> 
>             1. create macvlan with dfwd_offload set
>             2. push netdev into namespace
>             3. add flow rule to match traffic and send to VSI
>                     ./flow -i ethx set_rule match xyz action fwd_vsi 3
> 
>     The VSI# is reported by ip link today its a bit clumsy so that interface
>     could be cleaned up.
> 
>     Here is a case where trying to map this onto a 'tc' action in software
>     is a bit awkward and you convoluted what is really a simple operation.
>     Anyways this is not really an "offload" in the sense that your taking
>     something that used to run in software and moving it 1:1 into hardware.
>     Adding SR-IOV/VMDQ support requries new constructs. By the way if you
>     don't like my "flow" tool and you want to move it onto "tc" that could
>     be done as well but the steps are the same.
> 
>     .John
> 
> 
> +1 
> 
> That is the un-offload device I was referencing. If we standardize
> and implicitly make the available .. all packets that are needing to
> be sent to a construct that is not readily availble in hardware goes
> to this VSI and then software fwded. I am saying though that when
> this path is invoked the path after the VSI is not offloaded.

Right, also the VSI may be the endpoint of the traffic. It could be
a VM for example or an application that is using the TCAM to offload
classification and data structures that are CPU expensive. In these
examples there is no software fwd path.

.John

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 13:33     ` Thomas Graf
  2015-02-26 15:23       ` John Fastabend
       [not found]       ` <CAGpadYGrjfkZqe0k7D05+cy3pY=1hXZtQqtV0J-8ogU80K7BUQ@mail.gmail.com>
@ 2015-02-26 17:38       ` David Ahern
  2 siblings, 0 replies; 53+ messages in thread
From: David Ahern @ 2015-02-26 17:38 UTC (permalink / raw)
  To: Thomas Graf, Jiri Pirko
  Cc: Simon Horman, netdev, davem, nhorman, andy, dborkman, ogerlitz,
	jesse, jpettit, joestringer, john.r.fastabend, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On 2/26/15 6:33 AM, Thomas Graf wrote:
> E.g. A VRF in software would make use of net namespaces which holds
> tenant specific ACLs, routes and QoS settings. A separate action
> would fwd packets to the namespace. Easy and straight forward in
> software.

namespace == L1 separation
VRF == L3 separation

Why is there an insistence that an L1 construct is appropriate for L3 
isolation? Has anyone other than 6wind actually done a 1000+ VRFs with 
the Linux stack?

David

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 16:17       ` Jiri Pirko
@ 2015-02-26 18:15         ` Tom Herbert
  2015-02-26 19:05           ` Thomas Graf
                             ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Tom Herbert @ 2015-02-26 18:15 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Simon Horman, Linux Netdev List, David Miller, Neil Horman,
	Andy Gospodarek, Thomas Graf, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, John Fastabend,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, shrijeet, Andy Gospodarek, bcrl

On Thu, Feb 26, 2015 at 8:17 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Thu, Feb 26, 2015 at 05:04:31PM CET, therbert@google.com wrote:
>>On Thu, Feb 26, 2015 at 1:16 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> Thu, Feb 26, 2015 at 09:38:01AM CET, simon.horman@netronome.com wrote:
>>>>Hi Jiri,
>>>>
>>>>On Thu, Feb 26, 2015 at 08:42:14AM +0100, Jiri Pirko wrote:
>>>>> Hello everyone.
>>>>>
>>>>> I would like to discuss big next step for switch offloading. Probably
>>>>> the most complicated one we have so far. That is to be able to offload flows.
>>>>> Leaving nftables aside for a moment, I see 2 big usecases:
>>>>> - TC filters and actions offload.
>>>>> - OVS key match and actions offload.
>>>>>
>>>>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>>>>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>>>>> will not longer be needed and we'll get it for free with TC offload
>>>>> implementation. So we can focus on TC now.
>>>>>
>>>>> Here is my list of actions to achieve some results in near future:
>>>>> 1) finish cls_openflow classifier and iproute part of it
>>>>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>>>>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>>>>> 4) improve cls_openflow performance (hashtables etc)
>>>>> 5) improve TC subsystem performance in both slow and fast path
>>>>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>>>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>>>     -ingress qdisc attach, act_mirred target
>>>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>>>> 8) implement TC act_mpls
>>>>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>>>>
>>>>> This is my personal action list, but you are *very welcome* to step in to help.
>>>>> Point 2) haunts me at night....
>>>>> I believe that John is already working on 2) and part of 3).
>>>>>
>>>>> What do you think?
>>>>
>>> >From my point of view the question of replacing the kernel datapath with TC
>>>>is orthogonal to the question of flow offloads. This is because I believe
>>>>there is some consensus around the idea that, at least in the case of Open
>>>>vSwitch, the decision to offload flows should made in user-space where
>>>>flows are already managed. And in that case datapath will not be
>>>>transparently offloading of flows.  And thus flow offload may be performed
>>>>independently of the kernel datapath, weather that be via flow manipulation
>>>>portions of John's Flow API, TC, or some other means.
>>>
>>> Well, on netdev01, I believe that a consensus was reached that for every
>>> switch offloaded functionality there has to be an implementation in
>>> kernel. What John's Flow API originally did was to provide a way to
>>> configure hardware independently of kernel. So the right way is to
>>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>>
>>> In this case, seems to me logical to offload from one place, that being
>>> TC. The reason is, as I stated above, the possible conversion from OVS
>>> datapath to TC.
>>>
>>Sorry if I'm asking dumb questions, but this is about where I usually
>>start to get lost in these discussions ;-). Is the aim of switch
>>offload to offload OVS or kernel functions of routing, iptables, tc,
>>etc.? These are very different I believe. As far as I can tell OVS
>>model of "flows" (like Openflow) is currently incompatible with the
>>rest of the kernel. So if the plan is convert OVS datapath to TC does
>>that mean introducing that model into core kernel?
>
> The thing is that you can achieve very similar model as OVS with TC.
> OVS uses rx_handler.
> TC uses handle_ing hook.
> Those are in the same place in the receive path.
> After that, ovs processes skb through key matches, and does some actions.
> The same is done in TC cls_* and act_*.
> Finally skb is forwarded to some netdev by dev_queue_xmit (in both OVS
> and TC).
>
> I certainly simplified things. But I do not see the different model you
> are talking about.
>
But, routing (aka switching) in the stack is not configured through
TC. We have a whole forwarding and routing infrastructure (eg.
iproute) with optimizations that  allow routes to be cached in
sockets, etc. To me, it seems like offloading that basic functionality
is a prerequisite before attempting to offload more advanced policy
mechanisms of TC, netfilter, etc.

Tom

> Jiri

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 16:04     ` Tom Herbert
  2015-02-26 16:17       ` Jiri Pirko
@ 2015-02-26 18:16       ` Scott Feldman
  1 sibling, 0 replies; 53+ messages in thread
From: Scott Feldman @ 2015-02-26 18:16 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jiri Pirko, Simon Horman, Linux Netdev List, David Miller,
	Neil Horman, Andy Gospodarek, Thomas Graf, Daniel Borkmann,
	Or Gerlitz, Jesse Gross, jpettit, Joe Stringer, John Fastabend,
	Jamal Hadi Salim, Florian Fainelli, Roopa Prabhu, John Linville,
	Shrijeet Mukherjee, Andy Gospodarek, Benjamin LaHaise

On Thu, Feb 26, 2015 at 8:04 AM, Tom Herbert <therbert@google.com> wrote:

> Sorry if I'm asking dumb questions, but this is about where I usually
> start to get lost in these discussions ;-). Is the aim of switch
> offload to offload OVS or kernel functions of routing, iptables, tc,
> etc.?

Both, which is why it's hard to follow, and there are active efforts
on both fronts.  What has stuck so far with switchdev is offloading
kernel functions, such as L2 bridging to existing switch chips (real
(DSA) or fake (rocker)).  WIP on extended that to L3 and maybe a
subset of nftables.  So far, the kernel is the "application", and
we're offloading (sorry, overloaded term, but I can't think of an
alternative) the kernel fwd data paths to capable hardware.  But there
are "impedance" mismatches between what we can do today in the kernel
and fixed hardware.  On the other front we have offloading OVS, or
more in more general terms, programming the switch from some other
"application", local or remote using the proposed flow api or tc or
some combination.  Maybe these two fronts merge down the road where
the kernel is just another application using the flow api.  Grand
unification theory?


> These are very different I believe. As far as I can tell OVS
> model of "flows" (like Openflow) is currently incompatible with the
> rest of the kernel.
> So if the plan is convert OVS datapath to TC does
> that mean introducing that model into core kernel?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 18:15         ` Tom Herbert
@ 2015-02-26 19:05           ` Thomas Graf
  2015-02-27  9:00           ` Jiri Pirko
  2015-02-28 20:02           ` David Miller
  2 siblings, 0 replies; 53+ messages in thread
From: Thomas Graf @ 2015-02-26 19:05 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jiri Pirko, Simon Horman, Linux Netdev List, David Miller,
	Neil Horman, Andy Gospodarek, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, John Fastabend,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, shrijeet, Andy Gospodarek, bcrl

On 02/26/15 at 10:15am, Tom Herbert wrote:
> But, routing (aka switching) in the stack is not configured through
> TC. We have a whole forwarding and routing infrastructure (eg.
> iproute) with optimizations that  allow routes to be cached in
> sockets, etc. To me, it seems like offloading that basic functionality
> is a prerequisite before attempting to offload more advanced policy
> mechanisms of TC, netfilter, etc.

It is visible that you are coming from container focused world
with sockets on the host ;-) The L3 offload desire comes primiarly
from a VM centric host where decap+L3 offload to a VF allows to
spend zero cycles on the host kernel. I'm not sure if offload
makes sense at all for container workloads w/o VM isolation per
tentant or orch system or something similar.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26  7:42 Flows! Offload them Jiri Pirko
                   ` (2 preceding siblings ...)
  2015-02-26 12:51 ` Thomas Graf
@ 2015-02-26 19:32 ` Florian Fainelli
  2015-02-26 20:58   ` John Fastabend
  3 siblings, 1 reply; 53+ messages in thread
From: Florian Fainelli @ 2015-02-26 19:32 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, jpettit,
	joestringer, john.r.fastabend, jhs, sfeldma, roopa, linville,
	simon.horman, shrijeet, gospo, bcrl

Hi Jiri,

On 25/02/15 23:42, Jiri Pirko wrote:
> Hello everyone.
> 
> I would like to discuss big next step for switch offloading. Probably
> the most complicated one we have so far. That is to be able to offload flows.
> Leaving nftables aside for a moment, I see 2 big usecases:
> - TC filters and actions offload.
> - OVS key match and actions offload.
> 
> I think it might sense to ignore OVS for now. The reason is ongoing efford
> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
> will not longer be needed and we'll get it for free with TC offload
> implementation. So we can focus on TC now.

What is not necessarily clear to me, is if we leave nftables aside for
now from flow offloading, does that mean the entire flow offloading will
now be controlled and going with the TC subsystem necessarily?

I am not questioning the choice for TC, I am just wondering if
ultimately there is the need for a lower layer, which is below, such
that both tc and e.g: nftables can benefit from it?

I guess my larger question is, if I need to learn about new flows
entering the stack, how is that going to wind-up looking like?

> 
> Here is my list of actions to achieve some results in near future:
> 1) finish cls_openflow classifier and iproute part of it
> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
> 3) use rocker to provide offload for cls_openflow and couple of selected actions
> 4) improve cls_openflow performance (hashtables etc)
> 5) improve TC subsystem performance in both slow and fast path
>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
> 6) implement "named sockets" (working name) and implement TC support for that
>     -ingress qdisc attach, act_mirred target
> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
> 8) implement TC act_mpls
> 9) suggest to switch OVS userspace from OVS genl to TC API
> 
> This is my personal action list, but you are *very welcome* to step in to help.
> Point 2) haunts me at night....
> I believe that John is already working on 2) and part of 3).
> 
> What do you think?
-- 
Florian

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 15:23       ` John Fastabend
@ 2015-02-26 20:16         ` Neil Horman
  2015-02-26 21:11           ` John Fastabend
  2015-02-26 21:52           ` Simon Horman
  0 siblings, 2 replies; 53+ messages in thread
From: Neil Horman @ 2015-02-26 20:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiri Pirko, Simon Horman, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
> On 02/26/2015 05:33 AM, Thomas Graf wrote:
> > On 02/26/15 at 10:16am, Jiri Pirko wrote:
> >> Well, on netdev01, I believe that a consensus was reached that for every
> >> switch offloaded functionality there has to be an implementation in
> >> kernel.
> > 
> > Agreed. This should not prevent the policy being driven from user
> > space though.
> > 
> >> What John's Flow API originally did was to provide a way to
> >> configure hardware independently of kernel. So the right way is to
> >> configure kernel and, if hw allows it, to offload the configuration to hw.
> >>
> >> In this case, seems to me logical to offload from one place, that being
> >> TC. The reason is, as I stated above, the possible conversion from OVS
> >> datapath to TC.
> > 
> > Offloading of TC definitely makes a lot of sense. I think that even in
> > that case you will already encounter independent configuration of
> > hardware and kernel. Example: The hardware provides a fixed, generic
> > function to push up to n bytes onto a packet. This hardware function
> > could be used to implement TC actions "push_vlan", "push_vxlan",
> > "push_mpls". You would you would likely agree that TC should make use
> > of such a function even if the hardware version is different from the
> > software version. So I don't think we'll have a 1:1 mapping for all
> > configurations, regardless of whether the how is decided in kernel or
> > user space.
> 
> Just to expand slightly on this. I don't think you can get to a 1:1
> mapping here. One reason is hardware typically has a TCAM and limited
> size. So you need a _policy_ to determine when to push rules into the
> hardware. The kernel doesn't know when to do this and I don't believe
> its the kernel's place to start enforcing policy like this. One thing I likely
> need to do is get some more "worlds" in rocker so we aren't stuck only
> thinking about the infinite size OF_DPA world. The OF_DPA world is only
> one world and not a terribly flexible one at that when compared with the
> NPU folk. So minimally you need a flag to indicate rules go into hardware
> vs software.
> 
> That said I think the bigger mismatch between software and hardware is
> you program it differently because the data structures are different. Maybe
> a u32 example would help. For parsing with u32 you might build a parse
> graph with a root and some leaf nodes. In hardware you want to collapse
> this down onto the hardware. I argue this is not a kernel task because
> there are lots of ways to do this and there are trade-offs made with
> respect to space and performance and which table to use when it could be
> handled by a set of tables. Another example is a virtual switch possibly
> OVS but we have others. The software does some "unmasking" (there term)
> before sending the rules into the software dataplane cache. Basically this
> means we can ignore priority in the hash lookup. However this is not how you
> would optimally use hardware. Maybe I should do another write up with
> some more concrete examples.
> 
> There are also lots of use cases to _not_ have hardware and software in
> sync. A flag allows this.
> 
> My only point is I think we need to allow users to optimally use there
> hardware either via 'tc' or my previous 'flow' tool. Actually in my
> opinion I still think its best to have both interfaces.
> 
> I'll go get some coffee now and hopefully that is somewhat clear.


I've been thinking about the policy apect of this, and the more I think about
it, the more I wonder if not allowing some sort of common policy in the kernel
is really the right thing to do here.  I know thats somewhat blasphemous, but
this isn't really administrative poilcy that we're talking about, at least not
100%.  Its more of a behavioral profile that we're trying to enforce.  That may
be splitting hairs, but I think theres precidence for the latter.  That is to
say, we configure qdiscs to limit traffic flow to certain rates, and configure
policies which drop traffic that violates it (which includes random discard,
which is the antithesis of deterministic policy).  I'm not sure I see this as
any different, espcially if we limit its scope.  That is to say, why couldn't we
allow the kernel to program a predetermined set of policies that the admin can
set (i.e. offload routing to a hardware cache of X size with an lru
victimization).  If other well defined policies make sense, we can add them and
exposes options via iproute2 or some such to set them.  For the use case where
such pre-packaged policies don't make sense, we have things like the flow api to
offer users who want to be able to control their hardware in a more fine grained
approach.

Neil

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 19:32 ` Florian Fainelli
@ 2015-02-26 20:58   ` John Fastabend
  2015-02-26 21:45     ` Florian Fainelli
  2015-02-27 14:01     ` Driver level interface WAS(Re: " Jamal Hadi Salim
  0 siblings, 2 replies; 53+ messages in thread
From: John Fastabend @ 2015-02-26 20:58 UTC (permalink / raw)
  To: Florian Fainelli, Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, jpettit,
	joestringer, jhs, sfeldma, roopa, linville, simon.horman,
	shrijeet, gospo, bcrl

On 02/26/2015 11:32 AM, Florian Fainelli wrote:
> Hi Jiri,
> 
> On 25/02/15 23:42, Jiri Pirko wrote:
>> Hello everyone.
>>
>> I would like to discuss big next step for switch offloading. Probably
>> the most complicated one we have so far. That is to be able to offload flows.
>> Leaving nftables aside for a moment, I see 2 big usecases:
>> - TC filters and actions offload.
>> - OVS key match and actions offload.
>>
>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>> will not longer be needed and we'll get it for free with TC offload
>> implementation. So we can focus on TC now.
> 
> What is not necessarily clear to me, is if we leave nftables aside for
> now from flow offloading, does that mean the entire flow offloading will
> now be controlled and going with the TC subsystem necessarily?
> 
> I am not questioning the choice for TC, I am just wondering if
> ultimately there is the need for a lower layer, which is below, such
> that both tc and e.g: nftables can benefit from it?

My thinking on this is to use the FlowAPI ndo_ops as the bottom layer.
What I would much prefer (having to actually write drivers) is that
we have one API to the driver and tc, nft, whatever map onto that API.

Then my driver implements a ndo_set_flow op and a ndo_del_flow op. What
I'm working on now is the map from tc onto the flow API I'm hoping this
sounds like a good idea to folks.

Neil, suggested we might need a reservation concept where tc can reserve
some space in a TCAM, similarly nft can reserve some space. Also I have
applications in user space that want to reserve some space to offload
their specific data structures. This idea seems like a good one to me.

> 
> I guess my larger question is, if I need to learn about new flows
> entering the stack, how is that going to wind-up looking like?
> 
>>
>> Here is my list of actions to achieve some results in near future:
>> 1) finish cls_openflow classifier and iproute part of it
>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>> 4) improve cls_openflow performance (hashtables etc)
>> 5) improve TC subsystem performance in both slow and fast path
>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>> 6) implement "named sockets" (working name) and implement TC support for that
>>     -ingress qdisc attach, act_mirred target
>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>> 8) implement TC act_mpls
>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>
>> This is my personal action list, but you are *very welcome* to step in to help.
>> Point 2) haunts me at night....
>> I believe that John is already working on 2) and part of 3).
>>
>> What do you think?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 20:16         ` Neil Horman
@ 2015-02-26 21:11           ` John Fastabend
  2015-02-27  1:17             ` Neil Horman
  2015-02-27  8:53             ` Jiri Pirko
  2015-02-26 21:52           ` Simon Horman
  1 sibling, 2 replies; 53+ messages in thread
From: John Fastabend @ 2015-02-26 21:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Thomas Graf, Jiri Pirko, Simon Horman, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On 02/26/2015 12:16 PM, Neil Horman wrote:
> On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
>> On 02/26/2015 05:33 AM, Thomas Graf wrote:
>>> On 02/26/15 at 10:16am, Jiri Pirko wrote:
>>>> Well, on netdev01, I believe that a consensus was reached that for every
>>>> switch offloaded functionality there has to be an implementation in
>>>> kernel.
>>>
>>> Agreed. This should not prevent the policy being driven from user
>>> space though.
>>>
>>>> What John's Flow API originally did was to provide a way to
>>>> configure hardware independently of kernel. So the right way is to
>>>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>>>
>>>> In this case, seems to me logical to offload from one place, that being
>>>> TC. The reason is, as I stated above, the possible conversion from OVS
>>>> datapath to TC.
>>>
>>> Offloading of TC definitely makes a lot of sense. I think that even in
>>> that case you will already encounter independent configuration of
>>> hardware and kernel. Example: The hardware provides a fixed, generic
>>> function to push up to n bytes onto a packet. This hardware function
>>> could be used to implement TC actions "push_vlan", "push_vxlan",
>>> "push_mpls". You would you would likely agree that TC should make use
>>> of such a function even if the hardware version is different from the
>>> software version. So I don't think we'll have a 1:1 mapping for all
>>> configurations, regardless of whether the how is decided in kernel or
>>> user space.
>>
>> Just to expand slightly on this. I don't think you can get to a 1:1
>> mapping here. One reason is hardware typically has a TCAM and limited
>> size. So you need a _policy_ to determine when to push rules into the
>> hardware. The kernel doesn't know when to do this and I don't believe
>> its the kernel's place to start enforcing policy like this. One thing I likely
>> need to do is get some more "worlds" in rocker so we aren't stuck only
>> thinking about the infinite size OF_DPA world. The OF_DPA world is only
>> one world and not a terribly flexible one at that when compared with the
>> NPU folk. So minimally you need a flag to indicate rules go into hardware
>> vs software.
>>
>> That said I think the bigger mismatch between software and hardware is
>> you program it differently because the data structures are different. Maybe
>> a u32 example would help. For parsing with u32 you might build a parse
>> graph with a root and some leaf nodes. In hardware you want to collapse
>> this down onto the hardware. I argue this is not a kernel task because
>> there are lots of ways to do this and there are trade-offs made with
>> respect to space and performance and which table to use when it could be
>> handled by a set of tables. Another example is a virtual switch possibly
>> OVS but we have others. The software does some "unmasking" (there term)
>> before sending the rules into the software dataplane cache. Basically this
>> means we can ignore priority in the hash lookup. However this is not how you
>> would optimally use hardware. Maybe I should do another write up with
>> some more concrete examples.
>>
>> There are also lots of use cases to _not_ have hardware and software in
>> sync. A flag allows this.
>>
>> My only point is I think we need to allow users to optimally use there
>> hardware either via 'tc' or my previous 'flow' tool. Actually in my
>> opinion I still think its best to have both interfaces.
>>
>> I'll go get some coffee now and hopefully that is somewhat clear.
> 
> 
> I've been thinking about the policy apect of this, and the more I think about
> it, the more I wonder if not allowing some sort of common policy in the kernel
> is really the right thing to do here.  I know thats somewhat blasphemous, but
> this isn't really administrative poilcy that we're talking about, at least not
> 100%.  Its more of a behavioral profile that we're trying to enforce.  That may
> be splitting hairs, but I think theres precidence for the latter.  That is to
> say, we configure qdiscs to limit traffic flow to certain rates, and configure
> policies which drop traffic that violates it (which includes random discard,
> which is the antithesis of deterministic policy).  I'm not sure I see this as
> any different, espcially if we limit its scope.  That is to say, why couldn't we
> allow the kernel to program a predetermined set of policies that the admin can
> set (i.e. offload routing to a hardware cache of X size with an lru
> victimization).  If other well defined policies make sense, we can add them and
> exposes options via iproute2 or some such to set them.  For the use case where
> such pre-packaged policies don't make sense, we have things like the flow api to
> offer users who want to be able to control their hardware in a more fine grained
> approach.
> 
> Neil
> 

Hi Neil,

I actually like this idea a lot. I might tweak a bit in that we could have some
feature bits or something like feature bits that expose how to split up the
hardware cache and give sizes.

So the hypervisor (see I think of end hosts) or administrators could come in and
say I want a route table and a nft table. This creates a "flavor" over how the
hardware is going to be used. Another use case may not be doing routing at all
but have an application that wants to manage the hardware at a more fine grained
level with the exception of some nft commands so it could have a "nft"+"flow"
flavor. Insert your favorite use case here.

.John

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 20:58   ` John Fastabend
@ 2015-02-26 21:45     ` Florian Fainelli
  2015-02-26 23:06       ` John Fastabend
  2015-02-27 18:37       ` Neil Horman
  2015-02-27 14:01     ` Driver level interface WAS(Re: " Jamal Hadi Salim
  1 sibling, 2 replies; 53+ messages in thread
From: Florian Fainelli @ 2015-02-26 21:45 UTC (permalink / raw)
  To: John Fastabend, Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, jpettit,
	joestringer, jhs, sfeldma, roopa, linville, simon.horman,
	shrijeet, gospo, bcrl

On 26/02/15 12:58, John Fastabend wrote:
> On 02/26/2015 11:32 AM, Florian Fainelli wrote:
>> Hi Jiri,
>>
>> On 25/02/15 23:42, Jiri Pirko wrote:
>>> Hello everyone.
>>>
>>> I would like to discuss big next step for switch offloading. Probably
>>> the most complicated one we have so far. That is to be able to offload flows.
>>> Leaving nftables aside for a moment, I see 2 big usecases:
>>> - TC filters and actions offload.
>>> - OVS key match and actions offload.
>>>
>>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>>> will not longer be needed and we'll get it for free with TC offload
>>> implementation. So we can focus on TC now.
>>
>> What is not necessarily clear to me, is if we leave nftables aside for
>> now from flow offloading, does that mean the entire flow offloading will
>> now be controlled and going with the TC subsystem necessarily?
>>
>> I am not questioning the choice for TC, I am just wondering if
>> ultimately there is the need for a lower layer, which is below, such
>> that both tc and e.g: nftables can benefit from it?
> 
> My thinking on this is to use the FlowAPI ndo_ops as the bottom layer.
> What I would much prefer (having to actually write drivers) is that
> we have one API to the driver and tc, nft, whatever map onto that API.

Ok, I think this is indeed the right approach.

> 
> Then my driver implements a ndo_set_flow op and a ndo_del_flow op. What
> I'm working on now is the map from tc onto the flow API I'm hoping this
> sounds like a good idea to folks.

Sounds good to me.

> 
> Neil, suggested we might need a reservation concept where tc can reserve
> some space in a TCAM, similarly nft can reserve some space. Also I have
> applications in user space that want to reserve some space to offload
> their specific data structures. This idea seems like a good one to me.

Humm, I guess the question is how and when do we do this reservation, is
it upon first potential access from e.g: tc or nft to an offloading
capable hardware, and if so, upon first attempt to offload an operation?

If we are to interface with a TCAM, some operations might require more
slices than others, which will limit the number of actions available,
but it is hard to know ahead of time.

> 
>>
>> I guess my larger question is, if I need to learn about new flows
>> entering the stack, how is that going to wind-up looking like?
>>
>>>
>>> Here is my list of actions to achieve some results in near future:
>>> 1) finish cls_openflow classifier and iproute part of it
>>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>>> 4) improve cls_openflow performance (hashtables etc)
>>> 5) improve TC subsystem performance in both slow and fast path
>>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>     -ingress qdisc attach, act_mirred target
>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>> 8) implement TC act_mpls
>>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>>
>>> This is my personal action list, but you are *very welcome* to step in to help.
>>> Point 2) haunts me at night....
>>> I believe that John is already working on 2) and part of 3).
>>>
>>> What do you think?
> 


-- 
Florian

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 20:16         ` Neil Horman
  2015-02-26 21:11           ` John Fastabend
@ 2015-02-26 21:52           ` Simon Horman
  2015-02-27  1:22             ` Neil Horman
  1 sibling, 1 reply; 53+ messages in thread
From: Simon Horman @ 2015-02-26 21:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: John Fastabend, Thomas Graf, Jiri Pirko, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On Thu, Feb 26, 2015 at 03:16:35PM -0500, Neil Horman wrote:
> On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
> > On 02/26/2015 05:33 AM, Thomas Graf wrote:
> > > On 02/26/15 at 10:16am, Jiri Pirko wrote:
> > >> Well, on netdev01, I believe that a consensus was reached that for every
> > >> switch offloaded functionality there has to be an implementation in
> > >> kernel.
> > > 
> > > Agreed. This should not prevent the policy being driven from user
> > > space though.
> > > 
> > >> What John's Flow API originally did was to provide a way to
> > >> configure hardware independently of kernel. So the right way is to
> > >> configure kernel and, if hw allows it, to offload the configuration to hw.
> > >>
> > >> In this case, seems to me logical to offload from one place, that being
> > >> TC. The reason is, as I stated above, the possible conversion from OVS
> > >> datapath to TC.
> > > 
> > > Offloading of TC definitely makes a lot of sense. I think that even in
> > > that case you will already encounter independent configuration of
> > > hardware and kernel. Example: The hardware provides a fixed, generic
> > > function to push up to n bytes onto a packet. This hardware function
> > > could be used to implement TC actions "push_vlan", "push_vxlan",
> > > "push_mpls". You would you would likely agree that TC should make use
> > > of such a function even if the hardware version is different from the
> > > software version. So I don't think we'll have a 1:1 mapping for all
> > > configurations, regardless of whether the how is decided in kernel or
> > > user space.
> > 
> > Just to expand slightly on this. I don't think you can get to a 1:1
> > mapping here. One reason is hardware typically has a TCAM and limited
> > size. So you need a _policy_ to determine when to push rules into the
> > hardware. The kernel doesn't know when to do this and I don't believe
> > its the kernel's place to start enforcing policy like this. One thing I likely
> > need to do is get some more "worlds" in rocker so we aren't stuck only
> > thinking about the infinite size OF_DPA world. The OF_DPA world is only
> > one world and not a terribly flexible one at that when compared with the
> > NPU folk. So minimally you need a flag to indicate rules go into hardware
> > vs software.
> > 
> > That said I think the bigger mismatch between software and hardware is
> > you program it differently because the data structures are different. Maybe
> > a u32 example would help. For parsing with u32 you might build a parse
> > graph with a root and some leaf nodes. In hardware you want to collapse
> > this down onto the hardware. I argue this is not a kernel task because
> > there are lots of ways to do this and there are trade-offs made with
> > respect to space and performance and which table to use when it could be
> > handled by a set of tables. Another example is a virtual switch possibly
> > OVS but we have others. The software does some "unmasking" (there term)
> > before sending the rules into the software dataplane cache. Basically this
> > means we can ignore priority in the hash lookup. However this is not how you
> > would optimally use hardware. Maybe I should do another write up with
> > some more concrete examples.
> > 
> > There are also lots of use cases to _not_ have hardware and software in
> > sync. A flag allows this.
> > 
> > My only point is I think we need to allow users to optimally use there
> > hardware either via 'tc' or my previous 'flow' tool. Actually in my
> > opinion I still think its best to have both interfaces.
> > 
> > I'll go get some coffee now and hopefully that is somewhat clear.
> 
> 
> I've been thinking about the policy apect of this, and the more I think
> about it, the more I wonder if not allowing some sort of common policy in
> the kernel is really the right thing to do here.  I know thats somewhat
> blasphemous, but this isn't really administrative poilcy that we're
> talking about, at least not 100%.  Its more of a behavioral profile that
> we're trying to enforce.  That may be splitting hairs, but I think theres
> precidence for the latter.  That is to say, we configure qdiscs to limit
> traffic flow to certain rates, and configure policies which drop traffic
> that violates it (which includes random discard, which is the antithesis
> of deterministic policy).  I'm not sure I see this as any different,
> espcially if we limit its scope.  That is to say, why couldn't we allow
> the kernel to program a predetermined set of policies that the admin can
> set (i.e. offload routing to a hardware cache of X size with an lru
> victimization).  If other well defined policies make sense, we can add
> them and exposes options via iproute2 or some such to set them.  For the
> use case where such pre-packaged policies don't make sense, we have
> things like the flow api to offer users who want to be able to control
> their hardware in a more fine grained approach.

In general I agree that it makes sense to have have sane offload policy
in the kernel and provide a mechanism to override that. Things that already
work should continue to work: just faster or with fewer CPU cycles consumed.

I am, however, not entirely convinced that it is always possible to
implement such a sane default policy that is worth the code complexity -
I'm thinking in particular of Open vSwitch where management of flows is
already in user-space.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 21:45     ` Florian Fainelli
@ 2015-02-26 23:06       ` John Fastabend
  2015-02-27 18:37       ` Neil Horman
  1 sibling, 0 replies; 53+ messages in thread
From: John Fastabend @ 2015-02-26 23:06 UTC (permalink / raw)
  To: Florian Fainelli, Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, jpettit,
	joestringer, jhs, sfeldma, roopa, linville, simon.horman,
	shrijeet, gospo, bcrl

On 02/26/2015 01:45 PM, Florian Fainelli wrote:
> On 26/02/15 12:58, John Fastabend wrote:
>> On 02/26/2015 11:32 AM, Florian Fainelli wrote:
>>> Hi Jiri,
>>>
>>> On 25/02/15 23:42, Jiri Pirko wrote:
>>>> Hello everyone.
>>>>
>>>> I would like to discuss big next step for switch offloading. Probably
>>>> the most complicated one we have so far. That is to be able to offload flows.
>>>> Leaving nftables aside for a moment, I see 2 big usecases:
>>>> - TC filters and actions offload.
>>>> - OVS key match and actions offload.
>>>>
>>>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>>>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>>>> will not longer be needed and we'll get it for free with TC offload
>>>> implementation. So we can focus on TC now.
>>>
>>> What is not necessarily clear to me, is if we leave nftables aside for
>>> now from flow offloading, does that mean the entire flow offloading will
>>> now be controlled and going with the TC subsystem necessarily?
>>>
>>> I am not questioning the choice for TC, I am just wondering if
>>> ultimately there is the need for a lower layer, which is below, such
>>> that both tc and e.g: nftables can benefit from it?
>>
>> My thinking on this is to use the FlowAPI ndo_ops as the bottom layer.
>> What I would much prefer (having to actually write drivers) is that
>> we have one API to the driver and tc, nft, whatever map onto that API.
> 
> Ok, I think this is indeed the right approach.
> 
>>
>> Then my driver implements a ndo_set_flow op and a ndo_del_flow op. What
>> I'm working on now is the map from tc onto the flow API I'm hoping this
>> sounds like a good idea to folks.
> 
> Sounds good to me.
> 
>>
>> Neil, suggested we might need a reservation concept where tc can reserve
>> some space in a TCAM, similarly nft can reserve some space. Also I have
>> applications in user space that want to reserve some space to offload
>> their specific data structures. This idea seems like a good one to me.
> 
> Humm, I guess the question is how and when do we do this reservation, is
> it upon first potential access from e.g: tc or nft to an offloading
> capable hardware, and if so, upon first attempt to offload an operation?

hmm I don't think this will work right because your nft configuration might
consume the entire tcam before 'tc' gets a chance to run.

> 
> If we are to interface with a TCAM, some operations might require more
> slices than others, which will limit the number of actions available,
> but it is hard to know ahead of time.

Right, one thing I've changed in the FlowAPI from the v3 I last sent is I
changed the ndo get ops to a model where the driver registers with the
kernel.

In v3 code the driver gave the model of the hardware how many tables it
has, what headers it supports, approximate size of each to the kernel
only when the kernel queried it. Now I have the driver call a register
routine at init time and the kernel runs some sanity checks on the model
to verify the actions/headers/tables are well formed. For example I check
all the actions match well-defined actions the kernel knows about to avoid
drivers exporting actions we can't understand.

Thinking out loud now but could we move this hardware table model register
hook to post init and have some configuration decide this? Maybe make the
configuration explicit from an API and change the reservation time from
module init time to later when userspace kicks it with a configuration.
Before this any calls into the driver will fail. We could add pre-defined
setup's that the init scripts could call for users who want a no-touch
switch system.

Another thought I think is worth noting is how we handle this today out
of kernel. We let the user define tables and give them labels via a create
command. This way the user can say

	" create table label acl use matches xyz actions abc min_size n"

or

	" create table label route use matches xyz actions abc min_size n"

and so on. This requires users to be knowledgeable enough to "know" how they
want to size their tables but gives the user flexibility to define this policy.
In practice though I don't think this is something you do on the cmd line its
probably a configuration pushed from a controller, libvirt or something. Its
part of the provisioning step on the system.

Thanks,
John


> 
>>
>>>
>>> I guess my larger question is, if I need to learn about new flows
>>> entering the stack, how is that going to wind-up looking like?
>>>
>>>>
>>>> Here is my list of actions to achieve some results in near future:
>>>> 1) finish cls_openflow classifier and iproute part of it
>>>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>>>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>>>> 4) improve cls_openflow performance (hashtables etc)
>>>> 5) improve TC subsystem performance in both slow and fast path
>>>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>>     -ingress qdisc attach, act_mirred target
>>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>>> 8) implement TC act_mpls
>>>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>>>
>>>> This is my personal action list, but you are *very welcome* to step in to help.
>>>> Point 2) haunts me at night....
>>>> I believe that John is already working on 2) and part of 3).
>>>>
>>>> What do you think?
>>
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 21:11           ` John Fastabend
@ 2015-02-27  1:17             ` Neil Horman
  2015-02-27  8:53             ` Jiri Pirko
  1 sibling, 0 replies; 53+ messages in thread
From: Neil Horman @ 2015-02-27  1:17 UTC (permalink / raw)
  To: John Fastabend
  Cc: Thomas Graf, Jiri Pirko, Simon Horman, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On Thu, Feb 26, 2015 at 01:11:23PM -0800, John Fastabend wrote:
> On 02/26/2015 12:16 PM, Neil Horman wrote:
> > On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
> >> On 02/26/2015 05:33 AM, Thomas Graf wrote:
> >>> On 02/26/15 at 10:16am, Jiri Pirko wrote:
> >>>> Well, on netdev01, I believe that a consensus was reached that for every
> >>>> switch offloaded functionality there has to be an implementation in
> >>>> kernel.
> >>>
> >>> Agreed. This should not prevent the policy being driven from user
> >>> space though.
> >>>
> >>>> What John's Flow API originally did was to provide a way to
> >>>> configure hardware independently of kernel. So the right way is to
> >>>> configure kernel and, if hw allows it, to offload the configuration to hw.
> >>>>
> >>>> In this case, seems to me logical to offload from one place, that being
> >>>> TC. The reason is, as I stated above, the possible conversion from OVS
> >>>> datapath to TC.
> >>>
> >>> Offloading of TC definitely makes a lot of sense. I think that even in
> >>> that case you will already encounter independent configuration of
> >>> hardware and kernel. Example: The hardware provides a fixed, generic
> >>> function to push up to n bytes onto a packet. This hardware function
> >>> could be used to implement TC actions "push_vlan", "push_vxlan",
> >>> "push_mpls". You would you would likely agree that TC should make use
> >>> of such a function even if the hardware version is different from the
> >>> software version. So I don't think we'll have a 1:1 mapping for all
> >>> configurations, regardless of whether the how is decided in kernel or
> >>> user space.
> >>
> >> Just to expand slightly on this. I don't think you can get to a 1:1
> >> mapping here. One reason is hardware typically has a TCAM and limited
> >> size. So you need a _policy_ to determine when to push rules into the
> >> hardware. The kernel doesn't know when to do this and I don't believe
> >> its the kernel's place to start enforcing policy like this. One thing I likely
> >> need to do is get some more "worlds" in rocker so we aren't stuck only
> >> thinking about the infinite size OF_DPA world. The OF_DPA world is only
> >> one world and not a terribly flexible one at that when compared with the
> >> NPU folk. So minimally you need a flag to indicate rules go into hardware
> >> vs software.
> >>
> >> That said I think the bigger mismatch between software and hardware is
> >> you program it differently because the data structures are different. Maybe
> >> a u32 example would help. For parsing with u32 you might build a parse
> >> graph with a root and some leaf nodes. In hardware you want to collapse
> >> this down onto the hardware. I argue this is not a kernel task because
> >> there are lots of ways to do this and there are trade-offs made with
> >> respect to space and performance and which table to use when it could be
> >> handled by a set of tables. Another example is a virtual switch possibly
> >> OVS but we have others. The software does some "unmasking" (there term)
> >> before sending the rules into the software dataplane cache. Basically this
> >> means we can ignore priority in the hash lookup. However this is not how you
> >> would optimally use hardware. Maybe I should do another write up with
> >> some more concrete examples.
> >>
> >> There are also lots of use cases to _not_ have hardware and software in
> >> sync. A flag allows this.
> >>
> >> My only point is I think we need to allow users to optimally use there
> >> hardware either via 'tc' or my previous 'flow' tool. Actually in my
> >> opinion I still think its best to have both interfaces.
> >>
> >> I'll go get some coffee now and hopefully that is somewhat clear.
> > 
> > 
> > I've been thinking about the policy apect of this, and the more I think about
> > it, the more I wonder if not allowing some sort of common policy in the kernel
> > is really the right thing to do here.  I know thats somewhat blasphemous, but
> > this isn't really administrative poilcy that we're talking about, at least not
> > 100%.  Its more of a behavioral profile that we're trying to enforce.  That may
> > be splitting hairs, but I think theres precidence for the latter.  That is to
> > say, we configure qdiscs to limit traffic flow to certain rates, and configure
> > policies which drop traffic that violates it (which includes random discard,
> > which is the antithesis of deterministic policy).  I'm not sure I see this as
> > any different, espcially if we limit its scope.  That is to say, why couldn't we
> > allow the kernel to program a predetermined set of policies that the admin can
> > set (i.e. offload routing to a hardware cache of X size with an lru
> > victimization).  If other well defined policies make sense, we can add them and
> > exposes options via iproute2 or some such to set them.  For the use case where
> > such pre-packaged policies don't make sense, we have things like the flow api to
> > offer users who want to be able to control their hardware in a more fine grained
> > approach.
> > 
> > Neil
> > 
> 
> Hi Neil,
> 
> I actually like this idea a lot. I might tweak a bit in that we could have some
> feature bits or something like feature bits that expose how to split up the
> hardware cache and give sizes.
> 
> So the hypervisor (see I think of end hosts) or administrators could come in and
> say I want a route table and a nft table. This creates a "flavor" over how the
> hardware is going to be used. Another use case may not be doing routing at all
> but have an application that wants to manage the hardware at a more fine grained
> level with the exception of some nft commands so it could have a "nft"+"flow"
> flavor. Insert your favorite use case here.
> 
Yes, this is exactly my thought.  Some rudimentary policy to codify common use
cases for hardware offload for the individals who want to do more traditionally
administered network tasks, but allow other users to go "off the reservation" if
you will, to implement more fine grained control hardware behavior for those
cases in which more non-traditional/strict policies are required

Two other things to note, the creation of these larger offload units (neighbor
table offload/nft/routing/etc), can have relationships drawn between them so
that a dependency graph can be created, allowing us to ensure that for a given
device, neighbor/arp can be offloaded before routing/nft is, which I know we've
all mused about as a necessity.

Additionally, and this is really a bit off topic, but since OVS was mentioned in
this thread several times, I'd make the assertion that OVS is a case where your
low level flow API would be more appropriate for use.  They (the OVS
developers), have moved lots of traffic forwarding decisions into user space,
and away from the kernel mechanism, and as such, would likely benefit from
custom offloading rules more than in-kernel dataplane offload objects.

Best
Neil

> .John
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 21:52           ` Simon Horman
@ 2015-02-27  1:22             ` Neil Horman
  2015-02-27  1:52               ` Tom Herbert
  2015-02-27  8:41               ` Thomas Graf
  0 siblings, 2 replies; 53+ messages in thread
From: Neil Horman @ 2015-02-27  1:22 UTC (permalink / raw)
  To: Simon Horman
  Cc: John Fastabend, Thomas Graf, Jiri Pirko, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On Fri, Feb 27, 2015 at 06:52:58AM +0900, Simon Horman wrote:
> On Thu, Feb 26, 2015 at 03:16:35PM -0500, Neil Horman wrote:
> > On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
> > > On 02/26/2015 05:33 AM, Thomas Graf wrote:
> > > > On 02/26/15 at 10:16am, Jiri Pirko wrote:
> > > >> Well, on netdev01, I believe that a consensus was reached that for every
> > > >> switch offloaded functionality there has to be an implementation in
> > > >> kernel.
> > > > 
> > > > Agreed. This should not prevent the policy being driven from user
> > > > space though.
> > > > 
> > > >> What John's Flow API originally did was to provide a way to
> > > >> configure hardware independently of kernel. So the right way is to
> > > >> configure kernel and, if hw allows it, to offload the configuration to hw.
> > > >>
> > > >> In this case, seems to me logical to offload from one place, that being
> > > >> TC. The reason is, as I stated above, the possible conversion from OVS
> > > >> datapath to TC.
> > > > 
> > > > Offloading of TC definitely makes a lot of sense. I think that even in
> > > > that case you will already encounter independent configuration of
> > > > hardware and kernel. Example: The hardware provides a fixed, generic
> > > > function to push up to n bytes onto a packet. This hardware function
> > > > could be used to implement TC actions "push_vlan", "push_vxlan",
> > > > "push_mpls". You would you would likely agree that TC should make use
> > > > of such a function even if the hardware version is different from the
> > > > software version. So I don't think we'll have a 1:1 mapping for all
> > > > configurations, regardless of whether the how is decided in kernel or
> > > > user space.
> > > 
> > > Just to expand slightly on this. I don't think you can get to a 1:1
> > > mapping here. One reason is hardware typically has a TCAM and limited
> > > size. So you need a _policy_ to determine when to push rules into the
> > > hardware. The kernel doesn't know when to do this and I don't believe
> > > its the kernel's place to start enforcing policy like this. One thing I likely
> > > need to do is get some more "worlds" in rocker so we aren't stuck only
> > > thinking about the infinite size OF_DPA world. The OF_DPA world is only
> > > one world and not a terribly flexible one at that when compared with the
> > > NPU folk. So minimally you need a flag to indicate rules go into hardware
> > > vs software.
> > > 
> > > That said I think the bigger mismatch between software and hardware is
> > > you program it differently because the data structures are different. Maybe
> > > a u32 example would help. For parsing with u32 you might build a parse
> > > graph with a root and some leaf nodes. In hardware you want to collapse
> > > this down onto the hardware. I argue this is not a kernel task because
> > > there are lots of ways to do this and there are trade-offs made with
> > > respect to space and performance and which table to use when it could be
> > > handled by a set of tables. Another example is a virtual switch possibly
> > > OVS but we have others. The software does some "unmasking" (there term)
> > > before sending the rules into the software dataplane cache. Basically this
> > > means we can ignore priority in the hash lookup. However this is not how you
> > > would optimally use hardware. Maybe I should do another write up with
> > > some more concrete examples.
> > > 
> > > There are also lots of use cases to _not_ have hardware and software in
> > > sync. A flag allows this.
> > > 
> > > My only point is I think we need to allow users to optimally use there
> > > hardware either via 'tc' or my previous 'flow' tool. Actually in my
> > > opinion I still think its best to have both interfaces.
> > > 
> > > I'll go get some coffee now and hopefully that is somewhat clear.
> > 
> > 
> > I've been thinking about the policy apect of this, and the more I think
> > about it, the more I wonder if not allowing some sort of common policy in
> > the kernel is really the right thing to do here.  I know thats somewhat
> > blasphemous, but this isn't really administrative poilcy that we're
> > talking about, at least not 100%.  Its more of a behavioral profile that
> > we're trying to enforce.  That may be splitting hairs, but I think theres
> > precidence for the latter.  That is to say, we configure qdiscs to limit
> > traffic flow to certain rates, and configure policies which drop traffic
> > that violates it (which includes random discard, which is the antithesis
> > of deterministic policy).  I'm not sure I see this as any different,
> > espcially if we limit its scope.  That is to say, why couldn't we allow
> > the kernel to program a predetermined set of policies that the admin can
> > set (i.e. offload routing to a hardware cache of X size with an lru
> > victimization).  If other well defined policies make sense, we can add
> > them and exposes options via iproute2 or some such to set them.  For the
> > use case where such pre-packaged policies don't make sense, we have
> > things like the flow api to offer users who want to be able to control
> > their hardware in a more fine grained approach.
> 
> In general I agree that it makes sense to have have sane offload policy
> in the kernel and provide a mechanism to override that. Things that already
> work should continue to work: just faster or with fewer CPU cycles consumed.
> 
Yes, exactly that, for the general traditional networking use case, that is
exactly what we want, to opportunistically move traffic faster with less load on
the cpu.  We don't nominally care what traffic is offloaded, as long as the
hardware does a better job than just software alone.  If we get an occasional
miss and have to do stuff in software, so be it.

> I am, however, not entirely convinced that it is always possible to
> implement such a sane default policy that is worth the code complexity -
> I'm thinking in particular of Open vSwitch where management of flows is
> already in user-space.
So, this is a case in which I think John F.'s low level flow API is more well
suited.  OVS has implemented a user space dataplane that circumvents alot of the
kernel mechanisms for traffic forwarding.  For that sort of application, the
traditional kernel offload "objects" aren't really appropriate.  Instead, OVS
can use the low level flow API to construct its own custom offload pipeline
using whatever rules and policies that it wants.

Of course, using the low level flow API is incompatible with the in-kernel
object offload idea that I'm proposing, but I see the two as able to co-exist,
much like firewalld co-exists with iptables.  You can use both, but you have to
be aware that using the lower layer interface might break the others higher
level oeprations.  And if that happens, its on you to manage it.

Best
Neil

> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27  1:22             ` Neil Horman
@ 2015-02-27  1:52               ` Tom Herbert
  2015-03-02 13:49                 ` Andy Gospodarek
  2015-02-27  8:41               ` Thomas Graf
  1 sibling, 1 reply; 53+ messages in thread
From: Tom Herbert @ 2015-02-27  1:52 UTC (permalink / raw)
  To: Neil Horman
  Cc: Simon Horman, John Fastabend, Thomas Graf, Jiri Pirko,
	Linux Netdev List, David Miller, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, jpettit, Joe Stringer,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, Shrijeet Mukherjee, Andy Gospodarek, bcrl

On Thu, Feb 26, 2015 at 5:22 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> On Fri, Feb 27, 2015 at 06:52:58AM +0900, Simon Horman wrote:
>> On Thu, Feb 26, 2015 at 03:16:35PM -0500, Neil Horman wrote:
>> > On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
>> > > On 02/26/2015 05:33 AM, Thomas Graf wrote:
>> > > > On 02/26/15 at 10:16am, Jiri Pirko wrote:
>> > > >> Well, on netdev01, I believe that a consensus was reached that for every
>> > > >> switch offloaded functionality there has to be an implementation in
>> > > >> kernel.
>> > > >
>> > > > Agreed. This should not prevent the policy being driven from user
>> > > > space though.
>> > > >
>> > > >> What John's Flow API originally did was to provide a way to
>> > > >> configure hardware independently of kernel. So the right way is to
>> > > >> configure kernel and, if hw allows it, to offload the configuration to hw.
>> > > >>
>> > > >> In this case, seems to me logical to offload from one place, that being
>> > > >> TC. The reason is, as I stated above, the possible conversion from OVS
>> > > >> datapath to TC.
>> > > >
>> > > > Offloading of TC definitely makes a lot of sense. I think that even in
>> > > > that case you will already encounter independent configuration of
>> > > > hardware and kernel. Example: The hardware provides a fixed, generic
>> > > > function to push up to n bytes onto a packet. This hardware function
>> > > > could be used to implement TC actions "push_vlan", "push_vxlan",
>> > > > "push_mpls". You would you would likely agree that TC should make use
>> > > > of such a function even if the hardware version is different from the
>> > > > software version. So I don't think we'll have a 1:1 mapping for all
>> > > > configurations, regardless of whether the how is decided in kernel or
>> > > > user space.
>> > >
>> > > Just to expand slightly on this. I don't think you can get to a 1:1
>> > > mapping here. One reason is hardware typically has a TCAM and limited
>> > > size. So you need a _policy_ to determine when to push rules into the
>> > > hardware. The kernel doesn't know when to do this and I don't believe
>> > > its the kernel's place to start enforcing policy like this. One thing I likely
>> > > need to do is get some more "worlds" in rocker so we aren't stuck only
>> > > thinking about the infinite size OF_DPA world. The OF_DPA world is only
>> > > one world and not a terribly flexible one at that when compared with the
>> > > NPU folk. So minimally you need a flag to indicate rules go into hardware
>> > > vs software.
>> > >
>> > > That said I think the bigger mismatch between software and hardware is
>> > > you program it differently because the data structures are different. Maybe
>> > > a u32 example would help. For parsing with u32 you might build a parse
>> > > graph with a root and some leaf nodes. In hardware you want to collapse
>> > > this down onto the hardware. I argue this is not a kernel task because
>> > > there are lots of ways to do this and there are trade-offs made with
>> > > respect to space and performance and which table to use when it could be
>> > > handled by a set of tables. Another example is a virtual switch possibly
>> > > OVS but we have others. The software does some "unmasking" (there term)
>> > > before sending the rules into the software dataplane cache. Basically this
>> > > means we can ignore priority in the hash lookup. However this is not how you
>> > > would optimally use hardware. Maybe I should do another write up with
>> > > some more concrete examples.
>> > >
>> > > There are also lots of use cases to _not_ have hardware and software in
>> > > sync. A flag allows this.
>> > >
>> > > My only point is I think we need to allow users to optimally use there
>> > > hardware either via 'tc' or my previous 'flow' tool. Actually in my
>> > > opinion I still think its best to have both interfaces.
>> > >
>> > > I'll go get some coffee now and hopefully that is somewhat clear.
>> >
>> >
>> > I've been thinking about the policy apect of this, and the more I think
>> > about it, the more I wonder if not allowing some sort of common policy in
>> > the kernel is really the right thing to do here.  I know thats somewhat
>> > blasphemous, but this isn't really administrative poilcy that we're
>> > talking about, at least not 100%.  Its more of a behavioral profile that
>> > we're trying to enforce.  That may be splitting hairs, but I think theres
>> > precidence for the latter.  That is to say, we configure qdiscs to limit
>> > traffic flow to certain rates, and configure policies which drop traffic
>> > that violates it (which includes random discard, which is the antithesis
>> > of deterministic policy).  I'm not sure I see this as any different,
>> > espcially if we limit its scope.  That is to say, why couldn't we allow
>> > the kernel to program a predetermined set of policies that the admin can
>> > set (i.e. offload routing to a hardware cache of X size with an lru
>> > victimization).  If other well defined policies make sense, we can add
>> > them and exposes options via iproute2 or some such to set them.  For the
>> > use case where such pre-packaged policies don't make sense, we have
>> > things like the flow api to offer users who want to be able to control
>> > their hardware in a more fine grained approach.
>>
>> In general I agree that it makes sense to have have sane offload policy
>> in the kernel and provide a mechanism to override that. Things that already
>> work should continue to work: just faster or with fewer CPU cycles consumed.
>>
> Yes, exactly that, for the general traditional networking use case, that is
> exactly what we want, to opportunistically move traffic faster with less load on
> the cpu.  We don't nominally care what traffic is offloaded, as long as the
> hardware does a better job than just software alone.  If we get an occasional
> miss and have to do stuff in software, so be it.
>
+1 on an in kernel "Network Resource Manager". This also came up in
Sunil's plan to configure RPS affinities from a driver so I'm taking
liberty by generalizing the concept :-).

>> I am, however, not entirely convinced that it is always possible to
>> implement such a sane default policy that is worth the code complexity -
>> I'm thinking in particular of Open vSwitch where management of flows is
>> already in user-space.
> So, this is a case in which I think John F.'s low level flow API is more well
> suited.  OVS has implemented a user space dataplane that circumvents alot of the
> kernel mechanisms for traffic forwarding.  For that sort of application, the
> traditional kernel offload "objects" aren't really appropriate.  Instead, OVS
> can use the low level flow API to construct its own custom offload pipeline
> using whatever rules and policies that it wants.
>
> Of course, using the low level flow API is incompatible with the in-kernel
> object offload idea that I'm proposing, but I see the two as able to co-exist,
> much like firewalld co-exists with iptables.  You can use both, but you have to
> be aware that using the lower layer interface might break the others higher
> level oeprations.  And if that happens, its on you to manage it.
>
> Best
> Neil
>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27  1:22             ` Neil Horman
  2015-02-27  1:52               ` Tom Herbert
@ 2015-02-27  8:41               ` Thomas Graf
  2015-02-27 12:59                 ` Neil Horman
                                   ` (2 more replies)
  1 sibling, 3 replies; 53+ messages in thread
From: Thomas Graf @ 2015-02-27  8:41 UTC (permalink / raw)
  To: Neil Horman
  Cc: Simon Horman, John Fastabend, Jiri Pirko, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On 02/26/15 at 08:22pm, Neil Horman wrote:
> Yes, exactly that, for the general traditional networking use case, that is
> exactly what we want, to opportunistically move traffic faster with less load on
> the cpu.  We don't nominally care what traffic is offloaded, as long as the
> hardware does a better job than just software alone.  If we get an occasional
> miss and have to do stuff in software, so be it.

Blind random offload of some packets is better than nothing but knowing
and having control over which packets are offloaded is essential. You
typically don't want to randomly give one flow priority over another ;-)
Some software CPUs might not be able to handle the load. I know what
you mean though and as long as we allow to disable and overwrite this
behaviour we are good.

> So, this is a case in which I think John F.'s low level flow API is more well
> suited.  OVS has implemented a user space dataplane that circumvents alot of the
> kernel mechanisms for traffic forwarding.  For that sort of application, the
> traditional kernel offload "objects" aren't really appropriate.  Instead, OVS
> can use the low level flow API to construct its own custom offload pipeline
> using whatever rules and policies that it wants.

Maybe I'm misunderstanding your statement here but I think it's essential
that the kernel is able to handle whatever we program in hardware even
if the hardware tables look differrent than the software tables, no matter
whether the configuration occurs through OVS or not. A punt to software
should always work even if it does not happen. So while I believe that
OVS needs more control over the hardware than available through the
datapath cache it must program both the hardware and software in parallel
even though the building blocks for doing so might look different.

> Of course, using the low level flow API is incompatible with the in-kernel
> object offload idea that I'm proposing, but I see the two as able to co-exist,
> much like firewalld co-exists with iptables.  You can use both, but you have to
> be aware that using the lower layer interface might break the others higher
> level oeprations.  And if that happens, its on you to manage it.

I think this does not have to be mutually exclusive. An example would
be a well defined egress qdisc which is offloaded into it's own table.
If OVS is aware of the table it can make use of it while configuring
that table through the regular qdisc software API.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 21:11           ` John Fastabend
  2015-02-27  1:17             ` Neil Horman
@ 2015-02-27  8:53             ` Jiri Pirko
  2015-02-27 16:00               ` John Fastabend
  1 sibling, 1 reply; 53+ messages in thread
From: Jiri Pirko @ 2015-02-27  8:53 UTC (permalink / raw)
  To: John Fastabend
  Cc: Neil Horman, Thomas Graf, Simon Horman, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

Thu, Feb 26, 2015 at 10:11:23PM CET, john.r.fastabend@intel.com wrote:
>On 02/26/2015 12:16 PM, Neil Horman wrote:
>> On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
>>> On 02/26/2015 05:33 AM, Thomas Graf wrote:
>>>> On 02/26/15 at 10:16am, Jiri Pirko wrote:
>>>>> Well, on netdev01, I believe that a consensus was reached that for every
>>>>> switch offloaded functionality there has to be an implementation in
>>>>> kernel.
>>>>
>>>> Agreed. This should not prevent the policy being driven from user
>>>> space though.
>>>>
>>>>> What John's Flow API originally did was to provide a way to
>>>>> configure hardware independently of kernel. So the right way is to
>>>>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>>>>
>>>>> In this case, seems to me logical to offload from one place, that being
>>>>> TC. The reason is, as I stated above, the possible conversion from OVS
>>>>> datapath to TC.
>>>>
>>>> Offloading of TC definitely makes a lot of sense. I think that even in
>>>> that case you will already encounter independent configuration of
>>>> hardware and kernel. Example: The hardware provides a fixed, generic
>>>> function to push up to n bytes onto a packet. This hardware function
>>>> could be used to implement TC actions "push_vlan", "push_vxlan",
>>>> "push_mpls". You would you would likely agree that TC should make use
>>>> of such a function even if the hardware version is different from the
>>>> software version. So I don't think we'll have a 1:1 mapping for all
>>>> configurations, regardless of whether the how is decided in kernel or
>>>> user space.
>>>
>>> Just to expand slightly on this. I don't think you can get to a 1:1
>>> mapping here. One reason is hardware typically has a TCAM and limited
>>> size. So you need a _policy_ to determine when to push rules into the
>>> hardware. The kernel doesn't know when to do this and I don't believe
>>> its the kernel's place to start enforcing policy like this. One thing I likely
>>> need to do is get some more "worlds" in rocker so we aren't stuck only
>>> thinking about the infinite size OF_DPA world. The OF_DPA world is only
>>> one world and not a terribly flexible one at that when compared with the
>>> NPU folk. So minimally you need a flag to indicate rules go into hardware
>>> vs software.
>>>
>>> That said I think the bigger mismatch between software and hardware is
>>> you program it differently because the data structures are different. Maybe
>>> a u32 example would help. For parsing with u32 you might build a parse
>>> graph with a root and some leaf nodes. In hardware you want to collapse
>>> this down onto the hardware. I argue this is not a kernel task because
>>> there are lots of ways to do this and there are trade-offs made with
>>> respect to space and performance and which table to use when it could be
>>> handled by a set of tables. Another example is a virtual switch possibly
>>> OVS but we have others. The software does some "unmasking" (there term)
>>> before sending the rules into the software dataplane cache. Basically this
>>> means we can ignore priority in the hash lookup. However this is not how you
>>> would optimally use hardware. Maybe I should do another write up with
>>> some more concrete examples.
>>>
>>> There are also lots of use cases to _not_ have hardware and software in
>>> sync. A flag allows this.
>>>
>>> My only point is I think we need to allow users to optimally use there
>>> hardware either via 'tc' or my previous 'flow' tool. Actually in my
>>> opinion I still think its best to have both interfaces.
>>>
>>> I'll go get some coffee now and hopefully that is somewhat clear.
>> 
>> 
>> I've been thinking about the policy apect of this, and the more I think about
>> it, the more I wonder if not allowing some sort of common policy in the kernel
>> is really the right thing to do here.  I know thats somewhat blasphemous, but
>> this isn't really administrative poilcy that we're talking about, at least not
>> 100%.  Its more of a behavioral profile that we're trying to enforce.  That may
>> be splitting hairs, but I think theres precidence for the latter.  That is to
>> say, we configure qdiscs to limit traffic flow to certain rates, and configure
>> policies which drop traffic that violates it (which includes random discard,
>> which is the antithesis of deterministic policy).  I'm not sure I see this as
>> any different, espcially if we limit its scope.  That is to say, why couldn't we
>> allow the kernel to program a predetermined set of policies that the admin can
>> set (i.e. offload routing to a hardware cache of X size with an lru
>> victimization).  If other well defined policies make sense, we can add them and
>> exposes options via iproute2 or some such to set them.  For the use case where
>> such pre-packaged policies don't make sense, we have things like the flow api to
>> offer users who want to be able to control their hardware in a more fine grained
>> approach.
>> 
>> Neil
>> 
>
>Hi Neil,
>
>I actually like this idea a lot. I might tweak a bit in that we could have some
>feature bits or something like feature bits that expose how to split up the
>hardware cache and give sizes.
>
>So the hypervisor (see I think of end hosts) or administrators could come in and
>say I want a route table and a nft table. This creates a "flavor" over how the
>hardware is going to be used. Another use case may not be doing routing at all
>but have an application that wants to manage the hardware at a more fine grained
>level with the exception of some nft commands so it could have a "nft"+"flow"
>flavor. Insert your favorite use case here.

I'm not sure I understand. You said that admin could say: "I want a
route table and a nft table". But how does he say it? Isn't is enough
just to insert some rules into these 2 things and that would give hw a
clue what the admin is doing and what he wants? I believe that this
offload should happen transparently.

Of course, you may want to balance resources as you said the hw capacity
is limited. But I would leave that optional. API unknown so far...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 18:15         ` Tom Herbert
  2015-02-26 19:05           ` Thomas Graf
@ 2015-02-27  9:00           ` Jiri Pirko
  2015-02-28 20:02           ` David Miller
  2 siblings, 0 replies; 53+ messages in thread
From: Jiri Pirko @ 2015-02-27  9:00 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Simon Horman, Linux Netdev List, David Miller, Neil Horman,
	Andy Gospodarek, Thomas Graf, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, John Fastabend,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, shrijeet, Andy Gospodarek, bcrl

Thu, Feb 26, 2015 at 07:15:24PM CET, therbert@google.com wrote:
>On Thu, Feb 26, 2015 at 8:17 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Thu, Feb 26, 2015 at 05:04:31PM CET, therbert@google.com wrote:
>>>On Thu, Feb 26, 2015 at 1:16 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>> Thu, Feb 26, 2015 at 09:38:01AM CET, simon.horman@netronome.com wrote:
>>>>>Hi Jiri,
>>>>>
>>>>>On Thu, Feb 26, 2015 at 08:42:14AM +0100, Jiri Pirko wrote:
>>>>>> Hello everyone.
>>>>>>
>>>>>> I would like to discuss big next step for switch offloading. Probably
>>>>>> the most complicated one we have so far. That is to be able to offload flows.
>>>>>> Leaving nftables aside for a moment, I see 2 big usecases:
>>>>>> - TC filters and actions offload.
>>>>>> - OVS key match and actions offload.
>>>>>>
>>>>>> I think it might sense to ignore OVS for now. The reason is ongoing efford
>>>>>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
>>>>>> will not longer be needed and we'll get it for free with TC offload
>>>>>> implementation. So we can focus on TC now.
>>>>>>
>>>>>> Here is my list of actions to achieve some results in near future:
>>>>>> 1) finish cls_openflow classifier and iproute part of it
>>>>>> 2) extend switchdev API for TC cls and acts offloading (using John's flow api?)
>>>>>> 3) use rocker to provide offload for cls_openflow and couple of selected actions
>>>>>> 4) improve cls_openflow performance (hashtables etc)
>>>>>> 5) improve TC subsystem performance in both slow and fast path
>>>>>>     -RTNL mutex and qdisc lock removal/reduction, lockless stats update.
>>>>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>>>>     -ingress qdisc attach, act_mirred target
>>>>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>>>>> 8) implement TC act_mpls
>>>>>> 9) suggest to switch OVS userspace from OVS genl to TC API
>>>>>>
>>>>>> This is my personal action list, but you are *very welcome* to step in to help.
>>>>>> Point 2) haunts me at night....
>>>>>> I believe that John is already working on 2) and part of 3).
>>>>>>
>>>>>> What do you think?
>>>>>
>>>> >From my point of view the question of replacing the kernel datapath with TC
>>>>>is orthogonal to the question of flow offloads. This is because I believe
>>>>>there is some consensus around the idea that, at least in the case of Open
>>>>>vSwitch, the decision to offload flows should made in user-space where
>>>>>flows are already managed. And in that case datapath will not be
>>>>>transparently offloading of flows.  And thus flow offload may be performed
>>>>>independently of the kernel datapath, weather that be via flow manipulation
>>>>>portions of John's Flow API, TC, or some other means.
>>>>
>>>> Well, on netdev01, I believe that a consensus was reached that for every
>>>> switch offloaded functionality there has to be an implementation in
>>>> kernel. What John's Flow API originally did was to provide a way to
>>>> configure hardware independently of kernel. So the right way is to
>>>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>>>
>>>> In this case, seems to me logical to offload from one place, that being
>>>> TC. The reason is, as I stated above, the possible conversion from OVS
>>>> datapath to TC.
>>>>
>>>Sorry if I'm asking dumb questions, but this is about where I usually
>>>start to get lost in these discussions ;-). Is the aim of switch
>>>offload to offload OVS or kernel functions of routing, iptables, tc,
>>>etc.? These are very different I believe. As far as I can tell OVS
>>>model of "flows" (like Openflow) is currently incompatible with the
>>>rest of the kernel. So if the plan is convert OVS datapath to TC does
>>>that mean introducing that model into core kernel?
>>
>> The thing is that you can achieve very similar model as OVS with TC.
>> OVS uses rx_handler.
>> TC uses handle_ing hook.
>> Those are in the same place in the receive path.
>> After that, ovs processes skb through key matches, and does some actions.
>> The same is done in TC cls_* and act_*.
>> Finally skb is forwarded to some netdev by dev_queue_xmit (in both OVS
>> and TC).
>>
>> I certainly simplified things. But I do not see the different model you
>> are talking about.
>>
>But, routing (aka switching) in the stack is not configured through
>TC. We have a whole forwarding and routing infrastructure (eg.
>iproute) with optimizations that  allow routes to be cached in
>sockets, etc. To me, it seems like offloading that basic functionality
>is a prerequisite before attempting to offload more advanced policy
>mechanisms of TC, netfilter, etc.

I believe we are talking about 2 separate cases. Case one is to
offload L2, L3 traditional infrastructure we have in kernel now.

Case two is to offload independent OVS DP infrastructure. I'm just saying
that OVS DP can be replaced by TC (subpart of that including ingress
qdisq, cls and acts). Then we can offload this TC subpart.

These 2 cases can be handled separatelly.

Also I believe that offload needs to be done per-partes one way or
another. So I imagine that cls_openflow can be the first classifier to
get offloaded.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27  8:41               ` Thomas Graf
@ 2015-02-27 12:59                 ` Neil Horman
  2015-03-01  9:36                 ` Arad, Ronen
  2015-03-01  9:47                 ` Arad, Ronen
  2 siblings, 0 replies; 53+ messages in thread
From: Neil Horman @ 2015-02-27 12:59 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Simon Horman, John Fastabend, Jiri Pirko, netdev, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On Fri, Feb 27, 2015 at 08:41:41AM +0000, Thomas Graf wrote:
> On 02/26/15 at 08:22pm, Neil Horman wrote:
> > Yes, exactly that, for the general traditional networking use case, that is
> > exactly what we want, to opportunistically move traffic faster with less load on
> > the cpu.  We don't nominally care what traffic is offloaded, as long as the
> > hardware does a better job than just software alone.  If we get an occasional
> > miss and have to do stuff in software, so be it.
> 
> Blind random offload of some packets is better than nothing but knowing
> and having control over which packets are offloaded is essential. You
> typically don't want to randomly give one flow priority over another ;-)
> Some software CPUs might not be able to handle the load. I know what
> you mean though and as long as we allow to disable and overwrite this
> behaviour we are good.
> 

Yes, exactly this.  I disagree with your assertion that what I'm proposing is
blind or random (quite the opposite), I'm proposing best effort offload of high
level kernel functions with well defined generic policies for what to do on
resource exhaustion overflow.  You are correct that this might lead to random
(or perhaps better put, arbitrary), flows reaching the cpu occasionally, but
thats the best effort part.  Using canned policies will lead to that, and if
thats intolerable to you as an administrator, then you have the flow api to
offer more fine grained control over what exactly you want to do.

Interestingly, As part of a policy specification, I wonder if we could
incorporate a flow rate aspect that would lead a driver to only offload a
given flow for a given functionality if it sends more than X amount of traffic
through the cpu.

> > So, this is a case in which I think John F.'s low level flow API is more well
> > suited.  OVS has implemented a user space dataplane that circumvents alot of the
> > kernel mechanisms for traffic forwarding.  For that sort of application, the
> > traditional kernel offload "objects" aren't really appropriate.  Instead, OVS
> > can use the low level flow API to construct its own custom offload pipeline
> > using whatever rules and policies that it wants.
> 
> Maybe I'm misunderstanding your statement here but I think it's essential
> that the kernel is able to handle whatever we program in hardware even
> if the hardware tables look differrent than the software tables, no matter
> whether the configuration occurs through OVS or not. A punt to software
> should always work even if it does not happen. So while I believe that
> OVS needs more control over the hardware than available through the
> datapath cache it must program both the hardware and software in parallel
> even though the building blocks for doing so might look different.
> 
I think parallel programming of the hardware and software is essential in _most_
use cases, but possibly not all, and in those cases, I think Johns flow API is
the solution.  Mine makes sense for all the traditional use cases in which we
just want more packets to go faster, and be able to deal with the consequences
in a best effort fashion if the fast path can't do its job.

As an example, lets imagine that a company wishes to build an appliance that
allows ipsec tunneling using a custom assymetric crypto algorithm that they have
codified with a TCM unit in the hardware datapath.  All we can do is program the
public and private keys on the hardware.  In this case we can setup a software
datapath to represent the mirror image of the hardware datapath, but it would be
non-functional, as we don't know what the magic crypto alg is.  In this case
Johns Flow API is essential because its able to setup the datapath in hardware
that has no true software parallel.  Additionaly its imperative to use that API
to ensure that all flows via that tunnel go through hardware, as there is no
point in overflowing the traffic to the cpu.

> > Of course, using the low level flow API is incompatible with the in-kernel
> > object offload idea that I'm proposing, but I see the two as able to co-exist,
> > much like firewalld co-exists with iptables.  You can use both, but you have to
> > be aware that using the lower layer interface might break the others higher
> > level oeprations.  And if that happens, its on you to manage it.
> 
> I think this does not have to be mutually exclusive. An example would
> be a well defined egress qdisc which is offloaded into it's own table.
> If OVS is aware of the table it can make use of it while configuring
> that table through the regular qdisc software API.
> 
Absolutely, I should have said "may be incompatible".  In this case my thought
was that, if you offloaded l2 and l3 forwarding to a device, some of the
dataplane elements to preform those functions would be allocated to that
purpose.  If you then were to use the lower level Flow api to do some sort of
custom datapath manipulation in hardware, the hardware may or may not have
resources to dedicate to that purpose.  In the event that they did not the flow
API would have to fail (or vice versa if you used the flow api first, then tried
to offload l2 forwarding).  If however sufficient resources were available to do
both, then all is well and the two can co-exist.

Regards
Neil

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Named sockets WAS(Re: Flows! Offload them.
  2015-02-26 11:39   ` Jiri Pirko
  2015-02-26 15:42     ` Sowmini Varadhan
@ 2015-02-27 13:15     ` Jamal Hadi Salim
  1 sibling, 0 replies; 53+ messages in thread
From: Jamal Hadi Salim @ 2015-02-27 13:15 UTC (permalink / raw)
  To: Jiri Pirko, Sowmini Varadhan
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	jpettit, joestringer, john.r.fastabend, sfeldma, f.fainelli,
	roopa, linville, simon.horman, shrijeet, gospo, bcrl


Sorry - catching up with the discussion; so many parallel topics
buried..
I just wanna put my TU ("thumbs up", yes I didnt want to do the
pedestrian +1) for the named socket concept. With the explosion of
in-kernel sockets all which intend to do host protocol processing
this would be a very nice abstraction to have.
But i do believe this could also be useful for user space redirecting;
we already have very scalable socket code interfacing which could be
taken advantage of.

cheers,
jamal

On 02/26/15 06:39, Jiri Pirko wrote:
> Thu, Feb 26, 2015 at 12:22:52PM CET, sowmini.varadhan@oracle.com wrote:
>> On (02/26/15 08:42), Jiri Pirko wrote:
>>> 6) implement "named sockets" (working name) and implement TC support for that
>>>      -ingress qdisc attach, act_mirred target
>>> 7) allow tunnels (VXLAN, Geneve, GRE) to be created as named sockets
>>
>> Can you elaborate a bit on the above two?
>
> Sure. If you look into net/openvswitch/vport-vxlan.c for example, there
> is a socket created by vxlan_sock_add. vxlan_rcv is called on rx and
> vxlan_xmit_skb to xmit.
>
> What I have on mind is to allow to create tunnels using "ip" but not as
> a device but rather just as a wrapper of these functions (and others alike).
>
> To identify the instance we name it (OVS has it identified and vport).
> After that, tc could allow to attach ingress qdisk not only to a device,
> but to this named socket as well. Similary with tc action mirred, it would
> be possible to forward not only to a device, but to this named socket as
> well. All should be very light.
>
>
>>
>> FWIW I've been looking at the problem of RDS over TCP, which is
>> an instance of layered sockets that tunnels the application payload
>> in TCP.
>>
>> RDS over IB provides QoS support using the features available in
>> IB- to supply an analog of that for RDS-TCP, you'd need to plug
>> into tc's CBQ support, and also provide hooks for packet (.1p, dscp)
>> marking.
>>
>> Perhaps there is some overlap to what you are thinking of in #6 and #7
>> above?
>
> I'm not talking about QoS at all. See the description above.
>
> Jiri
>
>>
>> --Sowmini

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 15:39         ` John Fastabend
       [not found]           ` <CAGpadYHfNcDR2ojubkCJ8-nJTQkdLkPsAwJu0wOKU82bLDzhww@mail.gmail.com>
@ 2015-02-27 13:33           ` Jamal Hadi Salim
  2015-02-27 15:23             ` John Fastabend
  1 sibling, 1 reply; 53+ messages in thread
From: Jamal Hadi Salim @ 2015-02-27 13:33 UTC (permalink / raw)
  To: John Fastabend, Shrijeet Mukherjee, Thomas Graf
  Cc: Jiri Pirko, Simon Horman, netdev, davem, nhorman, andy, dborkman,
	ogerlitz, jesse, jpettit, joestringer, sfeldma, f.fainelli,
	roopa, linville, gospo, bcrl

On 02/26/15 10:39, John Fastabend wrote:

> So I think there is a relatively simple solution for this. Assuming
> I read the description correctly namely packet ingress' nic/switch
> and you want it to land in a namespace.
>
> Today we support offloaded macvlan's and SR-IOV. What I would expect
> is user creates a set of macvlan's that are "offloaded" this just means
> they are bound to a set of hardware queues and do not go through the
> normal receive path. Then assigning these to a namespace is the same
> as any other netdev.
>
> Hardware has an action to forward to "VSI" (virtual station interface)
> which matches on a packet and forwards it to either a VF or set of
> queues bound to a macvlan. Or you can do the forwarding using standards
> based protocols such as EVB (edge virtual bridging).
>
> So its a simple set of steps with the flow api,
>
> 	1. create macvlan with dfwd_offload set
> 	2. push netdev into namespace
> 	3. add flow rule to match traffic and send to VSI
> 		./flow -i ethx set_rule match xyz action fwd_vsi 3
>
> The VSI# is reported by ip link today its a bit clumsy so that interface
> could be cleaned up.
>
> Here is a case where trying to map this onto a 'tc' action in software
> is a bit awkward and you convoluted what is really a simple operation.
> Anyways this is not really an "offload" in the sense that your taking
> something that used to run in software and moving it 1:1 into hardware.
> Adding SR-IOV/VMDQ support requries new constructs. By the way if you
> don't like my "flow" tool and you want to move it onto "tc" that could
> be done as well but the steps are the same.
>

Sorry for the top post - just wanted to leave the context intact.
TU (that is a "thumbs up" from an anti +1 person) on what you said
above. But i dont see the issue you bring up in step #3. If i was to
say:

tc filter add ingress ethx classifier foo priority X \
match xyz action redirect macvlan3 offload

where "offload" sets the netlink or classifier specific instruction
to offload.

You can easily map macvlan3 to vsi 3.

cheers,
jamal

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Driver level interface WAS(Re: Flows! Offload them.
  2015-02-26 20:58   ` John Fastabend
  2015-02-26 21:45     ` Florian Fainelli
@ 2015-02-27 14:01     ` Jamal Hadi Salim
  1 sibling, 0 replies; 53+ messages in thread
From: Jamal Hadi Salim @ 2015-02-27 14:01 UTC (permalink / raw)
  To: John Fastabend, Florian Fainelli, Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, jpettit,
	joestringer, sfeldma, roopa, linville, simon.horman, shrijeet,
	gospo, bcrl

On 02/26/15 15:58, John Fastabend wrote:

> My thinking on this is to use the FlowAPI ndo_ops as the bottom layer.
> What I would much prefer (having to actually write drivers) is that
> we have one API to the driver and tc, nft, whatever map onto that API.
>
> Then my driver implements a ndo_set_flow op and a ndo_del_flow op. What
> I'm working on now is the map from tc onto the flow API I'm hoping this
> sounds like a good idea to folks.
>

In the name of starting simple (just like we did with L2 and upcoming
L3), I think the idea is sane.
If a driver exposes a "5 tuple classifier" - that is a hell lot
easier to deal with from a policy perspective. I dont need to know
that when you tell me "I can handle 10K rules" that somehow in your
clever magic (or "IPR" as some people like to claim it) you managed
to use only 20K Tcam entries or 5K compressed Sram  entries. That
becomes your problem as a driver. IOW:
the driver should be in charge of the hardware resource management.
In my experience, it is a nightmare having to deal with a situation
where i just installed a rule and now in some bizare twisted
plot i just lost or gained 100 entries.
So a strategy for a driver interface depending on underlying hardware
is the right thing to do; in such a case i think tcam based hardware
offload would be fine fit for your API. There may be other strategies
to deal with different hardware - however, as long as we are consistent
we should be able to replace things when necessary.

Likewise i find the simple "set" thing to be a usability challenge
(we have done it in tc with pedit, but why the fsck would i use that
to change an ip address? could easily have done the nat action with it,
but there is a point where usability becomes paramount)

> Neil, suggested we might need a reservation concept where tc can reserve
> some space in a TCAM, similarly nft can reserve some space. Also I have
> applications in user space that want to reserve some space to offload
> their specific data structures. This idea seems like a good one to me.
>

I think this is treading into "advanced territory" as well.
Who gets to decide on what app uses what is reserved?
 From a kernel perspective:
I can understand conceptually we could label the entries to an owner.
We already do this with routing (look for field "protocol" in rtm).
However, theres a big can of worms that will need to be opened - what
happens when the owner of such reservation dies etc?
is tc/nftables allowed to add rules or it is based on user id?
You need to deal not just with such issues but authentication,
authorization etc of individual apps - such richness belongs to user
space imo.

cheers,
jamal

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27 13:33           ` Jamal Hadi Salim
@ 2015-02-27 15:23             ` John Fastabend
  2015-03-02 13:45               ` Jamal Hadi Salim
  0 siblings, 1 reply; 53+ messages in thread
From: John Fastabend @ 2015-02-27 15:23 UTC (permalink / raw)
  To: Jamal Hadi Salim, Shrijeet Mukherjee, Thomas Graf
  Cc: Jiri Pirko, Simon Horman, netdev, davem, nhorman, andy, dborkman,
	ogerlitz, jesse, jpettit, joestringer, sfeldma, f.fainelli,
	roopa, linville, gospo, bcrl

On 02/27/2015 05:33 AM, Jamal Hadi Salim wrote:
> On 02/26/15 10:39, John Fastabend wrote:
> 
>> So I think there is a relatively simple solution for this. Assuming
>> I read the description correctly namely packet ingress' nic/switch
>> and you want it to land in a namespace.
>>
>> Today we support offloaded macvlan's and SR-IOV. What I would expect
>> is user creates a set of macvlan's that are "offloaded" this just means
>> they are bound to a set of hardware queues and do not go through the
>> normal receive path. Then assigning these to a namespace is the same
>> as any other netdev.
>>
>> Hardware has an action to forward to "VSI" (virtual station interface)
>> which matches on a packet and forwards it to either a VF or set of
>> queues bound to a macvlan. Or you can do the forwarding using standards
>> based protocols such as EVB (edge virtual bridging).
>>
>> So its a simple set of steps with the flow api,
>>
>>     1. create macvlan with dfwd_offload set
>>     2. push netdev into namespace
>>     3. add flow rule to match traffic and send to VSI
>>         ./flow -i ethx set_rule match xyz action fwd_vsi 3
>>
>> The VSI# is reported by ip link today its a bit clumsy so that interface
>> could be cleaned up.
>>
>> Here is a case where trying to map this onto a 'tc' action in software
>> is a bit awkward and you convoluted what is really a simple operation.
>> Anyways this is not really an "offload" in the sense that your taking
>> something that used to run in software and moving it 1:1 into hardware.
>> Adding SR-IOV/VMDQ support requries new constructs. By the way if you
>> don't like my "flow" tool and you want to move it onto "tc" that could
>> be done as well but the steps are the same.
>>
> 
> Sorry for the top post - just wanted to leave the context intact.
> TU (that is a "thumbs up" from an anti +1 person) on what you said
> above. But i dont see the issue you bring up in step #3. If i was to
> say:
> 
> tc filter add ingress ethx classifier foo priority X \
> match xyz action redirect macvlan3 offload
> 
> where "offload" sets the netlink or classifier specific instruction
> to offload.
> 
> You can easily map macvlan3 to vsi 3.
> 

This is why I said if you want to map my "flow" tool onto "tc" it can
be done. :) I made a jump from macvlans that have net device representations
to VFs being assigned to user space (VMs) where there is no net device
to "redirect" to. So my explanation wasn't clear. Couple additional notes,

The only issue I have with your 'tc' case is the "ingress" qdisc is per port where
my flow tool is scoped at the switch pipeline so I would modify your cmd
slightly and use a new classifier hook,

 tc filter add sw-pipeline ethx classifier flow priority X \
    match xyz action redirect macvlan3 offload

The other thing to note is "redirect" to macvlan3 doesn't handle the case
where the VF is direct assigned to a VM in this case we don't have a netdev
in the hypervisor to "redirect" to. Using the VSI# allows the hardware
to translate the redirect cmd to the VF.

So one more slight tweak,

 tc filter add sw-pipeline ethx classifier flow priority X \
    match xyz action redirect vsi:2 offload

use the "vsi:" prefix to indicate this is for a hardware mapped VSI this
way the user can use what ever notion is more convenient.

> cheers,
> jamal
> 
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27  8:53             ` Jiri Pirko
@ 2015-02-27 16:00               ` John Fastabend
  0 siblings, 0 replies; 53+ messages in thread
From: John Fastabend @ 2015-02-27 16:00 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: John Fastabend, Neil Horman, Thomas Graf, Simon Horman, netdev,
	davem, andy, dborkman, ogerlitz, jesse, jpettit, joestringer,
	jhs, sfeldma, f.fainelli, roopa, linville, shrijeet, gospo, bcrl

On 02/27/2015 12:53 AM, Jiri Pirko wrote:
> Thu, Feb 26, 2015 at 10:11:23PM CET, john.r.fastabend@intel.com wrote:
>> On 02/26/2015 12:16 PM, Neil Horman wrote:
>>> On Thu, Feb 26, 2015 at 07:23:36AM -0800, John Fastabend wrote:
>>>> On 02/26/2015 05:33 AM, Thomas Graf wrote:
>>>>> On 02/26/15 at 10:16am, Jiri Pirko wrote:
>>>>>> Well, on netdev01, I believe that a consensus was reached that for every
>>>>>> switch offloaded functionality there has to be an implementation in
>>>>>> kernel.
>>>>>
>>>>> Agreed. This should not prevent the policy being driven from user
>>>>> space though.
>>>>>
>>>>>> What John's Flow API originally did was to provide a way to
>>>>>> configure hardware independently of kernel. So the right way is to
>>>>>> configure kernel and, if hw allows it, to offload the configuration to hw.
>>>>>>
>>>>>> In this case, seems to me logical to offload from one place, that being
>>>>>> TC. The reason is, as I stated above, the possible conversion from OVS
>>>>>> datapath to TC.
>>>>>
>>>>> Offloading of TC definitely makes a lot of sense. I think that even in
>>>>> that case you will already encounter independent configuration of
>>>>> hardware and kernel. Example: The hardware provides a fixed, generic
>>>>> function to push up to n bytes onto a packet. This hardware function
>>>>> could be used to implement TC actions "push_vlan", "push_vxlan",
>>>>> "push_mpls". You would you would likely agree that TC should make use
>>>>> of such a function even if the hardware version is different from the
>>>>> software version. So I don't think we'll have a 1:1 mapping for all
>>>>> configurations, regardless of whether the how is decided in kernel or
>>>>> user space.
>>>>
>>>> Just to expand slightly on this. I don't think you can get to a 1:1
>>>> mapping here. One reason is hardware typically has a TCAM and limited
>>>> size. So you need a _policy_ to determine when to push rules into the
>>>> hardware. The kernel doesn't know when to do this and I don't believe
>>>> its the kernel's place to start enforcing policy like this. One thing I likely
>>>> need to do is get some more "worlds" in rocker so we aren't stuck only
>>>> thinking about the infinite size OF_DPA world. The OF_DPA world is only
>>>> one world and not a terribly flexible one at that when compared with the
>>>> NPU folk. So minimally you need a flag to indicate rules go into hardware
>>>> vs software.
>>>>
>>>> That said I think the bigger mismatch between software and hardware is
>>>> you program it differently because the data structures are different. Maybe
>>>> a u32 example would help. For parsing with u32 you might build a parse
>>>> graph with a root and some leaf nodes. In hardware you want to collapse
>>>> this down onto the hardware. I argue this is not a kernel task because
>>>> there are lots of ways to do this and there are trade-offs made with
>>>> respect to space and performance and which table to use when it could be
>>>> handled by a set of tables. Another example is a virtual switch possibly
>>>> OVS but we have others. The software does some "unmasking" (there term)
>>>> before sending the rules into the software dataplane cache. Basically this
>>>> means we can ignore priority in the hash lookup. However this is not how you
>>>> would optimally use hardware. Maybe I should do another write up with
>>>> some more concrete examples.
>>>>
>>>> There are also lots of use cases to _not_ have hardware and software in
>>>> sync. A flag allows this.
>>>>
>>>> My only point is I think we need to allow users to optimally use there
>>>> hardware either via 'tc' or my previous 'flow' tool. Actually in my
>>>> opinion I still think its best to have both interfaces.
>>>>
>>>> I'll go get some coffee now and hopefully that is somewhat clear.
>>>
>>>
>>> I've been thinking about the policy apect of this, and the more I think about
>>> it, the more I wonder if not allowing some sort of common policy in the kernel
>>> is really the right thing to do here.  I know thats somewhat blasphemous, but
>>> this isn't really administrative poilcy that we're talking about, at least not
>>> 100%.  Its more of a behavioral profile that we're trying to enforce.  That may
>>> be splitting hairs, but I think theres precidence for the latter.  That is to
>>> say, we configure qdiscs to limit traffic flow to certain rates, and configure
>>> policies which drop traffic that violates it (which includes random discard,
>>> which is the antithesis of deterministic policy).  I'm not sure I see this as
>>> any different, espcially if we limit its scope.  That is to say, why couldn't we
>>> allow the kernel to program a predetermined set of policies that the admin can
>>> set (i.e. offload routing to a hardware cache of X size with an lru
>>> victimization).  If other well defined policies make sense, we can add them and
>>> exposes options via iproute2 or some such to set them.  For the use case where
>>> such pre-packaged policies don't make sense, we have things like the flow api to
>>> offer users who want to be able to control their hardware in a more fine grained
>>> approach.
>>>
>>> Neil
>>>
>>
>> Hi Neil,
>>
>> I actually like this idea a lot. I might tweak a bit in that we could have some
>> feature bits or something like feature bits that expose how to split up the
>> hardware cache and give sizes.
>>
>> So the hypervisor (see I think of end hosts) or administrators could come in and
>> say I want a route table and a nft table. This creates a "flavor" over how the
>> hardware is going to be used. Another use case may not be doing routing at all
>> but have an application that wants to manage the hardware at a more fine grained
>> level with the exception of some nft commands so it could have a "nft"+"flow"
>> flavor. Insert your favorite use case here.
>
> I'm not sure I understand. You said that admin could say: "I want a
> route table and a nft table". But how does he say it? Isn't is enough
> just to insert some rules into these 2 things and that would give hw a
> clue what the admin is doing and what he wants? I believe that this
> offload should happen transparently.
>
> Of course, you may want to balance resources as you said the hw capacity
> is limited. But I would leave that optional. API unknown so far...

You can leave it optional and queue off the rule inserts but its not
very robust if the hardware capacity is smaller than the software
pipeline. This is the case for some of the setups we would like to get
working on Linux. In this case you can't have the transparent offload
because 'tc' or whatever may load a large set of irrelevant rules into
the hardware pipeline consuming the resources for flows that are control
traffic or exception cases. Perhaps worse what gets put into the
hardware is not very predictable.

So I agree we can support both cases we just need the API/feature bits
to turn it on/off. As far as the API I was thinking a "create" table
API would be sufficient. The kernel could request a "tc" table of size
1k and a "route" table of 4k for example then user space applications
could manage any un-reserved tables. I'm experimenting a bit with how
to make this work now. Maybe "create" should be "reserve".

.John

>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 21:45     ` Florian Fainelli
  2015-02-26 23:06       ` John Fastabend
@ 2015-02-27 18:37       ` Neil Horman
  1 sibling, 0 replies; 53+ messages in thread
From: Neil Horman @ 2015-02-27 18:37 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: John Fastabend, Jiri Pirko, netdev, davem, andy, tgraf, dborkman,
	ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma, roopa,
	linville, simon.horman, shrijeet, gospo, bcrl

On Thu, Feb 26, 2015 at 01:45:37PM -0800, Florian Fainelli wrote:
> On 26/02/15 12:58, John Fastabend wrote:
> > On 02/26/2015 11:32 AM, Florian Fainelli wrote:
> >> Hi Jiri,
> >>
> >> On 25/02/15 23:42, Jiri Pirko wrote:
> >>> Hello everyone.
> >>>
> >>> I would like to discuss big next step for switch offloading. Probably
> >>> the most complicated one we have so far. That is to be able to offload flows.
> >>> Leaving nftables aside for a moment, I see 2 big usecases:
> >>> - TC filters and actions offload.
> >>> - OVS key match and actions offload.
> >>>
> >>> I think it might sense to ignore OVS for now. The reason is ongoing efford
> >>> to replace OVS kernel datapath with TC subsystem. After that, OVS offload
> >>> will not longer be needed and we'll get it for free with TC offload
> >>> implementation. So we can focus on TC now.
> >>
> >> What is not necessarily clear to me, is if we leave nftables aside for
> >> now from flow offloading, does that mean the entire flow offloading will
> >> now be controlled and going with the TC subsystem necessarily?
> >>
> >> I am not questioning the choice for TC, I am just wondering if
> >> ultimately there is the need for a lower layer, which is below, such
> >> that both tc and e.g: nftables can benefit from it?
> > 
> > My thinking on this is to use the FlowAPI ndo_ops as the bottom layer.
> > What I would much prefer (having to actually write drivers) is that
> > we have one API to the driver and tc, nft, whatever map onto that API.
> 
> Ok, I think this is indeed the right approach.
> 
> > 
> > Then my driver implements a ndo_set_flow op and a ndo_del_flow op. What
> > I'm working on now is the map from tc onto the flow API I'm hoping this
> > sounds like a good idea to folks.
> 
> Sounds good to me.
> 
> > 
> > Neil, suggested we might need a reservation concept where tc can reserve
> > some space in a TCAM, similarly nft can reserve some space. Also I have
> > applications in user space that want to reserve some space to offload
> > their specific data structures. This idea seems like a good one to me.
> 
> Humm, I guess the question is how and when do we do this reservation, is
> it upon first potential access from e.g: tc or nft to an offloading
> capable hardware, and if so, upon first attempt to offload an operation?
> 
I think we do this using administrative direction.  It seems to me like the
approach would be to enhance tools like iproute2 to indicate the desire to
offload various functions to hardware.  That is to say, I could envision a
command like the following:

tc offload dev eth0 enable policy strict 

This would cause the tc subsytem to call through johns flow api to reserve
whatever hw dataplace resources are needed to fully offload all the tc
qdiscs/actions/filters to the hardware, using the strict policy (strict being a
made up token to indicate a policy in which the entirety of the tc state for
that device should be moved into the hardware or fail).

This can likewise be done with l2 forarding (via the bridge command) or l3
forwarding (via the ip command).

> If we are to interface with a TCAM, some operations might require more
> slices than others, which will limit the number of actions available,
> but it is hard to know ahead of time.
> 
Yes, thats correct, resource size estimations will have to be made by the
administrator, and might lead to failed offload attempts, or under-utilized
hardware.  But I think thats the price we pay for having higher level
functionality offloaded.  If someone wants to be more efficient, then they use
the low level flow api to get better performance (and deal with any of the
behavioral quirks that might arise)

Neil

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-26 18:15         ` Tom Herbert
  2015-02-26 19:05           ` Thomas Graf
  2015-02-27  9:00           ` Jiri Pirko
@ 2015-02-28 20:02           ` David Miller
  2015-02-28 21:31             ` Jiri Pirko
  2 siblings, 1 reply; 53+ messages in thread
From: David Miller @ 2015-02-28 20:02 UTC (permalink / raw)
  To: therbert
  Cc: jiri, simon.horman, netdev, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, jpettit, joestringer, john.r.fastabend, jhs,
	sfeldma, f.fainelli, roopa, linville, shrijeet, gospo, bcrl

From: Tom Herbert <therbert@google.com>
Date: Thu, 26 Feb 2015 10:15:24 -0800

> But, routing (aka switching) in the stack is not configured through
> TC. We have a whole forwarding and routing infrastructure (eg.
> iproute) with optimizations that  allow routes to be cached in
> sockets, etc. To me, it seems like offloading that basic functionality
> is a prerequisite before attempting to offload more advanced policy
> mechanisms of TC, netfilter, etc.

+1

I think this is the most important post in this entire thread.

The current proposal here is jumping the gun by several weeks if
not months.

We have no idea what roadblocks or design barriers we will hit
with just plain L2 and L3 forwarding yet.

Therefore it is premature to move over to making major decisions
about flow offloading before we've made more progress with the
more fundamental (and in my opinion much more important) offload
facilities.

Jiri, if you want to implement an example set of classifiers in
software that can completely replace openvswitch's datapath, that's
fine.  And it's a completely independant task to designing how we'll
offload flows to hardware in the future.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-28 20:02           ` David Miller
@ 2015-02-28 21:31             ` Jiri Pirko
  0 siblings, 0 replies; 53+ messages in thread
From: Jiri Pirko @ 2015-02-28 21:31 UTC (permalink / raw)
  To: David Miller
  Cc: therbert, simon.horman, netdev, nhorman, andy, tgraf, dborkman,
	ogerlitz, jesse, jpettit, joestringer, john.r.fastabend, jhs,
	sfeldma, f.fainelli, roopa, linville, shrijeet, gospo, bcrl

Sat, Feb 28, 2015 at 09:02:33PM CET, davem@davemloft.net wrote:
>From: Tom Herbert <therbert@google.com>
>Date: Thu, 26 Feb 2015 10:15:24 -0800
>
>> But, routing (aka switching) in the stack is not configured through
>> TC. We have a whole forwarding and routing infrastructure (eg.
>> iproute) with optimizations that  allow routes to be cached in
>> sockets, etc. To me, it seems like offloading that basic functionality
>> is a prerequisite before attempting to offload more advanced policy
>> mechanisms of TC, netfilter, etc.
>
>+1
>
>I think this is the most important post in this entire thread.
>
>The current proposal here is jumping the gun by several weeks if
>not months.
>
>We have no idea what roadblocks or design barriers we will hit
>with just plain L2 and L3 forwarding yet.
>
>Therefore it is premature to move over to making major decisions
>about flow offloading before we've made more progress with the
>more fundamental (and in my opinion much more important) offload
>facilities.

Fair enough. I just felt that the flows are happening just now
(John's work) so I wanted people to stay in focus. But I completely
agree that plain L2, L3 forwarding should be handled first. Then some
of the quirks connected with that, flows and other stuff after.

>
>Jiri, if you want to implement an example set of classifiers in
>software that can completely replace openvswitch's datapath, that's
>fine.  And it's a completely independant task to designing how we'll
>offload flows to hardware in the future.

Nod. Will do that.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Flows! Offload them.
  2015-02-27  8:41               ` Thomas Graf
  2015-02-27 12:59                 ` Neil Horman
@ 2015-03-01  9:36                 ` Arad, Ronen
  2015-03-01 14:05                   ` Neil Horman
  2015-03-01  9:47                 ` Arad, Ronen
  2 siblings, 1 reply; 53+ messages in thread
From: Arad, Ronen @ 2015-03-01  9:36 UTC (permalink / raw)
  To: Thomas Graf, Neil Horman, netdev
  Cc: Simon Horman, Fastabend, John R, Jiri Pirko, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl



>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Thomas Graf
>Sent: Friday, February 27, 2015 12:42 AM
>To: Neil Horman
>Cc: Simon Horman; Fastabend, John R; Jiri Pirko; netdev@vger.kernel.org;
>davem@davemloft.net; andy@greyhouse.net; dborkman@redhat.com;
>ogerlitz@mellanox.com; jesse@nicira.com; jpettit@nicira.com;
>joestringer@nicira.com; jhs@mojatatu.com; sfeldma@gmail.com;
>f.fainelli@gmail.com; roopa@cumulusnetworks.com; linville@tuxdriver.com;
>shrijeet@gmail.com; gospo@cumulusnetworks.com; bcrl@kvack.org
>Subject: Re: Flows! Offload them.
>
>Blind random offload of some packets is better than nothing but knowing
>and having control over which packets are offloaded is essential. You
>typically don't want to randomly give one flow priority over another ;-)
>Some software CPUs might not be able to handle the load. I know what
>you mean though and as long as we allow to disable and overwrite this
>behaviour we are good.
>

Random offloading of flows does not preserve policy in general.
For example let's consider two flows A.B.C.0/24 and A.B.C.D with the more
specific rule has higher priority.
If only the first rule is offloaded to HW, packets matched by the second
rule will not be processed as expected by the user.
Deciding which flow could be offloaded is an optimization decision that
is better handled outside of the kernel. 
  
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: Flows! Offload them.
  2015-02-27  8:41               ` Thomas Graf
  2015-02-27 12:59                 ` Neil Horman
  2015-03-01  9:36                 ` Arad, Ronen
@ 2015-03-01  9:47                 ` Arad, Ronen
  2015-03-01 17:20                   ` Neil Horman
  2 siblings, 1 reply; 53+ messages in thread
From: Arad, Ronen @ 2015-03-01  9:47 UTC (permalink / raw)
  To: Thomas Graf, Neil Horman, netdev
  Cc: Simon Horman, Fastabend, John R, Jiri Pirko, davem, andy,
	dborkman, ogerlitz, jesse, jpettit, joestringer, jhs, sfeldma,
	f.fainelli, roopa, linville, shrijeet, gospo, bcrl



>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Thomas Graf
>Sent: Friday, February 27, 2015 12:42 AM
>To: Neil Horman
>Cc: Simon Horman; Fastabend, John R; Jiri Pirko; netdev@vger.kernel.org;
>davem@davemloft.net; andy@greyhouse.net; dborkman@redhat.com;
>ogerlitz@mellanox.com; jesse@nicira.com; jpettit@nicira.com;
>joestringer@nicira.com; jhs@mojatatu.com; sfeldma@gmail.com;
>f.fainelli@gmail.com; roopa@cumulusnetworks.com; linville@tuxdriver.com;
>shrijeet@gmail.com; gospo@cumulusnetworks.com; bcrl@kvack.org
>Subject: Re: Flows! Offload them.
>
>
>Maybe I'm misunderstanding your statement here but I think it's essential
>that the kernel is able to handle whatever we program in hardware even
>if the hardware tables look differrent than the software tables, no matter
>whether the configuration occurs through OVS or not. A punt to software
>should always work even if it does not happen. So while I believe that
>OVS needs more control over the hardware than available through the
>datapath cache it must program both the hardware and software in parallel
>even though the building blocks for doing so might look different.
>

I believe that having an equivalent punt path should be optional and
controlled by application policy. Some applications might give up on punt
path due to its throughput implication and prefer to just drop in HW and 
possibly leak some packets to software for exception processing and logging
only. 

>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-01  9:36                 ` Arad, Ronen
@ 2015-03-01 14:05                   ` Neil Horman
  2015-03-02 14:16                     ` Jamal Hadi Salim
  0 siblings, 1 reply; 53+ messages in thread
From: Neil Horman @ 2015-03-01 14:05 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Thomas Graf, netdev, Simon Horman, Fastabend, John R, Jiri Pirko,
	davem, andy, dborkman, ogerlitz, jesse, jpettit, joestringer,
	jhs, sfeldma, f.fainelli, roopa, linville, shrijeet, gospo,
	bcrl@kvack.org

On Sun, Mar 01, 2015 at 09:36:43AM +0000, Arad, Ronen wrote:
> 
> 
> >-----Original Message-----
> >From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> >Behalf Of Thomas Graf
> >Sent: Friday, February 27, 2015 12:42 AM
> >To: Neil Horman
> >Cc: Simon Horman; Fastabend, John R; Jiri Pirko; netdev@vger.kernel.org;
> >davem@davemloft.net; andy@greyhouse.net; dborkman@redhat.com;
> >ogerlitz@mellanox.com; jesse@nicira.com; jpettit@nicira.com;
> >joestringer@nicira.com; jhs@mojatatu.com; sfeldma@gmail.com;
> >f.fainelli@gmail.com; roopa@cumulusnetworks.com; linville@tuxdriver.com;
> >shrijeet@gmail.com; gospo@cumulusnetworks.com; bcrl@kvack.org
> >Subject: Re: Flows! Offload them.
> >
> >Blind random offload of some packets is better than nothing but knowing
> >and having control over which packets are offloaded is essential. You
> >typically don't want to randomly give one flow priority over another ;-)
> >Some software CPUs might not be able to handle the load. I know what
> >you mean though and as long as we allow to disable and overwrite this
> >behaviour we are good.
> >
> 
> Random offloading of flows does not preserve policy in general.
> For example let's consider two flows A.B.C.0/24 and A.B.C.D with the more
> specific rule has higher priority.
> If only the first rule is offloaded to HW, packets matched by the second
> rule will not be processed as expected by the user.
> Deciding which flow could be offloaded is an optimization decision that
> is better handled outside of the kernel. 
>   
Regarding the description of offload:
Random is I think a improper term here.  There is nothing random about flow
offload.  There is only the reality of needing to select which rules to offload,
as it is inevitable that hardware dataplane capacity won't/can't match that of
software limits.  All we can do is select which of those flows are offloaded.
When dealing with offload in the context of an existing sofware function, the
constraints of selecting which rules those are become fairly clear.  For example
with routing:
1) More specific rules get offloaded first
2) on overflow, you replace a rule that conforms to a pre-packaged policy (ex.
LRU), and iff it doesn't violate (1)
Theres nothing random about that.  The selection policy can make replacement as
deterministic as you like (though if you're more interested in just offloading
functionality, determinism of rules is less important, than just optimizing the
performance of the offloaded function within the above constraints).

Regarding offload selection:
I disagree.  you're correct when the key match is a multiple word tuple that has
no real context with in the kernel, but for the case you describe (l3
forwarding), the policy can be made fairly clearly.  All you need to do is
offload the more specific rule first.  Thats it.  If that doesn't give you the
offload performance you're looking for because theres not space for both rules
in the hw dataplane, too bad, thats the limitation of doing a kernel-function
mirroring to the hardware.  Hopefully that will get better as hardware dataplane
capacity improves.  Until then, thats what Johns Flow API is for.  It should
live in parallel with such a functional offload, so that if those limitations
are unacceptable to you, you can go off the reservation, and customize the
hardware behavior.

Neil

> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-01  9:47                 ` Arad, Ronen
@ 2015-03-01 17:20                   ` Neil Horman
  0 siblings, 0 replies; 53+ messages in thread
From: Neil Horman @ 2015-03-01 17:20 UTC (permalink / raw)
  To: Arad, Ronen
  Cc: Thomas Graf, netdev, Simon Horman, Fastabend, John R, Jiri Pirko,
	davem, andy, dborkman, ogerlitz, jesse, jpettit, joestringer,
	jhs, sfeldma, f.fainelli, roopa, linville, shrijeet, gospo,
	bcrl@kvack.org

On Sun, Mar 01, 2015 at 09:47:46AM +0000, Arad, Ronen wrote:
> 
> 
> >-----Original Message-----
> >From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
> >Behalf Of Thomas Graf
> >Sent: Friday, February 27, 2015 12:42 AM
> >To: Neil Horman
> >Cc: Simon Horman; Fastabend, John R; Jiri Pirko; netdev@vger.kernel.org;
> >davem@davemloft.net; andy@greyhouse.net; dborkman@redhat.com;
> >ogerlitz@mellanox.com; jesse@nicira.com; jpettit@nicira.com;
> >joestringer@nicira.com; jhs@mojatatu.com; sfeldma@gmail.com;
> >f.fainelli@gmail.com; roopa@cumulusnetworks.com; linville@tuxdriver.com;
> >shrijeet@gmail.com; gospo@cumulusnetworks.com; bcrl@kvack.org
> >Subject: Re: Flows! Offload them.
> >
> >
> >Maybe I'm misunderstanding your statement here but I think it's essential
> >that the kernel is able to handle whatever we program in hardware even
> >if the hardware tables look differrent than the software tables, no matter
> >whether the configuration occurs through OVS or not. A punt to software
> >should always work even if it does not happen. So while I believe that
> >OVS needs more control over the hardware than available through the
> >datapath cache it must program both the hardware and software in parallel
> >even though the building blocks for doing so might look different.
> >
> 
> I believe that having an equivalent punt path should be optional and
> controlled by application policy. Some applications might give up on punt
> path due to its throughput implication and prefer to just drop in HW and 
> possibly leak some packets to software for exception processing and logging
> only. 
> 
Thats only one use case.  Having a software fallback path implemented by, and
gated by, application policy is fine for some newer application, but there is a
huge legacy set of applications available today, that relies on kernel
functionality for dataplane forwarding.  For this use case,  what we want/need
is the in-kernel dataplane to be opportunistically offloaded to whatever degree
possible, based on administratively assigned policy, and have that happen
transparently to user functionality.  The trade off there is that we don't
always have control over what exactly gets offloaded.

Thats why we need both API's.  Something like the flow API that can offer fine
grained control for applications that are new and willing to understand more
about the hardware they're using, and something at a kernel functional
granularity that provides legacy acceleration with all the tradeoffs that
entails.

Neil

> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27 15:23             ` John Fastabend
@ 2015-03-02 13:45               ` Jamal Hadi Salim
  0 siblings, 0 replies; 53+ messages in thread
From: Jamal Hadi Salim @ 2015-03-02 13:45 UTC (permalink / raw)
  To: John Fastabend, Shrijeet Mukherjee, Thomas Graf
  Cc: Jiri Pirko, Simon Horman, netdev, davem, nhorman, andy, dborkman,
	ogerlitz, jesse, jpettit, joestringer, sfeldma, f.fainelli,
	roopa, linville, gospo, bcrl

Hi John,

On 02/27/15 10:23, John Fastabend wrote:
> On 02/27/2015 05:33 AM, Jamal Hadi Salim wrote:


>> tc filter add ingress ethx classifier foo priority X \
>> match xyz action redirect macvlan3 offload
>>
>> where "offload" sets the netlink or classifier specific instruction
>> to offload.
>>
>> You can easily map macvlan3 to vsi 3.
>>
>
> This is why I said if you want to map my "flow" tool onto "tc" it can
> be done. :) I made a jump from macvlans that have net device representations
> to VFs being assigned to user space (VMs) where there is no net device
> to "redirect" to. So my explanation wasn't clear. Couple additional notes,
>
> The only issue I have with your 'tc' case is the "ingress" qdisc is per port where
> my flow tool is scoped at the switch pipeline so I would modify your cmd
> slightly and use a new classifier hook,
>
>   tc filter add sw-pipeline ethx classifier flow priority X \
>      match xyz action redirect macvlan3 offload
>

For hardware, we are talking abstractions, no? i.e it doesnt have
to be 1-1 mapping. IOW, It shouldnt matter how the hardware table
looks like as long as the kernel abstractions work on it.
There are two hooks:
either ingress or egress. These can be expressed in the kernel via
policy. Most hardware supports either one or the other.
IOW, I think the expressability in software is sufficient. In the
h/ware you are referring to - it is ingress.

> The other thing to note is "redirect" to macvlan3 doesn't handle the case
> where the VF is direct assigned to a VM in this case we don't have a netdev
> in the hypervisor to "redirect" to. Using the VSI# allows the hardware
> to translate the redirect cmd to the VF.
>

I agree - this is an interesting exception from a policy expression
view.

> So one more slight tweak,
>
>   tc filter add sw-pipeline ethx classifier flow priority X \
>      match xyz action redirect vsi:2 offload
>
> use the "vsi:" prefix to indicate this is for a hardware mapped VSI this
> way the user can use what ever notion is more convenient.
>

we do need the namespace separation. The concept of a "netdev" that
exists in hardware but is hidden at the hypervisor level is something
that needs to be addressed. I am still in love with concept of
abstracting these things as netdevs.
The question is how do you map which "vsi:" to which netdev on which
VM?

cheers,
jamal

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-02-27  1:52               ` Tom Herbert
@ 2015-03-02 13:49                 ` Andy Gospodarek
  2015-03-02 16:54                   ` Scott Feldman
  0 siblings, 1 reply; 53+ messages in thread
From: Andy Gospodarek @ 2015-03-02 13:49 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Neil Horman, Simon Horman, John Fastabend, Thomas Graf,
	Jiri Pirko, Linux Netdev List, David Miller, Andy Gospodarek,
	Daniel Borkmann, Or Gerlitz, Jesse Gross, jpettit, Joe Stringer,
	Jamal Hadi Salim, Scott Feldman, Florian Fainelli, Roopa Prabhu,
	John Linville, Shrijeet Mukherjee, bcrl

On Thu, Feb 26, 2015 at 05:52:16PM -0800, Tom Herbert wrote:
> On Thu, Feb 26, 2015 at 5:22 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
[...]
> > Yes, exactly that, for the general traditional networking use case, that is
> > exactly what we want, to opportunistically move traffic faster with less load on
> > the cpu.  We don't nominally care what traffic is offloaded, as long as the
> > hardware does a better job than just software alone.  If we get an occasional
> > miss and have to do stuff in software, so be it.
> >
> +1 on an in kernel "Network Resource Manager". This also came up in
> Sunil's plan to configure RPS affinities from a driver so I'm taking
> liberty by generalizing the concept :-).

I agree completely that there is a need for what you both describe.  Not
only to handle what items to offload for users looking for that level
of granularity, but also to allow driver implementers to decide if their
hardware/driver implementation may allow for async write of data to
hardware tables, and any other implementation specific detail they may
want to provide.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-01 14:05                   ` Neil Horman
@ 2015-03-02 14:16                     ` Jamal Hadi Salim
  0 siblings, 0 replies; 53+ messages in thread
From: Jamal Hadi Salim @ 2015-03-02 14:16 UTC (permalink / raw)
  To: Neil Horman, Arad, Ronen
  Cc: Thomas Graf, netdev, Simon Horman, Fastabend, John R, Jiri Pirko,
	davem, andy, dborkman, ogerlitz, jesse, jpettit, joestringer,
	sfeldma, f.fainelli, roopa, linville, shrijeet, gospo, bcrl,
	Aviad Raveh

On 03/01/15 09:05, Neil Horman wrote:
> On Sun, Mar 01, 2015 at 09:36:43AM +0000, Arad, Ronen wrote:
>>

>> Random offloading of flows does not preserve policy in general.
>> For example let's consider two flows A.B.C.0/24 and A.B.C.D with the more
>> specific rule has higher priority.
>> If only the first rule is offloaded to HW, packets matched by the second
>> rule will not be processed as expected by the user.
>> Deciding which flow could be offloaded is an optimization decision that
>> is better handled outside of the kernel.
>>
> Regarding the description of offload:
> Random is I think a improper term here.  There is nothing random about flow
> offload.  There is only the reality of needing to select which rules to offload,
> as it is inevitable that hardware dataplane capacity won't/can't match that of
> software limits.  All we can do is select which of those flows are offloaded.
> When dealing with offload in the context of an existing sofware function, the
> constraints of selecting which rules those are become fairly clear.  For example
> with routing:
> 1) More specific rules get offloaded first
> 2) on overflow, you replace a rule that conforms to a pre-packaged policy (ex.
> LRU), and iff it doesn't violate (1)


Sounds sane as a starting point. Lets just talk L3 since this applies
to any tcam approach (and expense of repeating things already discussed)
For hook #2 Dave proposes in other email to just flush everything and do
things in s/ware at that point for initial implementation.

[Most people would try to aggregate  things into a /24 for example to
make space for more /32s. And theres all sorts of crazy shit to amortize
tcam space that people do]

But this brings in another requirement: CCing the Mellanox folks who
brought up this  issue numerous times.
Things tend to "freeze" at that point while the kernel is madly moving
things around (as would be the case with Dave's suggested "move all
to software"). At netdev01, Aviad was suggesting that the kernel
respond first with a netlink message "work in progress" or even lie
with "success" if it knows it will accomplish this goal.
Ccing Sol who had a different reason for waiting seconds long for things
to be inserted into hardware.


cheers,
jamal

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-02 13:49                 ` Andy Gospodarek
@ 2015-03-02 16:54                   ` Scott Feldman
  2015-03-02 18:06                     ` Andy Gospodarek
       [not found]                     ` <CAGpadYEC3-5AdkOG66q0vX+HM0c6EU-C0ZT=sKGe7rZRHsYYKg@mail.gmail.com>
  0 siblings, 2 replies; 53+ messages in thread
From: Scott Feldman @ 2015-03-02 16:54 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Tom Herbert, Neil Horman, Simon Horman, John Fastabend,
	Thomas Graf, Jiri Pirko, Linux Netdev List, David Miller,
	Andy Gospodarek, Daniel Borkmann, Or Gerlitz, Jesse Gross,
	jpettit, Joe Stringer, Jamal Hadi Salim, Florian Fainelli,
	Roopa Prabhu, John Linville, Shrijeet Mukherjee,
	Benjamin LaHaise

On Mon, Mar 2, 2015 at 5:49 AM, Andy Gospodarek
<gospo@cumulusnetworks.com> wrote:
> On Thu, Feb 26, 2015 at 05:52:16PM -0800, Tom Herbert wrote:
>> On Thu, Feb 26, 2015 at 5:22 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> [...]
>> > Yes, exactly that, for the general traditional networking use case, that is
>> > exactly what we want, to opportunistically move traffic faster with less load on
>> > the cpu.  We don't nominally care what traffic is offloaded, as long as the
>> > hardware does a better job than just software alone.  If we get an occasional
>> > miss and have to do stuff in software, so be it.
>> >
>> +1 on an in kernel "Network Resource Manager". This also came up in
>> Sunil's plan to configure RPS affinities from a driver so I'm taking
>> liberty by generalizing the concept :-).
>
> I agree completely that there is a need for what you both describe.  Not
> only to handle what items to offload for users looking for that level
> of granularity, but also to allow driver implementers to decide if their
> hardware/driver implementation may allow for async write of data to
> hardware tables, and any other implementation specific detail they may
> want to provide.

Can you elaborate on "allow for async write of data to hardware
tables"?  Is this the trampoline model where user's request goes to
the kernel, and then back to user-space, and finally to the hardware
via an user-space SDK?  I think we should exclude that model from
discussions about resource management.  With the recent L2/L3 offload
work, I'm advocating a synchronous call path from user to kernel to
hardware so we can return a actionable result code, and put the burden
of resource management in user-space, not in the kernel.

-scott

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-02 16:54                   ` Scott Feldman
@ 2015-03-02 18:06                     ` Andy Gospodarek
       [not found]                     ` <CAGpadYEC3-5AdkOG66q0vX+HM0c6EU-C0ZT=sKGe7rZRHsYYKg@mail.gmail.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Andy Gospodarek @ 2015-03-02 18:06 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Tom Herbert, Neil Horman, Simon Horman, John Fastabend,
	Thomas Graf, Jiri Pirko, Linux Netdev List, David Miller,
	Andy Gospodarek, Daniel Borkmann, Or Gerlitz, Jesse Gross,
	jpettit, Joe Stringer, Jamal Hadi Salim, Florian Fainelli,
	Roopa Prabhu, John Linville, Shrijeet Mukherjee,
	Benjamin LaHaise

On Mon, Mar 02, 2015 at 08:54:06AM -0800, Scott Feldman wrote:
> On Mon, Mar 2, 2015 at 5:49 AM, Andy Gospodarek
> <gospo@cumulusnetworks.com> wrote:
> > On Thu, Feb 26, 2015 at 05:52:16PM -0800, Tom Herbert wrote:
> >> On Thu, Feb 26, 2015 at 5:22 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> > [...]
> >> > Yes, exactly that, for the general traditional networking use case, that is
> >> > exactly what we want, to opportunistically move traffic faster with less load on
> >> > the cpu.  We don't nominally care what traffic is offloaded, as long as the
> >> > hardware does a better job than just software alone.  If we get an occasional
> >> > miss and have to do stuff in software, so be it.
> >> >
> >> +1 on an in kernel "Network Resource Manager". This also came up in
> >> Sunil's plan to configure RPS affinities from a driver so I'm taking
> >> liberty by generalizing the concept :-).
> >
> > I agree completely that there is a need for what you both describe.  Not
> > only to handle what items to offload for users looking for that level
> > of granularity, but also to allow driver implementers to decide if their
> > hardware/driver implementation may allow for async write of data to
> > hardware tables, and any other implementation specific detail they may
> > want to provide.
> 
> Can you elaborate on "allow for async write of data to hardware
> tables"?  Is this the trampoline model where user's request goes to
> the kernel, and then back to user-space, and finally to the hardware
> via an user-space SDK?I think we should exclude that model from
> discussions about resource management.  With the recent L2/L3 offload
> work, I'm advocating a synchronous call path from user to kernel to
> hardware so we can return a actionable result code, and put the burden
> of resource management in user-space, not in the kernel.

No, this would not be to enable a trampoline driver.  While that is
interesting to some, developing explicit infra upstream to enable such a
thing is not interesting to me.

This comes on the heels of recent discussions I've had with more than
one vendor who have discussed the algorithmic complexity that often
exists in their driver and the concern about whether or not they could
allow some calls down to hardware to be async when it is known that
there is no real pressure on the available space in any given table.

When a driver would alert that limited space would be available, I would
see these calls going synchronous to be positive that they would be
allowed to complete and to allow a more standard error path.  Since
behavior might change slightly depending on available resources, it
seemed like a good topic to discuss when discussing a resource manager.  

This may be an optimization that is not needed in the base version of a
resource manager and if that is the case, that is fine with me. :)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
       [not found]                     ` <CAGpadYEC3-5AdkOG66q0vX+HM0c6EU-C0ZT=sKGe7rZRHsYYKg@mail.gmail.com>
@ 2015-03-02 22:13                       ` Scott Feldman
  2015-03-02 22:43                         ` Andy Gospodarek
  0 siblings, 1 reply; 53+ messages in thread
From: Scott Feldman @ 2015-03-02 22:13 UTC (permalink / raw)
  To: Shrijeet Mukherjee
  Cc: Andy Gospodarek, Tom Herbert, Neil Horman, Simon Horman,
	John Fastabend, Thomas Graf, Jiri Pirko, Linux Netdev List,
	David Miller, Andy Gospodarek, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, Jamal Hadi Salim,
	Florian Fainelli, Roopa Prabhu, John Linville, Benjamin LaHaise

On Mon, Mar 2, 2015 at 10:31 AM, Shrijeet Mukherjee <shrijeet@gmail.com> wrote:
>>
>> Can you elaborate on "allow for async write of data to hardware
>> tables"?  Is this the trampoline model where user's request goes to
>> the kernel, and then back to user-space, and finally to the hardware
>> via an user-space SDK?  I think we should exclude that model from
>> discussions about resource management.  With the recent L2/L3 offload
>> work, I'm advocating a synchronous call path from user to kernel to
>> hardware so we can return a actionable result code, and put the burden
>> of resource management in user-space, not in the kernel.
>>
> Scott you mean synchronous to the switchdev driver right ?

Correct.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-02 22:13                       ` Scott Feldman
@ 2015-03-02 22:43                         ` Andy Gospodarek
  2015-03-02 22:49                           ` Florian Fainelli
  0 siblings, 1 reply; 53+ messages in thread
From: Andy Gospodarek @ 2015-03-02 22:43 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Shrijeet Mukherjee, Tom Herbert, Neil Horman, Simon Horman,
	John Fastabend, Thomas Graf, Jiri Pirko, Linux Netdev List,
	David Miller, Andy Gospodarek, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, Jamal Hadi Salim,
	Florian Fainelli, Roopa Prabhu, John Linville, Benjamin LaHaise

On Mon, Mar 02, 2015 at 02:13:35PM -0800, Scott Feldman wrote:
> On Mon, Mar 2, 2015 at 10:31 AM, Shrijeet Mukherjee <shrijeet@gmail.com> wrote:
> >>
> >> Can you elaborate on "allow for async write of data to hardware
> >> tables"?  Is this the trampoline model where user's request goes to
> >> the kernel, and then back to user-space, and finally to the hardware
> >> via an user-space SDK?  I think we should exclude that model from
> >> discussions about resource management.  With the recent L2/L3 offload
> >> work, I'm advocating a synchronous call path from user to kernel to
> >> hardware so we can return a actionable result code, and put the burden
> >> of resource management in user-space, not in the kernel.
> >>
> > Scott you mean synchronous to the switchdev driver right ?
> 
> Correct.

So while I agree with you that it would be ideal to be synchronous all
the way to the hardware (and that is what I continue to tell vendors
they should do) you would like to see each driver individually manage
whether or not they chose to make these calls async and handle the
fallout in their own?

I'm fine with that for now since Dave's opinion on most of this is that
we need to "keep it simple" for now (he said that once already today),
but I think it is something we should consider for the future.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Flows! Offload them.
  2015-03-02 22:43                         ` Andy Gospodarek
@ 2015-03-02 22:49                           ` Florian Fainelli
  0 siblings, 0 replies; 53+ messages in thread
From: Florian Fainelli @ 2015-03-02 22:49 UTC (permalink / raw)
  To: Andy Gospodarek, Scott Feldman
  Cc: Shrijeet Mukherjee, Tom Herbert, Neil Horman, Simon Horman,
	John Fastabend, Thomas Graf, Jiri Pirko, Linux Netdev List,
	David Miller, Andy Gospodarek, Daniel Borkmann, Or Gerlitz,
	Jesse Gross, jpettit, Joe Stringer, Jamal Hadi Salim,
	Roopa Prabhu, John Linville, Benjamin LaHaise

On 02/03/15 14:43, Andy Gospodarek wrote:
> On Mon, Mar 02, 2015 at 02:13:35PM -0800, Scott Feldman wrote:
>> On Mon, Mar 2, 2015 at 10:31 AM, Shrijeet Mukherjee <shrijeet@gmail.com> wrote:
>>>>
>>>> Can you elaborate on "allow for async write of data to hardware
>>>> tables"?  Is this the trampoline model where user's request goes to
>>>> the kernel, and then back to user-space, and finally to the hardware
>>>> via an user-space SDK?  I think we should exclude that model from
>>>> discussions about resource management.  With the recent L2/L3 offload
>>>> work, I'm advocating a synchronous call path from user to kernel to
>>>> hardware so we can return a actionable result code, and put the burden
>>>> of resource management in user-space, not in the kernel.
>>>>
>>> Scott you mean synchronous to the switchdev driver right ?
>>
>> Correct.
> 
> So while I agree with you that it would be ideal to be synchronous all
> the way to the hardware (and that is what I continue to tell vendors
> they should do) you would like to see each driver individually manage
> whether or not they chose to make these calls async and handle the
> fallout in their own?

I am not sure I follow you here, the way I would see it is something
like this:

SWITCHDEV	DRIVER
ndo_foo ->
		foo_op ->
			bus_op() ->
			# wait for interrupt
			complete()
ndo_foo 	<- return

at which point, the only things that matters is that the switch driver
ensures calls serialization at its own level (concurrent HW registers
access), eventually working under the assumption that ndo_foo() is also
called with appropriate locking (rtnl) to prevent concurrent operations
from occurring.

If the model is more like the switch driver is responsible for locally
queuing multiple commands, then I agree, we might not want that at all.

-- 
Florian

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2015-03-02 22:50 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-26  7:42 Flows! Offload them Jiri Pirko
2015-02-26  8:38 ` Simon Horman
2015-02-26  9:16   ` Jiri Pirko
2015-02-26 13:33     ` Thomas Graf
2015-02-26 15:23       ` John Fastabend
2015-02-26 20:16         ` Neil Horman
2015-02-26 21:11           ` John Fastabend
2015-02-27  1:17             ` Neil Horman
2015-02-27  8:53             ` Jiri Pirko
2015-02-27 16:00               ` John Fastabend
2015-02-26 21:52           ` Simon Horman
2015-02-27  1:22             ` Neil Horman
2015-02-27  1:52               ` Tom Herbert
2015-03-02 13:49                 ` Andy Gospodarek
2015-03-02 16:54                   ` Scott Feldman
2015-03-02 18:06                     ` Andy Gospodarek
     [not found]                     ` <CAGpadYEC3-5AdkOG66q0vX+HM0c6EU-C0ZT=sKGe7rZRHsYYKg@mail.gmail.com>
2015-03-02 22:13                       ` Scott Feldman
2015-03-02 22:43                         ` Andy Gospodarek
2015-03-02 22:49                           ` Florian Fainelli
2015-02-27  8:41               ` Thomas Graf
2015-02-27 12:59                 ` Neil Horman
2015-03-01  9:36                 ` Arad, Ronen
2015-03-01 14:05                   ` Neil Horman
2015-03-02 14:16                     ` Jamal Hadi Salim
2015-03-01  9:47                 ` Arad, Ronen
2015-03-01 17:20                   ` Neil Horman
     [not found]       ` <CAGpadYGrjfkZqe0k7D05+cy3pY=1hXZtQqtV0J-8ogU80K7BUQ@mail.gmail.com>
2015-02-26 15:39         ` John Fastabend
     [not found]           ` <CAGpadYHfNcDR2ojubkCJ8-nJTQkdLkPsAwJu0wOKU82bLDzhww@mail.gmail.com>
2015-02-26 16:33             ` Thomas Graf
2015-02-26 16:53             ` John Fastabend
2015-02-27 13:33           ` Jamal Hadi Salim
2015-02-27 15:23             ` John Fastabend
2015-03-02 13:45               ` Jamal Hadi Salim
2015-02-26 17:38       ` David Ahern
2015-02-26 16:04     ` Tom Herbert
2015-02-26 16:17       ` Jiri Pirko
2015-02-26 18:15         ` Tom Herbert
2015-02-26 19:05           ` Thomas Graf
2015-02-27  9:00           ` Jiri Pirko
2015-02-28 20:02           ` David Miller
2015-02-28 21:31             ` Jiri Pirko
2015-02-26 18:16       ` Scott Feldman
2015-02-26 11:22 ` Sowmini Varadhan
2015-02-26 11:39   ` Jiri Pirko
2015-02-26 15:42     ` Sowmini Varadhan
2015-02-27 13:15     ` Named sockets WAS(Re: " Jamal Hadi Salim
2015-02-26 12:51 ` Thomas Graf
2015-02-26 13:17   ` Jiri Pirko
2015-02-26 19:32 ` Florian Fainelli
2015-02-26 20:58   ` John Fastabend
2015-02-26 21:45     ` Florian Fainelli
2015-02-26 23:06       ` John Fastabend
2015-02-27 18:37       ` Neil Horman
2015-02-27 14:01     ` Driver level interface WAS(Re: " Jamal Hadi Salim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.