All of lore.kernel.org
 help / color / mirror / Atom feed
* Let's do P4
@ 2016-10-29  7:53 Jiri Pirko
  2016-10-29  9:39 ` Thomas Graf
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-29  7:53 UTC (permalink / raw)
  To: netdev
  Cc: davem, tgraf, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

Hi all.

The network world is divided into 2 general types of hw:
1) network ASICs - network specific silicon, containing things like TCAM
   These ASICs are suitable to be programmed by P4.
2) network processors - basically a general purpose CPUs
   These processors are suitable to be programmed by eBPF.

I believe that by now, the most people came to a conclusion that it is
very difficult to handle both types by either P4 or eBPF. And since
eBPF is part of the kernel, I would like to introduce P4 into kernel
as well. Here's a plan:

1) Define P4 intermediate representation
   I cannot imagine loading P4 program (c-like syntax text file) into
   kernel as is. That means that as the first step, we need find some
   intermediate representation. I can imagine someting in a form of AST,
   call it "p4ast". I don't really know how to do this exactly though,
   it's just an idea.

   In the end there would be a userspace precompiler for this:
   $ makep4ast example.p4 example.ast

2) Implement p4ast in-kernel interpreter 
   A kernel module which takes a p4ast and emulates the pipeline.
   This can be implemented from scratch. Or, p4ast could be compiled
   to eBPF. I know there are already couple of p4>eBPF compilers.
   Not sure how feasible it would be to put this compiler in kernel.

3) Expose the p4ast in-kernel interpreter to userspace
   As the easiest way I see in to introduce a new TC classifier cls_p4.

   This can work in a very similar way cls_bpf is:
   $ tc filter add dev eth0 ingress p4 da ast example.ast

   The TC cls_p4 will be also used for runtime table manipulation.

4) Offload p4ast programs into hardware
   The same p4ast program representation will be passed down
   to drivers via existing TC offloading way - ndo_setup_tc.
   Drivers will then parse it and setup the hardware
   accordingly. Driver will also have possibility to error out
   in case it does not support some requested feature.

Thoughts? Ideas?

Thanks,
	Jiri

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29  7:53 Let's do P4 Jiri Pirko
@ 2016-10-29  9:39 ` Thomas Graf
  2016-10-29 10:10   ` Jiri Pirko
  2016-10-29 14:49 ` Jakub Kicinski
  2016-11-01 11:57 ` Jamal Hadi Salim
  2 siblings, 1 reply; 41+ messages in thread
From: Thomas Graf @ 2016-10-29  9:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

On 10/29/16 at 09:53am, Jiri Pirko wrote:
> Hi all.
> 
> The network world is divided into 2 general types of hw:
> 1) network ASICs - network specific silicon, containing things like TCAM
>    These ASICs are suitable to be programmed by P4.
> 2) network processors - basically a general purpose CPUs
>    These processors are suitable to be programmed by eBPF.
> 
> I believe that by now, the most people came to a conclusion that it is
> very difficult to handle both types by either P4 or eBPF. And since
> eBPF is part of the kernel, I would like to introduce P4 into kernel
> as well. Here's a plan:

For reference, last time I remember we discussed this in the BPF
offload context:
http://www.spinics.net/lists/netdev/msg356178.html

> 1) Define P4 intermediate representation
>    I cannot imagine loading P4 program (c-like syntax text file) into
>    kernel as is. That means that as the first step, we need find some
>    intermediate representation. I can imagine someting in a form of AST,
>    call it "p4ast". I don't really know how to do this exactly though,
>    it's just an idea.
> 
>    In the end there would be a userspace precompiler for this:
>    $ makep4ast example.p4 example.ast
> 
> 2) Implement p4ast in-kernel interpreter 
>    A kernel module which takes a p4ast and emulates the pipeline.
>    This can be implemented from scratch. Or, p4ast could be compiled
>    to eBPF. I know there are already couple of p4>eBPF compilers.
>    Not sure how feasible it would be to put this compiler in kernel.

+1 to using eBPF for emulation. Maybe the compiler doesn't need to be
in the kernel and user space can compile and provide the emulated
pipeline in eBPF directly. See next paragraph for an example where
this could be useful.

> 3) Expose the p4ast in-kernel interpreter to userspace
>    As the easiest way I see in to introduce a new TC classifier cls_p4.
> 
>    This can work in a very similar way cls_bpf is:
>    $ tc filter add dev eth0 ingress p4 da ast example.ast
> 
>    The TC cls_p4 will be also used for runtime table manipulation.

I think this is a great model for the case where HW can provide all
of the required capabilities. Thinking about the case where HW
provides a subset and SW provides an extended version, i.e. the
reality we live in for hosts with ASIC NICs ;-) The hand off point
requires some understanding between p4ast and eBPF.

Therefore another idea would be to use cls_bpf directly for this. The
p4ast IR could be stored in a separate ELF section in the same object
file with an existing eBPF program. The p4ast IR will match the
eBPF prog if capabilities of HW and SW match. If HW is limited, the
p4ast IR represents what the HW can do plus how to pass it to SW. The
eBPF prog contains whatever logic is required to take over if the HW
either bailed out or handed over deliberately. Then on top, all the
missing pieces of functionality which can only be performed in SW.

tc then loads 1) eBPF maps and prog through bpf() syscall
              2) cls_bpf filter with p4ast IR plus ref to prog and
                 maps

> 4) Offload p4ast programs into hardware
>    The same p4ast program representation will be passed down
>    to drivers via existing TC offloading way - ndo_setup_tc.
>    Drivers will then parse it and setup the hardware
>    accordingly. Driver will also have possibility to error out
>    in case it does not support some requested feature.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29  9:39 ` Thomas Graf
@ 2016-10-29 10:10   ` Jiri Pirko
  2016-10-29 11:15     ` Thomas Graf
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-10-29 10:10 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, davem, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

Sat, Oct 29, 2016 at 11:39:05AM CEST, tgraf@suug.ch wrote:
>On 10/29/16 at 09:53am, Jiri Pirko wrote:
>> Hi all.
>> 
>> The network world is divided into 2 general types of hw:
>> 1) network ASICs - network specific silicon, containing things like TCAM
>>    These ASICs are suitable to be programmed by P4.
>> 2) network processors - basically a general purpose CPUs
>>    These processors are suitable to be programmed by eBPF.
>> 
>> I believe that by now, the most people came to a conclusion that it is
>> very difficult to handle both types by either P4 or eBPF. And since
>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>> as well. Here's a plan:
>
>For reference, last time I remember we discussed this in the BPF
>offload context:
>http://www.spinics.net/lists/netdev/msg356178.html
>
>> 1) Define P4 intermediate representation
>>    I cannot imagine loading P4 program (c-like syntax text file) into
>>    kernel as is. That means that as the first step, we need find some
>>    intermediate representation. I can imagine someting in a form of AST,
>>    call it "p4ast". I don't really know how to do this exactly though,
>>    it's just an idea.
>> 
>>    In the end there would be a userspace precompiler for this:
>>    $ makep4ast example.p4 example.ast
>> 
>> 2) Implement p4ast in-kernel interpreter 
>>    A kernel module which takes a p4ast and emulates the pipeline.
>>    This can be implemented from scratch. Or, p4ast could be compiled
>>    to eBPF. I know there are already couple of p4>eBPF compilers.
>>    Not sure how feasible it would be to put this compiler in kernel.
>
>+1 to using eBPF for emulation. Maybe the compiler doesn't need to be
>in the kernel and user space can compile and provide the emulated
>pipeline in eBPF directly. See next paragraph for an example where
>this could be useful.

Ditto.


>
>> 3) Expose the p4ast in-kernel interpreter to userspace
>>    As the easiest way I see in to introduce a new TC classifier cls_p4.
>> 
>>    This can work in a very similar way cls_bpf is:
>>    $ tc filter add dev eth0 ingress p4 da ast example.ast
>> 
>>    The TC cls_p4 will be also used for runtime table manipulation.
>
>I think this is a great model for the case where HW can provide all
>of the required capabilities. Thinking about the case where HW
>provides a subset and SW provides an extended version, i.e. the
>reality we live in for hosts with ASIC NICs ;-) The hand off point
>requires some understanding between p4ast and eBPF.

It can be the other way around. The p4>ebpf compiler won't be complete
at the beginning so it is possible that HW could provide more features.
I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
the user can set different program to each. I think in real life, that
would be the most common case anyway.


>
>Therefore another idea would be to use cls_bpf directly for this. The
>p4ast IR could be stored in a separate ELF section in the same object
>file with an existing eBPF program. The p4ast IR will match the

I don't like this idea. The kernel API should be clean and simple.
Bundling p4ast with bpf.o code, so the bpf.o is for kernel and p4ast is
for driver does not look clean at all. The bundle does not make really
sense as the programs may do different things for BPF and p4.

Plus, it's up to user to set this up like he wants. If he wants SW
processing by BPF and at the same time HW processing by P4, he will use:
cls_bpf instance with SKIP_HW
cls_p4 instance with SKIP_SW.

This is much more variable, clean and non-confusing approach, I believe.


>eBPF prog if capabilities of HW and SW match. If HW is limited, the
>p4ast IR represents what the HW can do plus how to pass it to SW. The
>eBPF prog contains whatever logic is required to take over if the HW
>either bailed out or handed over deliberately. Then on top, all the
>missing pieces of functionality which can only be performed in SW.
>
>tc then loads 1) eBPF maps and prog through bpf() syscall
>              2) cls_bpf filter with p4ast IR plus ref to prog and
>                 maps
>
>> 4) Offload p4ast programs into hardware
>>    The same p4ast program representation will be passed down
>>    to drivers via existing TC offloading way - ndo_setup_tc.
>>    Drivers will then parse it and setup the hardware
>>    accordingly. Driver will also have possibility to error out
>>    in case it does not support some requested feature.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 10:10   ` Jiri Pirko
@ 2016-10-29 11:15     ` Thomas Graf
  2016-10-29 11:28       ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Graf @ 2016-10-29 11:15 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

On 10/29/16 at 12:10pm, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 11:39:05AM CEST, tgraf@suug.ch wrote:
> >On 10/29/16 at 09:53am, Jiri Pirko wrote:
> >> 3) Expose the p4ast in-kernel interpreter to userspace
> >>    As the easiest way I see in to introduce a new TC classifier cls_p4.
> >> 
> >>    This can work in a very similar way cls_bpf is:
> >>    $ tc filter add dev eth0 ingress p4 da ast example.ast
> >> 
> >>    The TC cls_p4 will be also used for runtime table manipulation.
> >
> >I think this is a great model for the case where HW can provide all
> >of the required capabilities. Thinking about the case where HW
> >provides a subset and SW provides an extended version, i.e. the
> >reality we live in for hosts with ASIC NICs ;-) The hand off point
> >requires some understanding between p4ast and eBPF.
> 
> It can be the other way around. The p4>ebpf compiler won't be complete
> at the beginning so it is possible that HW could provide more features.
> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
> the user can set different program to each. I think in real life, that
> would be the most common case anyway.

So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
Why even risk including a possibly incomplete compiler? Older kernels
must be capable of running along newer hardware as long as eBPF can
represent the software path. Having to upgrade to latest and greatest
kernels is not an option for most people so they would simply have to
fall back to SKIP_SW and do it in user space anyway.

> >Therefore another idea would be to use cls_bpf directly for this. The
> >p4ast IR could be stored in a separate ELF section in the same object
> >file with an existing eBPF program. The p4ast IR will match the
> 
> I don't like this idea. The kernel API should be clean and simple.
> Bundling p4ast with bpf.o code, so the bpf.o is for kernel and p4ast is
> for driver does not look clean at all. The bundle does not make really
> sense as the programs may do different things for BPF and p4.

I don't care strongly for the bundle. Let's forget about it for now.

> Plus, it's up to user to set this up like he wants. If he wants SW
> processing by BPF and at the same time HW processing by P4, he will use:
> cls_bpf instance with SKIP_HW
> cls_p4 instance with SKIP_SW.
> 
> This is much more variable, clean and non-confusing approach, I believe.

Non ASIC hardware will want to do offload based on BPF though so your
model would require the user to be aware of what is the preferred
model for his hardware and then either load a cls_bpf only to work
with a Netronome NIC or a cls_p4 + cls_bpf to work with an ASIC NIC,
correct?

I'm not seeing how either of them is more or less variable. The main
difference is whether to require configuring a single cls with both
p4ast + bpf or two separate cls, one for each. I'd prefer the single
cls approach simply because it is cleaner wither regard to offload
directly off bpf vs off p4ast.

My main point is to not include a IR to eBPF compiler in the kernel
and let user space handle this instead.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 11:15     ` Thomas Graf
@ 2016-10-29 11:28       ` Jiri Pirko
  2016-10-29 12:09         ` Thomas Graf
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-10-29 11:28 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, davem, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

Sat, Oct 29, 2016 at 01:15:48PM CEST, tgraf@suug.ch wrote:
>On 10/29/16 at 12:10pm, Jiri Pirko wrote:
>> Sat, Oct 29, 2016 at 11:39:05AM CEST, tgraf@suug.ch wrote:
>> >On 10/29/16 at 09:53am, Jiri Pirko wrote:
>> >> 3) Expose the p4ast in-kernel interpreter to userspace
>> >>    As the easiest way I see in to introduce a new TC classifier cls_p4.
>> >> 
>> >>    This can work in a very similar way cls_bpf is:
>> >>    $ tc filter add dev eth0 ingress p4 da ast example.ast
>> >> 
>> >>    The TC cls_p4 will be also used for runtime table manipulation.
>> >
>> >I think this is a great model for the case where HW can provide all
>> >of the required capabilities. Thinking about the case where HW
>> >provides a subset and SW provides an extended version, i.e. the
>> >reality we live in for hosts with ASIC NICs ;-) The hand off point
>> >requires some understanding between p4ast and eBPF.
>> 
>> It can be the other way around. The p4>ebpf compiler won't be complete
>> at the beginning so it is possible that HW could provide more features.
>> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
>> the user can set different program to each. I think in real life, that
>> would be the most common case anyway.
>
>So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
>Why even risk including a possibly incomplete compiler? Older kernels
>must be capable of running along newer hardware as long as eBPF can
>represent the software path. Having to upgrade to latest and greatest
>kernels is not an option for most people so they would simply have to
>fall back to SKIP_SW and do it in user space anyway.

The thing is, if we needo to offload something, it needs to be
implemented in kernel first. Also, I believe that it is good to have
in-kernel p4 engine for testing and development purposes.


>
>> >Therefore another idea would be to use cls_bpf directly for this. The
>> >p4ast IR could be stored in a separate ELF section in the same object
>> >file with an existing eBPF program. The p4ast IR will match the
>> 
>> I don't like this idea. The kernel API should be clean and simple.
>> Bundling p4ast with bpf.o code, so the bpf.o is for kernel and p4ast is
>> for driver does not look clean at all. The bundle does not make really
>> sense as the programs may do different things for BPF and p4.
>
>I don't care strongly for the bundle. Let's forget about it for now.
>
>> Plus, it's up to user to set this up like he wants. If he wants SW
>> processing by BPF and at the same time HW processing by P4, he will use:
>> cls_bpf instance with SKIP_HW
>> cls_p4 instance with SKIP_SW.
>> 
>> This is much more variable, clean and non-confusing approach, I believe.
>
>Non ASIC hardware will want to do offload based on BPF though so your
>model would require the user to be aware of what is the preferred
>model for his hardware and then either load a cls_bpf only to work
>with a Netronome NIC or a cls_p4 + cls_bpf to work with an ASIC NIC,
>correct?

Correct


>
>I'm not seeing how either of them is more or less variable. The main
>difference is whether to require configuring a single cls with both
>p4ast + bpf or two separate cls, one for each. I'd prefer the single
>cls approach simply because it is cleaner wither regard to offload
>directly off bpf vs off p4ast.

That's the bundle that you asked me to forget earlier in this email? :)

>
>My main point is to not include a IR to eBPF compiler in the kernel
>and let user space handle this instead.

It we do it as you describe, we would be using 2 different APIs for
offloaded and non-offloaded path. I don't believe it is acceptable as
the offloaded features has to have kernel implementation. Therefore, I
believe that p4ast as a kernel API is the only possible option.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 11:28       ` Jiri Pirko
@ 2016-10-29 12:09         ` Thomas Graf
  2016-10-29 13:58           ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Graf @ 2016-10-29 12:09 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

On 10/29/16 at 01:28pm, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 01:15:48PM CEST, tgraf@suug.ch wrote:
> >So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
> >Why even risk including a possibly incomplete compiler? Older kernels
> >must be capable of running along newer hardware as long as eBPF can
> >represent the software path. Having to upgrade to latest and greatest
> >kernels is not an option for most people so they would simply have to
> >fall back to SKIP_SW and do it in user space anyway.
> 
> The thing is, if we needo to offload something, it needs to be
> implemented in kernel first. Also, I believe that it is good to have
> in-kernel p4 engine for testing and development purposes.

You lost me now :-) In an earlier email you said:

> It can be the other way around. The p4>ebpf compiler won't be complete
> at the beginning so it is possible that HW could provide more features.
> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
> the user can set different program to each. I think in real life, that
> would be the most common case anyway.

If you allow to SKIP_SW and set different programs each to address
this, then how is this any different.

I completely agree that kernel must be able to provide the same
functionality as HW with optional additional capabilities on top so
the HW can always bail out and punt to SW.

[...]

> >I'm not seeing how either of them is more or less variable. The main
> >difference is whether to require configuring a single cls with both
> >p4ast + bpf or two separate cls, one for each. I'd prefer the single
> >cls approach simply because it is cleaner wither regard to offload
> >directly off bpf vs off p4ast.
> 
> That's the bundle that you asked me to forget earlier in this email? :)

I thought you referred to the "store in same object file" as bundle.
I don't really care about that. What I care about is a single way to
configure this that works for both ASIC and non-ASIC hardware.

> >My main point is to not include a IR to eBPF compiler in the kernel
> >and let user space handle this instead.
> 
> It we do it as you describe, we would be using 2 different APIs for
> offloaded and non-offloaded path. I don't believe it is acceptable as
> the offloaded features has to have kernel implementation. Therefore, I
> believe that p4ast as a kernel API is the only possible option.

Yes, the kernel has the SW implementation in eBPF. I thought that is
what you propose as well. The only difference is whether to generate
that eBPF in kernel or user space.

Not sure I understand the multiple APIs point for offload vs
non-offload. There is a single API: tc. Both models require the user
to provide additional metadata to allow programming ASIC HW: p4ast
IR or whatever we agree on.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 12:09         ` Thomas Graf
@ 2016-10-29 13:58           ` Jiri Pirko
  2016-10-29 14:54             ` Jakub Kicinski
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-10-29 13:58 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, davem, jhs, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

Sat, Oct 29, 2016 at 02:09:32PM CEST, tgraf@suug.ch wrote:
>On 10/29/16 at 01:28pm, Jiri Pirko wrote:
>> Sat, Oct 29, 2016 at 01:15:48PM CEST, tgraf@suug.ch wrote:
>> >So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
>> >Why even risk including a possibly incomplete compiler? Older kernels
>> >must be capable of running along newer hardware as long as eBPF can
>> >represent the software path. Having to upgrade to latest and greatest
>> >kernels is not an option for most people so they would simply have to
>> >fall back to SKIP_SW and do it in user space anyway.
>> 
>> The thing is, if we needo to offload something, it needs to be
>> implemented in kernel first. Also, I believe that it is good to have
>> in-kernel p4 engine for testing and development purposes.
>
>You lost me now :-) In an earlier email you said:
>
>> It can be the other way around. The p4>ebpf compiler won't be complete
>> at the beginning so it is possible that HW could provide more features.
>> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
>> the user can set different program to each. I think in real life, that
>> would be the most common case anyway.
>
>If you allow to SKIP_SW and set different programs each to address
>this, then how is this any different.
>
>I completely agree that kernel must be able to provide the same
>functionality as HW with optional additional capabilities on top so
>the HW can always bail out and punt to SW.
>
>[...]
>
>> >I'm not seeing how either of them is more or less variable. The main
>> >difference is whether to require configuring a single cls with both
>> >p4ast + bpf or two separate cls, one for each. I'd prefer the single
>> >cls approach simply because it is cleaner wither regard to offload
>> >directly off bpf vs off p4ast.
>> 
>> That's the bundle that you asked me to forget earlier in this email? :)
>
>I thought you referred to the "store in same object file" as bundle.
>I don't really care about that. What I care about is a single way to
>configure this that works for both ASIC and non-ASIC hardware.
>
>> >My main point is to not include a IR to eBPF compiler in the kernel
>> >and let user space handle this instead.
>> 
>> It we do it as you describe, we would be using 2 different APIs for
>> offloaded and non-offloaded path. I don't believe it is acceptable as
>> the offloaded features has to have kernel implementation. Therefore, I
>> believe that p4ast as a kernel API is the only possible option.
>
>Yes, the kernel has the SW implementation in eBPF. I thought that is
>what you propose as well. The only difference is whether to generate
>that eBPF in kernel or user space.
>
>Not sure I understand the multiple APIs point for offload vs
>non-offload. There is a single API: tc. Both models require the user
>to provide additional metadata to allow programming ASIC HW: p4ast
>IR or whatever we agree on.

If you do p4>ebpf in userspace, you have 2 apis:
1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
2) to setup hw p4 datapath, you push program.p4ast to kernel

Those are 2 apis. Both wrapped up by TC, but still 2 apis.

What I believe is correct is to have one api:
1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
2) to setup hw p4 datapath, you push program.p4ast to kernel

In case of 1), the program.p4ast will be either interpreted by new p4
interpreter, of translated to bpf and interpreted by that. But this
translation code is part of kernel.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29  7:53 Let's do P4 Jiri Pirko
  2016-10-29  9:39 ` Thomas Graf
@ 2016-10-29 14:49 ` Jakub Kicinski
  2016-10-29 14:55   ` Jiri Pirko
  2016-10-29 16:46   ` John Fastabend
  2016-11-01 11:57 ` Jamal Hadi Salim
  2 siblings, 2 replies; 41+ messages in thread
From: Jakub Kicinski @ 2016-10-29 14:49 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, tgraf, jhs, roopa, john.fastabend, simon.horman,
	ast, daniel, prem, hannes, jbenc, tom, mattyk, idosch, eladr,
	yotamg, nogahf, ogerlitz, linville, andy, f.fainelli, dsa,
	vivien.didelot, andrew, ivecera, Maciej Żenczykowski

On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
> Hi all.
> 
> The network world is divided into 2 general types of hw:
> 1) network ASICs - network specific silicon, containing things like TCAM
>    These ASICs are suitable to be programmed by P4.
> 2) network processors - basically a general purpose CPUs
>    These processors are suitable to be programmed by eBPF.
> 
> I believe that by now, the most people came to a conclusion that it is
> very difficult to handle both types by either P4 or eBPF. And since
> eBPF is part of the kernel, I would like to introduce P4 into kernel
> as well. Here's a plan:
> 
> 1) Define P4 intermediate representation
>    I cannot imagine loading P4 program (c-like syntax text file) into
>    kernel as is. That means that as the first step, we need find some
>    intermediate representation. I can imagine someting in a form of AST,
>    call it "p4ast". I don't really know how to do this exactly though,
>    it's just an idea.
> 
>    In the end there would be a userspace precompiler for this:
>    $ makep4ast example.p4 example.ast

Maybe stating the obvious, but IMHO defining the IR is the hardest part.
eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
AST/IR for switch pipelines should allow for similar flexibility.
Looser coupling would also protect us from changes in spec of the high
level language.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 13:58           ` Jiri Pirko
@ 2016-10-29 14:54             ` Jakub Kicinski
  2016-10-29 14:58               ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: Jakub Kicinski @ 2016-10-29 14:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Thomas Graf, netdev, davem, jhs, roopa, john.fastabend,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

On Sat, 29 Oct 2016 15:58:55 +0200, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 02:09:32PM CEST, tgraf@suug.ch wrote:
> >On 10/29/16 at 01:28pm, Jiri Pirko wrote:  
> >> Sat, Oct 29, 2016 at 01:15:48PM CEST, tgraf@suug.ch wrote:  
> >> >So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
> >> >Why even risk including a possibly incomplete compiler? Older kernels
> >> >must be capable of running along newer hardware as long as eBPF can
> >> >represent the software path. Having to upgrade to latest and greatest
> >> >kernels is not an option for most people so they would simply have to
> >> >fall back to SKIP_SW and do it in user space anyway.  
> >> 
> >> The thing is, if we needo to offload something, it needs to be
> >> implemented in kernel first. Also, I believe that it is good to have
> >> in-kernel p4 engine for testing and development purposes.  
> >
> >You lost me now :-) In an earlier email you said:
> >  
> >> It can be the other way around. The p4>ebpf compiler won't be complete
> >> at the beginning so it is possible that HW could provide more features.
> >> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
> >> the user can set different program to each. I think in real life, that
> >> would be the most common case anyway.  
> >
> >If you allow to SKIP_SW and set different programs each to address
> >this, then how is this any different.
> >
> >I completely agree that kernel must be able to provide the same
> >functionality as HW with optional additional capabilities on top so
> >the HW can always bail out and punt to SW.
> >
> >[...]
> >  
> >> >I'm not seeing how either of them is more or less variable. The main
> >> >difference is whether to require configuring a single cls with both
> >> >p4ast + bpf or two separate cls, one for each. I'd prefer the single
> >> >cls approach simply because it is cleaner wither regard to offload
> >> >directly off bpf vs off p4ast.  
> >> 
> >> That's the bundle that you asked me to forget earlier in this email? :)  
> >
> >I thought you referred to the "store in same object file" as bundle.
> >I don't really care about that. What I care about is a single way to
> >configure this that works for both ASIC and non-ASIC hardware.
> >  
> >> >My main point is to not include a IR to eBPF compiler in the kernel
> >> >and let user space handle this instead.  
> >> 
> >> It we do it as you describe, we would be using 2 different APIs for
> >> offloaded and non-offloaded path. I don't believe it is acceptable as
> >> the offloaded features has to have kernel implementation. Therefore, I
> >> believe that p4ast as a kernel API is the only possible option.  
> >
> >Yes, the kernel has the SW implementation in eBPF. I thought that is
> >what you propose as well. The only difference is whether to generate
> >that eBPF in kernel or user space.
> >
> >Not sure I understand the multiple APIs point for offload vs
> >non-offload. There is a single API: tc. Both models require the user
> >to provide additional metadata to allow programming ASIC HW: p4ast
> >IR or whatever we agree on.  
> 
> If you do p4>ebpf in userspace, you have 2 apis:
> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
> 2) to setup hw p4 datapath, you push program.p4ast to kernel
> 
> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
> 
> What I believe is correct is to have one api:
> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
> 2) to setup hw p4 datapath, you push program.p4ast to kernel
> 
> In case of 1), the program.p4ast will be either interpreted by new p4
> interpreter, of translated to bpf and interpreted by that. But this
> translation code is part of kernel.

Option 3) use a well structured subset of eBPF as user space ABI ;)

In all seriousness, user space already has to have some knowledge about
the underlaying hardware today with different vendors picking different
TC classifiers for offload.  So I humbly agree that 2 APIs may be
acceptable here.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 14:49 ` Jakub Kicinski
@ 2016-10-29 14:55   ` Jiri Pirko
  2016-10-29 16:46   ` John Fastabend
  1 sibling, 0 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-29 14:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, tgraf, jhs, roopa, john.fastabend, simon.horman,
	ast, daniel, prem, hannes, jbenc, tom, mattyk, idosch, eladr,
	yotamg, nogahf, ogerlitz, linville, andy, f.fainelli, dsa,
	vivien.didelot, andrew, ivecera, Maciej Żenczykowski

Sat, Oct 29, 2016 at 04:49:03PM CEST, kubakici@wp.pl wrote:
>On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> Hi all.
>> 
>> The network world is divided into 2 general types of hw:
>> 1) network ASICs - network specific silicon, containing things like TCAM
>>    These ASICs are suitable to be programmed by P4.
>> 2) network processors - basically a general purpose CPUs
>>    These processors are suitable to be programmed by eBPF.
>> 
>> I believe that by now, the most people came to a conclusion that it is
>> very difficult to handle both types by either P4 or eBPF. And since
>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>> as well. Here's a plan:
>> 
>> 1) Define P4 intermediate representation
>>    I cannot imagine loading P4 program (c-like syntax text file) into
>>    kernel as is. That means that as the first step, we need find some
>>    intermediate representation. I can imagine someting in a form of AST,
>>    call it "p4ast". I don't really know how to do this exactly though,
>>    it's just an idea.
>> 
>>    In the end there would be a userspace precompiler for this:
>>    $ makep4ast example.p4 example.ast
>
>Maybe stating the obvious, but IMHO defining the IR is the hardest part.
>eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
>AST/IR for switch pipelines should allow for similar flexibility.
>Looser coupling would also protect us from changes in spec of the high
>level language.

Agreed. I agree with you point this would be nice to have it done in a
generic way. However, I'm not aware of any other language similar to p4.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 14:54             ` Jakub Kicinski
@ 2016-10-29 14:58               ` Jiri Pirko
  0 siblings, 0 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-29 14:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Graf, netdev, davem, jhs, roopa, john.fastabend,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera

Sat, Oct 29, 2016 at 04:54:21PM CEST, kubakici@wp.pl wrote:
>On Sat, 29 Oct 2016 15:58:55 +0200, Jiri Pirko wrote:
>> Sat, Oct 29, 2016 at 02:09:32PM CEST, tgraf@suug.ch wrote:
>> >On 10/29/16 at 01:28pm, Jiri Pirko wrote:  
>> >> Sat, Oct 29, 2016 at 01:15:48PM CEST, tgraf@suug.ch wrote:  
>> >> >So given the SKIP_SW flag, the in-kernel compiler is optional anyway.
>> >> >Why even risk including a possibly incomplete compiler? Older kernels
>> >> >must be capable of running along newer hardware as long as eBPF can
>> >> >represent the software path. Having to upgrade to latest and greatest
>> >> >kernels is not an option for most people so they would simply have to
>> >> >fall back to SKIP_SW and do it in user space anyway.  
>> >> 
>> >> The thing is, if we needo to offload something, it needs to be
>> >> implemented in kernel first. Also, I believe that it is good to have
>> >> in-kernel p4 engine for testing and development purposes.  
>> >
>> >You lost me now :-) In an earlier email you said:
>> >  
>> >> It can be the other way around. The p4>ebpf compiler won't be complete
>> >> at the beginning so it is possible that HW could provide more features.
>> >> I don't think it is a problem. With SKIP_SW and SKIP_HW flags in TC,
>> >> the user can set different program to each. I think in real life, that
>> >> would be the most common case anyway.  
>> >
>> >If you allow to SKIP_SW and set different programs each to address
>> >this, then how is this any different.
>> >
>> >I completely agree that kernel must be able to provide the same
>> >functionality as HW with optional additional capabilities on top so
>> >the HW can always bail out and punt to SW.
>> >
>> >[...]
>> >  
>> >> >I'm not seeing how either of them is more or less variable. The main
>> >> >difference is whether to require configuring a single cls with both
>> >> >p4ast + bpf or two separate cls, one for each. I'd prefer the single
>> >> >cls approach simply because it is cleaner wither regard to offload
>> >> >directly off bpf vs off p4ast.  
>> >> 
>> >> That's the bundle that you asked me to forget earlier in this email? :)  
>> >
>> >I thought you referred to the "store in same object file" as bundle.
>> >I don't really care about that. What I care about is a single way to
>> >configure this that works for both ASIC and non-ASIC hardware.
>> >  
>> >> >My main point is to not include a IR to eBPF compiler in the kernel
>> >> >and let user space handle this instead.  
>> >> 
>> >> It we do it as you describe, we would be using 2 different APIs for
>> >> offloaded and non-offloaded path. I don't believe it is acceptable as
>> >> the offloaded features has to have kernel implementation. Therefore, I
>> >> believe that p4ast as a kernel API is the only possible option.  
>> >
>> >Yes, the kernel has the SW implementation in eBPF. I thought that is
>> >what you propose as well. The only difference is whether to generate
>> >that eBPF in kernel or user space.
>> >
>> >Not sure I understand the multiple APIs point for offload vs
>> >non-offload. There is a single API: tc. Both models require the user
>> >to provide additional metadata to allow programming ASIC HW: p4ast
>> >IR or whatever we agree on.  
>> 
>> If you do p4>ebpf in userspace, you have 2 apis:
>> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
>> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>> 
>> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
>> 
>> What I believe is correct is to have one api:
>> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
>> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>> 
>> In case of 1), the program.p4ast will be either interpreted by new p4
>> interpreter, of translated to bpf and interpreted by that. But this
>> translation code is part of kernel.
>
>Option 3) use a well structured subset of eBPF as user space ABI ;)

:( That would not be nice I believe. Also confusing and hard to
maintain. Plus we would have to do 2 translations, in between
incompatible paradigms.


>
>In all seriousness, user space already has to have some knowledge about
>the underlaying hardware today with different vendors picking different
>TC classifiers for offload.  So I humbly agree that 2 APIs may be
>acceptable here.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 14:49 ` Jakub Kicinski
  2016-10-29 14:55   ` Jiri Pirko
@ 2016-10-29 16:46   ` John Fastabend
  2016-10-30  7:44     ` Jiri Pirko
  1 sibling, 1 reply; 41+ messages in thread
From: John Fastabend @ 2016-10-29 16:46 UTC (permalink / raw)
  To: Jakub Kicinski, Jiri Pirko
  Cc: netdev, davem, tgraf, jhs, roopa, simon.horman, ast, daniel,
	prem, hannes, jbenc, tom, mattyk, idosch, eladr, yotamg, nogahf,
	ogerlitz, linville, andy, f.fainelli, dsa, vivien.didelot,
	andrew, ivecera, Maciej Żenczykowski

On 16-10-29 07:49 AM, Jakub Kicinski wrote:
> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> Hi all.
>>
>> The network world is divided into 2 general types of hw:
>> 1) network ASICs - network specific silicon, containing things like TCAM
>>    These ASICs are suitable to be programmed by P4.
>> 2) network processors - basically a general purpose CPUs
>>    These processors are suitable to be programmed by eBPF.
>>
>> I believe that by now, the most people came to a conclusion that it is
>> very difficult to handle both types by either P4 or eBPF. And since
>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>> as well. Here's a plan:
>>
>> 1) Define P4 intermediate representation
>>    I cannot imagine loading P4 program (c-like syntax text file) into
>>    kernel as is. That means that as the first step, we need find some
>>    intermediate representation. I can imagine someting in a form of AST,
>>    call it "p4ast". I don't really know how to do this exactly though,
>>    it's just an idea.
>>
>>    In the end there would be a userspace precompiler for this:
>>    $ makep4ast example.p4 example.ast
> 
> Maybe stating the obvious, but IMHO defining the IR is the hardest part.
> eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
> AST/IR for switch pipelines should allow for similar flexibility.
> Looser coupling would also protect us from changes in spec of the high
> level language.
> 

Jumping in the middle here. You managed to get an entire thread going
before I even woke up :)

The problem with eBPF as an IR is that in the universe of eBPF IR
programs the subset that can be offloaded onto a standard ASIC based
hardware (non NPU/FPGA/etc) is so small to be almost meaningless IMO.

I tried this for awhile and the result is users have to write very
targeted eBPF that they "know" will be pattern matched and pushed into
an ASIC. It can work but its very fragile. When I did this I ended up
with an eBPF generator for deviceX and an eBPF generator for deviceY
each with a very specific pattern matching engine in the driver to
xlate ebpf-deviceX into its asic. Existing ASICs for example usually
support only one pipeline, only one parser (or require moving mountains
to change the parse via ucode), only one set of tables, and only one
deparser/serailizer at the end to build the new packet. Next-gen pieces
may have some flexibility on the parser side.

There is an interesting resource allocation problem we have that could
be solved by p4 or devlink where in we want to pre-allocate slices of
the TCAM for certain match types. I was planning on writing devlink code
for this because its primarily done at initialization once.

I will note one nice thing about using eBPF however is that you have an
easy software emulation path via ebpf engine in kernel.

... And merging threads here with Jiri's email ...

> If you do p4>ebpf in userspace, you have 2 apis:
> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
> 2) to setup hw p4 datapath, you push program.p4ast to kernel
> 
> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
> 
> What I believe is correct is to have one api:
> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
> 2) to setup hw p4 datapath, you push program.p4ast to kernel
> 

Couple comments around this, first adding yet another IR in the kernel
and another JIT engine to map that IR on to eBPF or hardware vendor X
doesn't get me excited. Its really much easier to write these as backend
objects in LLVM. Not saying it can't be done just saying it is easier
in LLVM. Also we already have the LLVM code for P4 to LLVM-IR to eBPF.
In the end this would be a reasonably complex bit of code in
the kernel only for hardware offload. I have doubts that folks would
ever use it for software only cases. I'm happy to admit I'm wrong here
though.

So yes using llvm backends creates two paths a hardware mgmt and sw
path but in the hardware + software case typical on the edge the
orchestration and management planes have started to manage the hardware
and software as two blocks of logic for performance SLA logic. Even on
the edge it seems in most cases folks are selling SR-IOV ports and
can't fall back to software and charge for the port. But this is just
one use case I suspect others where it does make sense.

> In case of 1), the program.p4ast will be either interpreted by new p4
> interpreter, of translated to bpf and interpreted by that. But this
> translation code is part of kernel.

Finally a couple historic bits. The Flow-API proposed in Ottawa was
mechanically generated from an original P4 draft. At the time I was
working fairly closely with both the hardware and compiler folks. If
there is interest we could use that as a base IR for hardware. It has
a simple mapping to/from the original P4 spec. The newer P4 specs are
significantly more complex by the way.

We also have an emulated path also auto-generated from compiler tools
that creates eBPF code from the IR so this would give you the software
fall-back.

It is something we could spin up an RFC in a few weeks if there is some
agreement here. I'll be traveling though for a week or two but could
get something out in November.

Thanks,
John

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29 16:46   ` John Fastabend
@ 2016-10-30  7:44     ` Jiri Pirko
  2016-10-30 10:26       ` Thomas Graf
  2016-10-30 20:54       ` John Fastabend
  0 siblings, 2 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-30  7:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jakub Kicinski, netdev, davem, tgraf, jhs, roopa, simon.horman,
	ast, daniel, prem, hannes, jbenc, tom, mattyk, idosch, eladr,
	yotamg, nogahf, ogerlitz, linville, andy, f.fainelli, dsa,
	vivien.didelot, andrew, ivecera, Maciej Żenczykowski

Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>>> Hi all.
>>>
>>> The network world is divided into 2 general types of hw:
>>> 1) network ASICs - network specific silicon, containing things like TCAM
>>>    These ASICs are suitable to be programmed by P4.
>>> 2) network processors - basically a general purpose CPUs
>>>    These processors are suitable to be programmed by eBPF.
>>>
>>> I believe that by now, the most people came to a conclusion that it is
>>> very difficult to handle both types by either P4 or eBPF. And since
>>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>>> as well. Here's a plan:
>>>
>>> 1) Define P4 intermediate representation
>>>    I cannot imagine loading P4 program (c-like syntax text file) into
>>>    kernel as is. That means that as the first step, we need find some
>>>    intermediate representation. I can imagine someting in a form of AST,
>>>    call it "p4ast". I don't really know how to do this exactly though,
>>>    it's just an idea.
>>>
>>>    In the end there would be a userspace precompiler for this:
>>>    $ makep4ast example.p4 example.ast
>> 
>> Maybe stating the obvious, but IMHO defining the IR is the hardest part.
>> eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
>> AST/IR for switch pipelines should allow for similar flexibility.
>> Looser coupling would also protect us from changes in spec of the high
>> level language.
>> 
>
>Jumping in the middle here. You managed to get an entire thread going
>before I even woke up :)
>
>The problem with eBPF as an IR is that in the universe of eBPF IR
>programs the subset that can be offloaded onto a standard ASIC based
>hardware (non NPU/FPGA/etc) is so small to be almost meaningless IMO.
>
>I tried this for awhile and the result is users have to write very
>targeted eBPF that they "know" will be pattern matched and pushed into
>an ASIC. It can work but its very fragile. When I did this I ended up
>with an eBPF generator for deviceX and an eBPF generator for deviceY
>each with a very specific pattern matching engine in the driver to
>xlate ebpf-deviceX into its asic. Existing ASICs for example usually
>support only one pipeline, only one parser (or require moving mountains
>to change the parse via ucode), only one set of tables, and only one
>deparser/serailizer at the end to build the new packet. Next-gen pieces
>may have some flexibility on the parser side.
>
>There is an interesting resource allocation problem we have that could
>be solved by p4 or devlink where in we want to pre-allocate slices of
>the TCAM for certain match types. I was planning on writing devlink code
>for this because its primarily done at initialization once.

There are 2 resource allocation problems in our hw. One is general
division ot the resources in feature-chunks. That needs to be done
during the ASIC initialization phase. For that, I also plan to utilize
devlink API.

The second one is runtime allocation of tables, and that would be
handled by p4 just fine.


>
>I will note one nice thing about using eBPF however is that you have an
>easy software emulation path via ebpf engine in kernel.
>
>... And merging threads here with Jiri's email ...
>
>> If you do p4>ebpf in userspace, you have 2 apis:
>> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
>> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>> 
>> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
>> 
>> What I believe is correct is to have one api:
>> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
>> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>> 
>
>Couple comments around this, first adding yet another IR in the kernel
>and another JIT engine to map that IR on to eBPF or hardware vendor X
>doesn't get me excited. Its really much easier to write these as backend
>objects in LLVM. Not saying it can't be done just saying it is easier
>in LLVM. Also we already have the LLVM code for P4 to LLVM-IR to eBPF.
>In the end this would be a reasonably complex bit of code in
>the kernel only for hardware offload. I have doubts that folks would
>ever use it for software only cases. I'm happy to admit I'm wrong here
>though.

Well for hw offload, every driver has to parse the IR (whatever will it
be in) and program HW accordingly. Similar parsing and translation would
be needed for SW path, to translate into eBPF. I don't think it would be
more complex than in the drivers. Should be fine.



>
>So yes using llvm backends creates two paths a hardware mgmt and sw
>path but in the hardware + software case typical on the edge the
>orchestration and management planes have started to manage the hardware
>and software as two blocks of logic for performance SLA logic. Even on
>the edge it seems in most cases folks are selling SR-IOV ports and
>can't fall back to software and charge for the port. But this is just
>one use case I suspect others where it does make sense.
>
>> In case of 1), the program.p4ast will be either interpreted by new p4
>> interpreter, of translated to bpf and interpreted by that. But this
>> translation code is part of kernel.
>
>Finally a couple historic bits. The Flow-API proposed in Ottawa was
>mechanically generated from an original P4 draft. At the time I was
>working fairly closely with both the hardware and compiler folks. If
>there is interest we could use that as a base IR for hardware. It has
>a simple mapping to/from the original P4 spec. The newer P4 specs are
>significantly more complex by the way.

Yeah, I was also thinking about something similar to your Flow-API,
but we need something more generic I believe.


>
>We also have an emulated path also auto-generated from compiler tools
>that creates eBPF code from the IR so this would give you the software
>fall-back.


Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
of p4, if we do what Thomas is suggesting, having x.bpf for SW and
x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
strongly believe there should be a single kernel API for p4 SW+HW - for
both p4 program insertion and runtime configuration.



>
>It is something we could spin up an RFC in a few weeks if there is some
>agreement here. I'll be traveling though for a week or two but could
>get something out in November.
>
>Thanks,
>John
>
>
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30  7:44     ` Jiri Pirko
@ 2016-10-30 10:26       ` Thomas Graf
  2016-10-30 16:38         ` Jiri Pirko
  2016-10-30 20:54       ` John Fastabend
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Graf @ 2016-10-30 10:26 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: John Fastabend, Jakub Kicinski, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On 10/30/16 at 08:44am, Jiri Pirko wrote:
> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
> >On 16-10-29 07:49 AM, Jakub Kicinski wrote:
> >> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
> >>> Hi all.
> >>>
> >>> The network world is divided into 2 general types of hw:
> >>> 1) network ASICs - network specific silicon, containing things like TCAM
> >>>    These ASICs are suitable to be programmed by P4.
> >>> 2) network processors - basically a general purpose CPUs
> >>>    These processors are suitable to be programmed by eBPF.
> >>>
> >>> I believe that by now, the most people came to a conclusion that it is
> >>> very difficult to handle both types by either P4 or eBPF. And since
> >>> eBPF is part of the kernel, I would like to introduce P4 into kernel
> >>> as well. Here's a plan:
> >>>
> >>> 1) Define P4 intermediate representation
> >>>    I cannot imagine loading P4 program (c-like syntax text file) into
> >>>    kernel as is. That means that as the first step, we need find some
> >>>    intermediate representation. I can imagine someting in a form of AST,
> >>>    call it "p4ast". I don't really know how to do this exactly though,
> >>>    it's just an idea.
> >>>
> >>>    In the end there would be a userspace precompiler for this:
> >>>    $ makep4ast example.p4 example.ast
> >> 
> >> Maybe stating the obvious, but IMHO defining the IR is the hardest part.
> >> eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
> >> AST/IR for switch pipelines should allow for similar flexibility.
> >> Looser coupling would also protect us from changes in spec of the high
> >> level language.

My assumption was that a new IR is defined which is easier to parse than
eBPF which is targeted at execution on a CPU and not indented for pattern
matching. Just looking at how llvm creates different patterns and reorders
instructions, I'm not seeing how eBPF can serve as a general purpose IR
if the objective is to allow fairly flexible generation of the bytecode.
Hence the alternative IR serving as additional metadata complementing the
eBPF program.

> >Jumping in the middle here. You managed to get an entire thread going
> >before I even woke up :)
> >
> >The problem with eBPF as an IR is that in the universe of eBPF IR
> >programs the subset that can be offloaded onto a standard ASIC based
> >hardware (non NPU/FPGA/etc) is so small to be almost meaningless IMO.
> >
> >I tried this for awhile and the result is users have to write very
> >targeted eBPF that they "know" will be pattern matched and pushed into
> >an ASIC. It can work but its very fragile. When I did this I ended up
> >with an eBPF generator for deviceX and an eBPF generator for deviceY
> >each with a very specific pattern matching engine in the driver to
> >xlate ebpf-deviceX into its asic. Existing ASICs for example usually
> >support only one pipeline, only one parser (or require moving mountains
> >to change the parse via ucode), only one set of tables, and only one
> >deparser/serailizer at the end to build the new packet. Next-gen pieces
> >may have some flexibility on the parser side.
> >
> >There is an interesting resource allocation problem we have that could
> >be solved by p4 or devlink where in we want to pre-allocate slices of
> >the TCAM for certain match types. I was planning on writing devlink code
> >for this because its primarily done at initialization once.
> 
> There are 2 resource allocation problems in our hw. One is general
> division ot the resources in feature-chunks. That needs to be done
> during the ASIC initialization phase. For that, I also plan to utilize
> devlink API.
> 
> The second one is runtime allocation of tables, and that would be
> handled by p4 just fine.
> 
> 
> >
> >I will note one nice thing about using eBPF however is that you have an
> >easy software emulation path via ebpf engine in kernel.
> >
> >... And merging threads here with Jiri's email ...
> >
> >> If you do p4>ebpf in userspace, you have 2 apis:
> >> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel
> >> 
> >> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
> >> 
> >> What I believe is correct is to have one api:
> >> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel

I understand what you mean with two APIs now. You want a single IR
block and divide the SW/HW part in the kernel rather than let llvm or
something else do it.

> >Couple comments around this, first adding yet another IR in the kernel
> >and another JIT engine to map that IR on to eBPF or hardware vendor X
> >doesn't get me excited. Its really much easier to write these as backend
> >objects in LLVM. Not saying it can't be done just saying it is easier
> >in LLVM. Also we already have the LLVM code for P4 to LLVM-IR to eBPF.
> >In the end this would be a reasonably complex bit of code in
> >the kernel only for hardware offload. I have doubts that folks would
> >ever use it for software only cases. I'm happy to admit I'm wrong here
> >though.
> 
> Well for hw offload, every driver has to parse the IR (whatever will it
> be in) and program HW accordingly. Similar parsing and translation would
> be needed for SW path, to translate into eBPF. I don't think it would be
> more complex than in the drivers. Should be fine.

I'm not sure I see why anyone would ever want to use an IR for SW
purposes which is restricted to the lowest common denominator of HW.
A good example here is OpenFlow and how some of its SW consumers
have evolved with extensions which cannot be mappepd to HW easily.
The same seems to happen with P4 as it introduces the concept of
state and other concepts which are hard to map for dumb HW. P4 doesn't
magically solve this problem, the fundamental difference in
capabilities between HW and SW remain.

> >So yes using llvm backends creates two paths a hardware mgmt and sw
> >path but in the hardware + software case typical on the edge the
> >orchestration and management planes have started to manage the hardware
> >and software as two blocks of logic for performance SLA logic. Even on
> >the edge it seems in most cases folks are selling SR-IOV ports and
> >can't fall back to software and charge for the port. But this is just
> >one use case I suspect others where it does make sense.
> >
> >> In case of 1), the program.p4ast will be either interpreted by new p4
> >> interpreter, of translated to bpf and interpreted by that. But this
> >> translation code is part of kernel.
> >
> >Finally a couple historic bits. The Flow-API proposed in Ottawa was
> >mechanically generated from an original P4 draft. At the time I was
> >working fairly closely with both the hardware and compiler folks. If
> >there is interest we could use that as a base IR for hardware. It has
> >a simple mapping to/from the original P4 spec. The newer P4 specs are
> >significantly more complex by the way.
> 
> Yeah, I was also thinking about something similar to your Flow-API,
> but we need something more generic I believe.
> 
> >We also have an emulated path also auto-generated from compiler tools
> >that creates eBPF code from the IR so this would give you the software
> >fall-back.
> 
> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
> strongly believe there should be a single kernel API for p4 SW+HW - for
> both p4 program insertion and runtime configuration.

I think you misunderstand me. This is not what I'm proposing at all.
In either model, the kernel receives the same IR and can reject.

The rule is very clear: we can't allow to program anything that the
kernel is not capable of doing in SW, right? That was the key take
away from that discussion.

Let's assume we do cls_p4ast HW+SW with an eBPF translator for SW. As a
user of this, as my program becomes more complex I will hit the wall of
HW capabilities at some point and either the IR is not expressive
enough or the driver will reject the program.

I have at least three options now:

  1) I move everything to SW and forget about HW programmability

  2) I let HW bail out when HW can't support it and start parsing from
     scratch in SW. I don't really care how much of it has been done
     in HW, it's a best effort optimization. A use case for this might
     be dropping of packets. This is easy to do with flow based
     offloads as it can be best effort but already difficult when
     programming based on a IR.

  3) I let HW bail out but carry some metadata trying to preserve
     some of the work done. A use case for this would be a router type
     of work where the lookup itself can be expensive and the majority
     of actions taken on packets are simple forwards but a subset of
     actions performed are too complex for HW.  You still want to
     preserve the savings from the expensive lookup already performed.

Even for the simpler 2) I can't just put everything I need into my
p4ast program because the program will either load in its entirety or
it won't.

What I would likely end up doing is to write any number of subsets of
my program which only contain various levels of pieces that are very
likely to load on my target HW. I then load my full program with tc
and want a notification if it also loaded into HW. If it HW failed,
then I want tc to load subset programs with SKIP_SW starting from the
one with most complexity. I need SKIP_SW because I already have the
full program loaded and I don't want to go through both the partial
and full program in case of a HW bail out. Is your proposal to not
allow for the SKIP_SW?

A more evolved form of this would be to expose capabilities of the HW
and have the program writer include logic which results in the split
based on the capabilities of the hardware.

I I understand you correctly, you propose to make this split
automatically in the kernel somehow.

In either model the kernel receives the same new IR which it can
reject. No difference. None of the models allow more or less.

In either model, the program can be loaded with SKIP_SW to load a valid
program into HW only.

In either model, an eBPF program can be loaded at cls_bpf, or a new IR
can be loaded with SKIP_HW to do SW only.

The only difference I see between the models is whether to include a
new IR => eBPF compiler in the kernel or not which is going to be
optional anyway.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 10:26       ` Thomas Graf
@ 2016-10-30 16:38         ` Jiri Pirko
  2016-10-30 17:45           ` Jakub Kicinski
  2016-10-30 22:39           ` Alexei Starovoitov
  0 siblings, 2 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-30 16:38 UTC (permalink / raw)
  To: Thomas Graf
  Cc: John Fastabend, Jakub Kicinski, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>On 10/30/16 at 08:44am, Jiri Pirko wrote:
>> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>> >On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>> >> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> >>> Hi all.
>> >>>
>> >>> The network world is divided into 2 general types of hw:
>> >>> 1) network ASICs - network specific silicon, containing things like TCAM
>> >>>    These ASICs are suitable to be programmed by P4.
>> >>> 2) network processors - basically a general purpose CPUs
>> >>>    These processors are suitable to be programmed by eBPF.
>> >>>
>> >>> I believe that by now, the most people came to a conclusion that it is
>> >>> very difficult to handle both types by either P4 or eBPF. And since
>> >>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>> >>> as well. Here's a plan:
>> >>>
>> >>> 1) Define P4 intermediate representation
>> >>>    I cannot imagine loading P4 program (c-like syntax text file) into
>> >>>    kernel as is. That means that as the first step, we need find some
>> >>>    intermediate representation. I can imagine someting in a form of AST,
>> >>>    call it "p4ast". I don't really know how to do this exactly though,
>> >>>    it's just an idea.
>> >>>
>> >>>    In the end there would be a userspace precompiler for this:
>> >>>    $ makep4ast example.p4 example.ast
>> >> 
>> >> Maybe stating the obvious, but IMHO defining the IR is the hardest part.
>> >> eBPF *is* the IR, we can compile C, P4 or even JIT Lua to eBPF.  The
>> >> AST/IR for switch pipelines should allow for similar flexibility.
>> >> Looser coupling would also protect us from changes in spec of the high
>> >> level language.
>
>My assumption was that a new IR is defined which is easier to parse than
>eBPF which is targeted at execution on a CPU and not indented for pattern
>matching. Just looking at how llvm creates different patterns and reorders
>instructions, I'm not seeing how eBPF can serve as a general purpose IR
>if the objective is to allow fairly flexible generation of the bytecode.
>Hence the alternative IR serving as additional metadata complementing the
>eBPF program.

Agreed.


[...]

>> >... And merging threads here with Jiri's email ...
>> >
>> >> If you do p4>ebpf in userspace, you have 2 apis:
>> >> 1) to setup sw (in-kernel) p4 datapath, you push bpf.o to kernel
>> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>> >> 
>> >> Those are 2 apis. Both wrapped up by TC, but still 2 apis.
>> >> 
>> >> What I believe is correct is to have one api:
>> >> 1) to setup sw (in-kernel) p4 datapath, you push program.p4ast to kernel
>> >> 2) to setup hw p4 datapath, you push program.p4ast to kernel
>
>I understand what you mean with two APIs now. You want a single IR
>block and divide the SW/HW part in the kernel rather than let llvm or
>something else do it.

Exactly. Following drawing shows p4 pipeline setup for SW and Hw:

                                 |
                                 |               +--> ebpf engine
                                 |               |
                                 |               |
                                 |           compilerB
                                 |               ^
                                 |               |
p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
                                 |
                       userspace | kernel
                                 |


Now please consider runtime API for rule insertion/removal/stats/etc.
Also, the single API is cls_p4 here:

                        |
                        |            
                        |            
                        |               
                        |            ebpf map fillup
                        |               ^
                        |               |
             p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
                        |
              userspace | kernel
                        



>
>> >Couple comments around this, first adding yet another IR in the kernel
>> >and another JIT engine to map that IR on to eBPF or hardware vendor X
>> >doesn't get me excited. Its really much easier to write these as backend
>> >objects in LLVM. Not saying it can't be done just saying it is easier
>> >in LLVM. Also we already have the LLVM code for P4 to LLVM-IR to eBPF.
>> >In the end this would be a reasonably complex bit of code in
>> >the kernel only for hardware offload. I have doubts that folks would
>> >ever use it for software only cases. I'm happy to admit I'm wrong here
>> >though.
>> 
>> Well for hw offload, every driver has to parse the IR (whatever will it
>> be in) and program HW accordingly. Similar parsing and translation would
>> be needed for SW path, to translate into eBPF. I don't think it would be
>> more complex than in the drivers. Should be fine.
>
>I'm not sure I see why anyone would ever want to use an IR for SW
>purposes which is restricted to the lowest common denominator of HW.
>A good example here is OpenFlow and how some of its SW consumers
>have evolved with extensions which cannot be mappepd to HW easily.
>The same seems to happen with P4 as it introduces the concept of
>state and other concepts which are hard to map for dumb HW. P4 doesn't
>magically solve this problem, the fundamental difference in
>capabilities between HW and SW remain.
>
>> >So yes using llvm backends creates two paths a hardware mgmt and sw
>> >path but in the hardware + software case typical on the edge the
>> >orchestration and management planes have started to manage the hardware
>> >and software as two blocks of logic for performance SLA logic. Even on
>> >the edge it seems in most cases folks are selling SR-IOV ports and
>> >can't fall back to software and charge for the port. But this is just
>> >one use case I suspect others where it does make sense.
>> >
>> >> In case of 1), the program.p4ast will be either interpreted by new p4
>> >> interpreter, of translated to bpf and interpreted by that. But this
>> >> translation code is part of kernel.
>> >
>> >Finally a couple historic bits. The Flow-API proposed in Ottawa was
>> >mechanically generated from an original P4 draft. At the time I was
>> >working fairly closely with both the hardware and compiler folks. If
>> >there is interest we could use that as a base IR for hardware. It has
>> >a simple mapping to/from the original P4 spec. The newer P4 specs are
>> >significantly more complex by the way.
>> 
>> Yeah, I was also thinking about something similar to your Flow-API,
>> but we need something more generic I believe.
>> 
>> >We also have an emulated path also auto-generated from compiler tools
>> >that creates eBPF code from the IR so this would give you the software
>> >fall-back.
>> 
>> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
>> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
>> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
>> strongly believe there should be a single kernel API for p4 SW+HW - for
>> both p4 program insertion and runtime configuration.
>
>I think you misunderstand me. This is not what I'm proposing at all.
>In either model, the kernel receives the same IR and can reject.
>
>The rule is very clear: we can't allow to program anything that the
>kernel is not capable of doing in SW, right? That was the key take
>away from that discussion.


***
Exactly. But if you treat p4ast as a "metadata" of ebpf program destined
solely to setup HW, that in my opinion is a bypass. Because the ebpf part
and p4ast part could have no relacionship with each other. So I see it as
2 independent APIs. One for SW, one for HW. And having this kind od API
for hw only is a bypass.

Plus the thing I cannot imagine in the model you propose is table fillup.
For ebpf, you use maps. For p4 you would have to have a separate HW-only
API. This is very similar to the original John's Flow-API. And therefore
a kernel bypass.


>
>Let's assume we do cls_p4ast HW+SW with an eBPF translator for SW. As a
>user of this, as my program becomes more complex I will hit the wall of
>HW capabilities at some point and either the IR is not expressive
>enough or the driver will reject the program.

That can certainly happen, no matter what model we choose.


>
>I have at least three options now:
>
>  1) I move everything to SW and forget about HW programmability
>
>  2) I let HW bail out when HW can't support it and start parsing from
>     scratch in SW. I don't really care how much of it has been done
>     in HW, it's a best effort optimization. A use case for this might
>     be dropping of packets. This is easy to do with flow based
>     offloads as it can be best effort but already difficult when
>     programming based on a IR.
>
>  3) I let HW bail out but carry some metadata trying to preserve
>     some of the work done. A use case for this would be a router type
>     of work where the lookup itself can be expensive and the majority
>     of actions taken on packets are simple forwards but a subset of
>     actions performed are too complex for HW.  You still want to
>     preserve the savings from the expensive lookup already performed.
>

You still have a choice to do this:
use cls_bpf SKIP_HW for SW processing
use cls_p4 SKIP_SW for HW processing
That gives you flexibility to program the pipelines separatelly. If a driver
is not capable of handling the selected p4 program, it will refuse to
program the HW. Then it is up to the user to change the program according
to the HW features.


>Even for the simpler 2) I can't just put everything I need into my
>p4ast program because the program will either load in its entirety or
>it won't.
>
>What I would likely end up doing is to write any number of subsets of
>my program which only contain various levels of pieces that are very
>likely to load on my target HW. I then load my full program with tc
>and want a notification if it also loaded into HW. If it HW failed,
>then I want tc to load subset programs with SKIP_SW starting from the
>one with most complexity. I need SKIP_SW because I already have the
>full program loaded and I don't want to go through both the partial
>and full program in case of a HW bail out. Is your proposal to not
>allow for the SKIP_SW?

Definitelly not. User should be able to pass SKIP_SW and SKIP_HW as he is
now able to do it for cls_u32, cls_flower and others.


>
>A more evolved form of this would be to expose capabilities of the HW
>and have the program writer include logic which results in the split
>based on the capabilities of the hardware.

I wonder why p4 does not handle the HW capabilities. At least I did
not find it. It would be certainly nice to have it.


>
>I I understand you correctly, you propose to make this split
>automatically in the kernel somehow.
>
>In either model the kernel receives the same new IR which it can
>reject. No difference. None of the models allow more or less.
>
>In either model, the program can be loaded with SKIP_SW to load a valid
>program into HW only.
>
>In either model, an eBPF program can be loaded at cls_bpf, or a new IR
>can be loaded with SKIP_HW to do SW only.
>
>The only difference I see between the models is whether to include a
>new IR => eBPF compiler in the kernel or not which is going to be
>optional anyway.

The main dirrefence I see is the single API/kernel bypass problem
I described earlier in this email (***)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 16:38         ` Jiri Pirko
@ 2016-10-30 17:45           ` Jakub Kicinski
  2016-10-30 18:01             ` Jiri Pirko
  2016-10-30 22:39           ` Alexei Starovoitov
  1 sibling, 1 reply; 41+ messages in thread
From: Jakub Kicinski @ 2016-10-30 17:45 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Thomas Graf, John Fastabend, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On Sun, 30 Oct 2016 17:38:36 +0100, Jiri Pirko wrote:
> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
> >On 10/30/16 at 08:44am, Jiri Pirko wrote:  
> >> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:  
>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >
> >My assumption was that a new IR is defined which is easier to parse than
> >eBPF which is targeted at execution on a CPU and not indented for pattern
> >matching. Just looking at how llvm creates different patterns and reorders
> >instructions, I'm not seeing how eBPF can serve as a general purpose IR
> >if the objective is to allow fairly flexible generation of the bytecode.
> >Hence the alternative IR serving as additional metadata complementing the
> >eBPF program.  
> 
> Agreed.

Just to clarify my intention here was not to suggest the use of eBPF as
the IR.  I was merely cautioning against bundling the new API with P4,
for multiple reasons.  As John mentioned P4 spec was evolving in the
past.  The spec is designed for HW more capable than the switch ASICs we
have today.  As vendors move to provide more configurability we may need
to extend the API beyond P4.  We may want to extend this API to for SW
hand-offs (as suggested by Thomas) which are not part of P4 spec.  Also
John showed examples of matchd software which already uses P4 at the
frontend today and translates it to different targets (eBPF, u32, HW).
It may just be about the naming but I feel like calling the new API
more generically, switch AST or some such may help to avoid unnecessary
ties and confusion.

> >I understand what you mean with two APIs now. You want a single IR
> >block and divide the SW/HW part in the kernel rather than let llvm or
> >something else do it.  
> 
> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
> 
>                                  |
>                                  |               +--> ebpf engine
>                                  |               |
>                                  |               |
>                                  |           compilerB
>                                  |               ^
>                                  |               |
> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>                                  |
>                        userspace | kernel
>                                  |
>
> Now please consider runtime API for rule insertion/removal/stats/etc.
> Also, the single API is cls_p4 here:
> 
>                         |
>                         |            
>                         |            
>                         |               
>                         |            ebpf map fillup
>                         |               ^
>                         |               |
>              p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
>                         |
>               userspace | kernel
>                         

My understanding was that the main purpose of SW eBPF translation would
be to piggy back on eBPF userspace map API.  This seems not to be the
case here?  Is "P4 rule" being added via some new API?  From performance
perspective the SW AST implementation would probably not be any slower
than u32, so I don't think we need eBPF for performance.  I must be
misreading this, if we want eBPF fallback we must extend eBPF with all
the map types anyway... so we could just use eBPF map API?  I believe
John has already done some work in this space (see his GitHub :))

As for AST -> eBPF translator in the kernel, IMHO it could be very
useful.  Since all the drivers will have to implement translators
anyway, the eBPF translator may help to build a good shared
infrastructure.  I mean - it could be a starting place for sharing code
between drivers if done properly.

> >> Well for hw offload, every driver has to parse the IR (whatever will it
> >> be in) and program HW accordingly. Similar parsing and translation would
> >> be needed for SW path, to translate into eBPF. I don't think it would be
> >> more complex than in the drivers. Should be fine.  
> >
> >I'm not sure I see why anyone would ever want to use an IR for SW
> >purposes which is restricted to the lowest common denominator of HW.
> >A good example here is OpenFlow and how some of its SW consumers
> >have evolved with extensions which cannot be mappepd to HW easily.
> >The same seems to happen with P4 as it introduces the concept of
> >state and other concepts which are hard to map for dumb HW. P4 doesn't
> >magically solve this problem, the fundamental difference in
> >capabilities between HW and SW remain.
> >  
>  [...]  
>  [...]  
>  [...]  
> >> 
> >> Yeah, I was also thinking about something similar to your Flow-API,
> >> but we need something more generic I believe.
> >>   
>  [...]  
> >> 
> >> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
> >> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
> >> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
> >> strongly believe there should be a single kernel API for p4 SW+HW - for
> >> both p4 program insertion and runtime configuration.  
> >
> >I think you misunderstand me. This is not what I'm proposing at all.
> >In either model, the kernel receives the same IR and can reject.
> >
> >The rule is very clear: we can't allow to program anything that the
> >kernel is not capable of doing in SW, right? That was the key take
> >away from that discussion.  
> 
> 
> ***
> Exactly. But if you treat p4ast as a "metadata" of ebpf program destined
> solely to setup HW, that in my opinion is a bypass. Because the ebpf part
> and p4ast part could have no relacionship with each other. So I see it as
> 2 independent APIs. One for SW, one for HW. And having this kind od API
> for hw only is a bypass.

+1
Adding metadata to eBPF programs usually fails because the verification
that the metadata is correct in the kernel is usually not much easier
than generating it in the first place.  And not verifying it opens up a
way of kernel bypass.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 17:45           ` Jakub Kicinski
@ 2016-10-30 18:01             ` Jiri Pirko
  2016-10-30 18:44               ` Jakub Kicinski
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-10-30 18:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Graf, John Fastabend, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Sun, Oct 30, 2016 at 06:45:26PM CET, kubakici@wp.pl wrote:
>On Sun, 30 Oct 2016 17:38:36 +0100, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>> >On 10/30/16 at 08:44am, Jiri Pirko wrote:  
>> >> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:  
>>  [...]  
>>  [...]  
>>  [...]  
>>  [...]  
>> >
>> >My assumption was that a new IR is defined which is easier to parse than
>> >eBPF which is targeted at execution on a CPU and not indented for pattern
>> >matching. Just looking at how llvm creates different patterns and reorders
>> >instructions, I'm not seeing how eBPF can serve as a general purpose IR
>> >if the objective is to allow fairly flexible generation of the bytecode.
>> >Hence the alternative IR serving as additional metadata complementing the
>> >eBPF program.  
>> 
>> Agreed.
>
>Just to clarify my intention here was not to suggest the use of eBPF as
>the IR.  I was merely cautioning against bundling the new API with P4,
>for multiple reasons.  As John mentioned P4 spec was evolving in the
>past.  The spec is designed for HW more capable than the switch ASICs we
>have today.  As vendors move to provide more configurability we may need
>to extend the API beyond P4.  We may want to extend this API to for SW
>hand-offs (as suggested by Thomas) which are not part of P4 spec.  Also
>John showed examples of matchd software which already uses P4 at the
>frontend today and translates it to different targets (eBPF, u32, HW).
>It may just be about the naming but I feel like calling the new API
>more generically, switch AST or some such may help to avoid unnecessary
>ties and confusion.

Well, that basically means to create "something" that could be be used
to translate p4 source to. Not sure how exactly this "something" should
look like and how different would it be from p4. I thought it might
be good to benefit from the p4 definition and use it directly. Not sure.


>
>> >I understand what you mean with two APIs now. You want a single IR
>> >block and divide the SW/HW part in the kernel rather than let llvm or
>> >something else do it.  
>> 
>> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>> 
>>                                  |
>>                                  |               +--> ebpf engine
>>                                  |               |
>>                                  |               |
>>                                  |           compilerB
>>                                  |               ^
>>                                  |               |
>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>                                  |
>>                        userspace | kernel
>>                                  |
>>
>> Now please consider runtime API for rule insertion/removal/stats/etc.
>> Also, the single API is cls_p4 here:
>> 
>>                         |
>>                         |            
>>                         |            
>>                         |               
>>                         |            ebpf map fillup
>>                         |               ^
>>                         |               |
>>              p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
>>                         |
>>               userspace | kernel
>>                         
>
>My understanding was that the main purpose of SW eBPF translation would
>be to piggy back on eBPF userspace map API.  This seems not to be the
>case here?  Is "P4 rule" being added via some new API?  From performance

cls_p4 TC classifier.


>perspective the SW AST implementation would probably not be any slower
>than u32, so I don't think we need eBPF for performance.  I must be
>misreading this, if we want eBPF fallback we must extend eBPF with all
>the map types anyway... so we could just use eBPF map API?  I believe
>John has already done some work in this space (see his GitHub :))

I don't think you can use existing BPF maps kernel API. You would still
have to have another API just for the offloaded datapath. And that is
a bypass. I strongly believe we need a single kernel API for both
SW and HW datapath setup and runtime configuration.


>
>As for AST -> eBPF translator in the kernel, IMHO it could be very
>useful.  Since all the drivers will have to implement translators
>anyway, the eBPF translator may help to build a good shared
>infrastructure.  I mean - it could be a starting place for sharing code
>between drivers if done properly.

Agreed.


>
>> >> Well for hw offload, every driver has to parse the IR (whatever will it
>> >> be in) and program HW accordingly. Similar parsing and translation would
>> >> be needed for SW path, to translate into eBPF. I don't think it would be
>> >> more complex than in the drivers. Should be fine.  
>> >
>> >I'm not sure I see why anyone would ever want to use an IR for SW
>> >purposes which is restricted to the lowest common denominator of HW.
>> >A good example here is OpenFlow and how some of its SW consumers
>> >have evolved with extensions which cannot be mappepd to HW easily.
>> >The same seems to happen with P4 as it introduces the concept of
>> >state and other concepts which are hard to map for dumb HW. P4 doesn't
>> >magically solve this problem, the fundamental difference in
>> >capabilities between HW and SW remain.
>> >  
>>  [...]  
>>  [...]  
>>  [...]  
>> >> 
>> >> Yeah, I was also thinking about something similar to your Flow-API,
>> >> but we need something more generic I believe.
>> >>   
>>  [...]  
>> >> 
>> >> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
>> >> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
>> >> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
>> >> strongly believe there should be a single kernel API for p4 SW+HW - for
>> >> both p4 program insertion and runtime configuration.  
>> >
>> >I think you misunderstand me. This is not what I'm proposing at all.
>> >In either model, the kernel receives the same IR and can reject.
>> >
>> >The rule is very clear: we can't allow to program anything that the
>> >kernel is not capable of doing in SW, right? That was the key take
>> >away from that discussion.  
>> 
>> 
>> ***
>> Exactly. But if you treat p4ast as a "metadata" of ebpf program destined
>> solely to setup HW, that in my opinion is a bypass. Because the ebpf part
>> and p4ast part could have no relacionship with each other. So I see it as
>> 2 independent APIs. One for SW, one for HW. And having this kind od API
>> for hw only is a bypass.
>
>+1
>Adding metadata to eBPF programs usually fails because the verification
>that the metadata is correct in the kernel is usually not much easier
>than generating it in the first place.  And not verifying it opens up a
>way of kernel bypass.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 18:01             ` Jiri Pirko
@ 2016-10-30 18:44               ` Jakub Kicinski
  2016-10-30 19:56                 ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: Jakub Kicinski @ 2016-10-30 18:44 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Thomas Graf, John Fastabend, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On Sun, 30 Oct 2016 19:01:03 +0100, Jiri Pirko wrote:
> Sun, Oct 30, 2016 at 06:45:26PM CET, kubakici@wp.pl wrote:
> >On Sun, 30 Oct 2016 17:38:36 +0100, Jiri Pirko wrote:  
> >> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:  
>  [...]  
>  [...]  
> >>  [...]  
> >>  [...]  
> >>  [...]  
> >>  [...]    
>  [...]  
> >> 
> >> Agreed.  
> >
> >Just to clarify my intention here was not to suggest the use of eBPF as
> >the IR.  I was merely cautioning against bundling the new API with P4,
> >for multiple reasons.  As John mentioned P4 spec was evolving in the
> >past.  The spec is designed for HW more capable than the switch ASICs we
> >have today.  As vendors move to provide more configurability we may need
> >to extend the API beyond P4.  We may want to extend this API to for SW
> >hand-offs (as suggested by Thomas) which are not part of P4 spec.  Also
> >John showed examples of matchd software which already uses P4 at the
> >frontend today and translates it to different targets (eBPF, u32, HW).
> >It may just be about the naming but I feel like calling the new API
> >more generically, switch AST or some such may help to avoid unnecessary
> >ties and confusion.  
> 
> Well, that basically means to create "something" that could be be used
> to translate p4 source to. Not sure how exactly this "something" should
> look like and how different would it be from p4. I thought it might
> be good to benefit from the p4 definition and use it directly. Not sure.

We have to translate the P4 into "something" already, that something
is the AST we will load into the kernel.  Or were you planning to use
some official P4 AST?  I'm not suggesting we add our own high level
language.  I agree that P4 is a good starting point, and perhaps a good
high level language.  I'm just cautious of creating an equivalency
between high level language (P4) and the kernel ABI.

Perhaps I'm just wasting everyone's time with this.

> >> 
> >> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
> >> 
> >>                                  |
> >>                                  |               +--> ebpf engine
> >>                                  |               |
> >>                                  |               |
> >>                                  |           compilerB
> >>                                  |               ^
> >>                                  |               |
> >> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
> >>                                  |
> >>                        userspace | kernel
> >>                                  |
> >>
> >> Now please consider runtime API for rule insertion/removal/stats/etc.
> >> Also, the single API is cls_p4 here:
> >> 
> >>                         |
> >>                         |            
> >>                         |            
> >>                         |               
> >>                         |            ebpf map fillup
> >>                         |               ^
> >>                         |               |
> >>              p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
> >>                         |
> >>               userspace | kernel
> >>                           
> >
> >My understanding was that the main purpose of SW eBPF translation would
> >be to piggy back on eBPF userspace map API.  This seems not to be the
> >case here?  Is "P4 rule" being added via some new API?  From performance  
> 
> cls_p4 TC classifier.

Oh, so the cls_p4 is just a proxy forwarding the requests to drivers
or eBPF backend.  Got it.  Sorry for being slow.  And the requests
come down via change() op or something new?  I wonder how such scheme
compares to eBPF maps performance-wise (updates/sec).

> >perspective the SW AST implementation would probably not be any slower
> >than u32, so I don't think we need eBPF for performance.  I must be
> >misreading this, if we want eBPF fallback we must extend eBPF with all
> >the map types anyway... so we could just use eBPF map API?  I believe
> >John has already done some work in this space (see his GitHub :))  
> 
> I don't think you can use existing BPF maps kernel API. You would still
> have to have another API just for the offloaded datapath. And that is
> a bypass. I strongly believe we need a single kernel API for both
> SW and HW datapath setup and runtime configuration.

Agreed, single API is a must.  What is the HW characteristic which
doesn't fit with eBPF map API, though?  For eBPF offload I was planning
on adding offload hooks on eBPF map lookup/update paths and a way of
associating the map with a netdev.  This should be enough to forward
updates to the driver and intercept reads to return the right
statistics.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 18:44               ` Jakub Kicinski
@ 2016-10-30 19:56                 ` Jiri Pirko
  2016-10-30 21:14                   ` John Fastabend
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-10-30 19:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Thomas Graf, John Fastabend, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Sun, Oct 30, 2016 at 07:44:43PM CET, kubakici@wp.pl wrote:
>On Sun, 30 Oct 2016 19:01:03 +0100, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 06:45:26PM CET, kubakici@wp.pl wrote:
>> >On Sun, 30 Oct 2016 17:38:36 +0100, Jiri Pirko wrote:  
>> >> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:  
>>  [...]  
>>  [...]  
>> >>  [...]  
>> >>  [...]  
>> >>  [...]  
>> >>  [...]    
>>  [...]  
>> >> 
>> >> Agreed.  
>> >
>> >Just to clarify my intention here was not to suggest the use of eBPF as
>> >the IR.  I was merely cautioning against bundling the new API with P4,
>> >for multiple reasons.  As John mentioned P4 spec was evolving in the
>> >past.  The spec is designed for HW more capable than the switch ASICs we
>> >have today.  As vendors move to provide more configurability we may need
>> >to extend the API beyond P4.  We may want to extend this API to for SW
>> >hand-offs (as suggested by Thomas) which are not part of P4 spec.  Also
>> >John showed examples of matchd software which already uses P4 at the
>> >frontend today and translates it to different targets (eBPF, u32, HW).
>> >It may just be about the naming but I feel like calling the new API
>> >more generically, switch AST or some such may help to avoid unnecessary
>> >ties and confusion.  
>> 
>> Well, that basically means to create "something" that could be be used
>> to translate p4 source to. Not sure how exactly this "something" should
>> look like and how different would it be from p4. I thought it might
>> be good to benefit from the p4 definition and use it directly. Not sure.
>
>We have to translate the P4 into "something" already, that something
>is the AST we will load into the kernel.  Or were you planning to use
>some official P4 AST?  I'm not suggesting we add our own high level

I'm not aware of existence of some official P4 AST. We have to figure it
out.


>language.  I agree that P4 is a good starting point, and perhaps a good
>high level language.  I'm just cautious of creating an equivalency
>between high level language (P4) and the kernel ABI.

Understood. Definitelly good to be very cautious when defining a kernel
API.


>
>Perhaps I'm just wasting everyone's time with this.
>
>> >> 
>> >> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>> >> 
>> >>                                  |
>> >>                                  |               +--> ebpf engine
>> >>                                  |               |
>> >>                                  |               |
>> >>                                  |           compilerB
>> >>                                  |               ^
>> >>                                  |               |
>> >> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>> >>                                  |
>> >>                        userspace | kernel
>> >>                                  |
>> >>
>> >> Now please consider runtime API for rule insertion/removal/stats/etc.
>> >> Also, the single API is cls_p4 here:
>> >> 
>> >>                         |
>> >>                         |            
>> >>                         |            
>> >>                         |               
>> >>                         |            ebpf map fillup
>> >>                         |               ^
>> >>                         |               |
>> >>              p4 rule --TCNL--> cls_p4 --+-> driver -> HW table fillup
>> >>                         |
>> >>               userspace | kernel
>> >>                           
>> >
>> >My understanding was that the main purpose of SW eBPF translation would
>> >be to piggy back on eBPF userspace map API.  This seems not to be the
>> >case here?  Is "P4 rule" being added via some new API?  From performance  
>> 
>> cls_p4 TC classifier.
>
>Oh, so the cls_p4 is just a proxy forwarding the requests to drivers
>or eBPF backend.  Got it.  Sorry for being slow.  And the requests
>come down via change() op or something new?  I wonder how such scheme
>compares to eBPF maps performance-wise (updates/sec).

I have no numbers at this time. I guess Jamal and Alexei did some
measurements in this are in the past.


>
>> >perspective the SW AST implementation would probably not be any slower
>> >than u32, so I don't think we need eBPF for performance.  I must be
>> >misreading this, if we want eBPF fallback we must extend eBPF with all
>> >the map types anyway... so we could just use eBPF map API?  I believe
>> >John has already done some work in this space (see his GitHub :))  
>> 
>> I don't think you can use existing BPF maps kernel API. You would still
>> have to have another API just for the offloaded datapath. And that is
>> a bypass. I strongly believe we need a single kernel API for both
>> SW and HW datapath setup and runtime configuration.
>
>Agreed, single API is a must.  What is the HW characteristic which
>doesn't fit with eBPF map API, though?  For eBPF offload I was planning
>on adding offload hooks on eBPF map lookup/update paths and a way of
>associating the map with a netdev.  This should be enough to forward
>updates to the driver and intercept reads to return the right
>statistics.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30  7:44     ` Jiri Pirko
  2016-10-30 10:26       ` Thomas Graf
@ 2016-10-30 20:54       ` John Fastabend
  1 sibling, 0 replies; 41+ messages in thread
From: John Fastabend @ 2016-10-30 20:54 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jakub Kicinski, netdev, davem, tgraf, jhs, roopa, simon.horman,
	ast, daniel, prem, hannes, jbenc, tom, mattyk, idosch, eladr,
	yotamg, nogahf, ogerlitz, linville, andy, f.fainelli, dsa,
	vivien.didelot, andrew, ivecera, Maciej Żenczykowski

[...]

> 
> Yeah, I was also thinking about something similar to your Flow-API,
> but we need something more generic I believe.

I've heard this in a couple other forums as well but please elaborate
exactly what needs to be more generic? That API is sufficient to both
express the init time piece of the original P4 draft and the runtime
component.

I guess we are trying to strike a balance here between the ability
to actually write an IR that a sufficiently large subset of hardware
can support "easily" and something that can support all possible
hardware features.

IMO this leads to something like the Flow-API in the first case or
to something like eBPF for all possible features.

> 
> 
>>
>> We also have an emulated path also auto-generated from compiler tools
>> that creates eBPF code from the IR so this would give you the software
>> fall-back.
> 
> 
> Btw, Flow-API was rejected because it was a clean kernel-bypass. In case
> of p4, if we do what Thomas is suggesting, having x.bpf for SW and
> x.p4ast for HW, that would be the very same kernel-bypass. Therefore I
> strongly believe there should be a single kernel API for p4 SW+HW - for
> both p4 program insertion and runtime configuration.

Another area of push-back came from creating yet another infrastructure.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 19:56                 ` Jiri Pirko
@ 2016-10-30 21:14                   ` John Fastabend
  0 siblings, 0 replies; 41+ messages in thread
From: John Fastabend @ 2016-10-30 21:14 UTC (permalink / raw)
  To: Jiri Pirko, Jakub Kicinski
  Cc: Thomas Graf, netdev, davem, jhs, roopa, simon.horman, ast,
	daniel, prem, hannes, jbenc, tom, mattyk, idosch, eladr, yotamg,
	nogahf, ogerlitz, linville, andy, f.fainelli, dsa,
	vivien.didelot, andrew, ivecera, Maciej Żenczykowski

On 16-10-30 12:56 PM, Jiri Pirko wrote:
> Sun, Oct 30, 2016 at 07:44:43PM CET, kubakici@wp.pl wrote:
>> On Sun, 30 Oct 2016 19:01:03 +0100, Jiri Pirko wrote:
>>> Sun, Oct 30, 2016 at 06:45:26PM CET, kubakici@wp.pl wrote:
>>>> On Sun, 30 Oct 2016 17:38:36 +0100, Jiri Pirko wrote:  
>>>>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:  
>>>  [...]  
>>>  [...]  
>>>>>  [...]  
>>>>>  [...]  
>>>>>  [...]  
>>>>>  [...]    
>>>  [...]  
>>>>>
>>>>> Agreed.  
>>>>
>>>> Just to clarify my intention here was not to suggest the use of eBPF as
>>>> the IR.  I was merely cautioning against bundling the new API with P4,
>>>> for multiple reasons.  As John mentioned P4 spec was evolving in the
>>>> past.  The spec is designed for HW more capable than the switch ASICs we
>>>> have today.  As vendors move to provide more configurability we may need
>>>> to extend the API beyond P4.  We may want to extend this API to for SW
>>>> hand-offs (as suggested by Thomas) which are not part of P4 spec.  Also
>>>> John showed examples of matchd software which already uses P4 at the
>>>> frontend today and translates it to different targets (eBPF, u32, HW).
>>>> It may just be about the naming but I feel like calling the new API
>>>> more generically, switch AST or some such may help to avoid unnecessary
>>>> ties and confusion.  
>>>
>>> Well, that basically means to create "something" that could be be used
>>> to translate p4 source to. Not sure how exactly this "something" should
>>> look like and how different would it be from p4. I thought it might
>>> be good to benefit from the p4 definition and use it directly. Not sure.
>>
>> We have to translate the P4 into "something" already, that something
>> is the AST we will load into the kernel.  Or were you planning to use
>> some official P4 AST?  I'm not suggesting we add our own high level
> 
> I'm not aware of existence of some official P4 AST. We have to figure it
> out.
> 

The compilers at p4.org have an AST so you could claim those are in some
sense "official". Also given the BNF published in the p4 spec lends
itself to what the AST should look like.

Also FWIW the AST is not necessarily the same as the IR.

> 
>> language.  I agree that P4 is a good starting point, and perhaps a good
>> high level language.  I'm just cautious of creating an equivalency
>> between high level language (P4) and the kernel ABI.
> 
> Understood. Definitelly good to be very cautious when defining a kernel
> API.
> 
> 

And another point that came up (trying to unify threads a bit)

"I wonder why p4 does not handle the HW capabilities. At least I did
not find it. It would be certainly nice to have it."

One of the points of P4 is that the hardware should be configurable. So
given a P4 definition of a parse graph, table layout, etc. the hardware
should configure itself to support that "program". The reason you don't
see any HW capabilities is because the "program" is exactly what the
hardware is expected to run. Also the P4 spec does not provide a
definition or a "runtime" API. This will at some point be defined in
another spec.

So a clarifying point are you expecting hardware to reconfigure itself
to match the P4 program or are you simply using this to configure TCAM
slices and building a runtime API.

For example if a P4 program gives a new parse graph that is not
supported by the hardware should it be rejected. From the flow-api
you will see a handful of get_* operations but no set_* operations.
Because the set_* path has to come down to the hardware in
ucode/low-level firmware updates. Its unlikely that vendors will
want to expose ucode/etc.

The set_flow/get_flow bits could be mapped onto a cls_p4 or a
cls_switch as I think was hinted above.

Thanks,
John

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 16:38         ` Jiri Pirko
  2016-10-30 17:45           ` Jakub Kicinski
@ 2016-10-30 22:39           ` Alexei Starovoitov
  2016-10-31  6:03             ` Maciej Żenczykowski
  2016-10-31  9:39             ` Jiri Pirko
  1 sibling, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2016-10-30 22:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Thomas Graf, John Fastabend, Jakub Kicinski, netdev, davem, jhs,
	roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
> >On 10/30/16 at 08:44am, Jiri Pirko wrote:
> >> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
> >> >On 16-10-29 07:49 AM, Jakub Kicinski wrote:
> >> >> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
> >> >>> Hi all.
> >> >>>

sorry for delay. travelling to KS, so probably missed something in
this thread and comments can be totally off...

the subject "let's do P4" is imo misleading, since it reads like
we don't do P4 at the moment, whereas the opposite is true.
Several p4->bpf compilers is a proof.

> The network world is divided into 2 general types of hw:
> 1) network ASICs - network specific silicon, containing things like TCAM
>    These ASICs are suitable to be programmed by P4.

i think the opposite is the case in case of P4.
when hw asic has tcam it's still far far away from being usable with P4
which requires fully programmable protocol parser, arbitrary tables and so on.
P4 doesn't even define TCAM as a table type. The p4 program can declare
a desired algorithm of search in the table and compiler has to figure out
what HW resources to use to satisfy such p4 program.

> 2) network processors - basically a general purpose CPUs
>    These processors are suitable to be programmed by eBPF.

I think this statement is also misleading, since it positions
p4 and bpf as competitors whereas that's not the case.
p4 is the language. bpf is an instruction set.

> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
> 
>                                  |
>                                  |               +--> ebpf engine
>                                  |               |
>                                  |               |
>                                  |           compilerB
>                                  |               ^
>                                  |               |
> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>                                  |
>                        userspace | kernel
>                                  |

frankly this diagram smells very much like kernel bypass to me,
since I cannot see how one can put the whole p4 language compiler
into the driver, so this last step of p4ast->hw, I presume, will be
done by firmware, which will be running full compiler in an embedded cpu
on the switch. To me that's precisely the kernel bypass, since we won't
have a clue what HW capabilities actually are and won't be able to fine
grain control them.
Please correct me if I'm wrong.

> Plus the thing I cannot imagine in the model you propose is table fillup.
> For ebpf, you use maps. For p4 you would have to have a separate HW-only
> API. This is very similar to the original John's Flow-API. And therefore
> a kernel bypass.

I think John's flow api is a better way to expose mellanox switch capabilities.
I also think it's not fair to call it 'bypass'. I see nothing in it
that justify such 'swear word' ;)
The goal of flow api was to expose HW features to user space, so that
user space can program it. For something simple as mellanox switch
asic it fits perfectly well.
Unless I misunderstand the bigger goal of this discussion and it's
about programming ezchip devices.

If the goal is to model hw tcam in the linux kernel then just introduce
tcam bpf map type. It will be dog slow in user space, but it will
match exactly what is happnening in the HW and user space can make
sensible trade-offs.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 22:39           ` Alexei Starovoitov
@ 2016-10-31  6:03             ` Maciej Żenczykowski
  2016-10-31  7:47               ` Jiri Pirko
  2016-10-31  9:39             ` Jiri Pirko
  1 sibling, 1 reply; 41+ messages in thread
From: Maciej Żenczykowski @ 2016-10-31  6:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Pirko, Thomas Graf, John Fastabend, Jakub Kicinski,
	Linux NetDev, David Miller, Jamal Hadi Salim, roopa,
	simon.horman, ast, daniel, prem, Hannes Frederic Sowa, Jiri Benc,
	Tom Herbert, mattyk, idosch, eladr, yotamg, nogahf, ogerlitz,
	John W. Linville, Andy Gospodarek, Florian Fainelli, dsa,
	vivien.d

One thing to consider...

Just because the compiler could be in the kernel, doesn't mean it has to be.

One could envision a hotplug/modprobe like helper program that the
kernel executes
when it wants to translate from one encoding (say p4) to another (say [e]bpf).

This keeps complexity (compiler) out of the kernel, while still
allowing us to have
the illusion of only one interface to sw/hw.  And it has the nice
benefit of allowing us
to use existing compiler toolchains...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31  6:03             ` Maciej Żenczykowski
@ 2016-10-31  7:47               ` Jiri Pirko
  0 siblings, 0 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-31  7:47 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Alexei Starovoitov, Thomas Graf, John Fastabend, Jakub Kicinski,
	Linux NetDev, David Miller, Jamal Hadi Salim, roopa,
	simon.horman, ast, daniel, prem, Hannes Frederic Sowa, Jiri Benc,
	Tom Herbert, mattyk, idosch, eladr, yotamg, nogahf, ogerlitz,
	John W. Linville, Andy Gospodarek, Florian Fainelli, dsa

Mon, Oct 31, 2016 at 07:03:53AM CET, zenczykowski@gmail.com wrote:
>One thing to consider...
>
>Just because the compiler could be in the kernel, doesn't mean it has to be.
>
>One could envision a hotplug/modprobe like helper program that the
>kernel executes
>when it wants to translate from one encoding (say p4) to another (say [e]bpf).
>
>This keeps complexity (compiler) out of the kernel, while still
>allowing us to have
>the illusion of only one interface to sw/hw.  And it has the nice
>benefit of allowing us
>to use existing compiler toolchains...

This idea was repeatedly marked as unacceptable.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-30 22:39           ` Alexei Starovoitov
  2016-10-31  6:03             ` Maciej Żenczykowski
@ 2016-10-31  9:39             ` Jiri Pirko
  2016-10-31 16:53               ` John Fastabend
  2016-11-02  2:29               ` Daniel Borkmann
  1 sibling, 2 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-31  9:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Graf, John Fastabend, Jakub Kicinski, netdev, davem, jhs,
	roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Sun, Oct 30, 2016 at 11:39:05PM CET, alexei.starovoitov@gmail.com wrote:
>On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>> >On 10/30/16 at 08:44am, Jiri Pirko wrote:
>> >> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>> >> >On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>> >> >> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> >> >>> Hi all.
>> >> >>>
>
>sorry for delay. travelling to KS, so probably missed something in
>this thread and comments can be totally off...
>
>the subject "let's do P4" is imo misleading, since it reads like
>we don't do P4 at the moment, whereas the opposite is true.
>Several p4->bpf compilers is a proof.

We don't do p4 in kernel now, we don't do p4 offloading now. That is
the reason I started this discussion.


>
>> The network world is divided into 2 general types of hw:
>> 1) network ASICs - network specific silicon, containing things like TCAM
>>    These ASICs are suitable to be programmed by P4.
>
>i think the opposite is the case in case of P4.
>when hw asic has tcam it's still far far away from being usable with P4
>which requires fully programmable protocol parser, arbitrary tables and so on.
>P4 doesn't even define TCAM as a table type. The p4 program can declare
>a desired algorithm of search in the table and compiler has to figure out
>what HW resources to use to satisfy such p4 program.
>
>> 2) network processors - basically a general purpose CPUs
>>    These processors are suitable to be programmed by eBPF.
>
>I think this statement is also misleading, since it positions
>p4 and bpf as competitors whereas that's not the case.
>p4 is the language. bpf is an instruction set.

I wanted to say that we are having 2 approaches in silicon, 2 different
paradigms. Sure you can do p4>bpf. But hard to do it the opposite way.


>
>> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>> 
>>                                  |
>>                                  |               +--> ebpf engine
>>                                  |               |
>>                                  |               |
>>                                  |           compilerB
>>                                  |               ^
>>                                  |               |
>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>                                  |
>>                        userspace | kernel
>>                                  |
>
>frankly this diagram smells very much like kernel bypass to me,

what? That is well defined kernel API, in-kernel sw consumer and offload
in driver. Same API for both.

Alex, you have very odd sense about what's bypassing kernel. That kind
of freaks me out...


>since I cannot see how one can put the whole p4 language compiler
>into the driver, so this last step of p4ast->hw, I presume, will be
>done by firmware, which will be running full compiler in an embedded cpu

In case of mlxsw, that compiler would be in driver.


>on the switch. To me that's precisely the kernel bypass, since we won't
>have a clue what HW capabilities actually are and won't be able to fine
>grain control them.
>Please correct me if I'm wrong.

You are wrong. By your definition, everything has to be figured out in
driver and FW does nothing. Otherwise it could do "something else" and
that would be a bypass? Does not make any sense to me whatsoever.


>
>> Plus the thing I cannot imagine in the model you propose is table fillup.
>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>> API. This is very similar to the original John's Flow-API. And therefore
>> a kernel bypass.
>
>I think John's flow api is a better way to expose mellanox switch capabilities.

We are under impression that p4 suits us nicely. But it is not about
us, it is about finding the common way to do this.


>I also think it's not fair to call it 'bypass'. I see nothing in it
>that justify such 'swear word' ;)

John's Flow-API was a kernel bypass. Why? It was a API specifically
designed to directly work with HW tables, without kernel being involved.


>The goal of flow api was to expose HW features to user space, so that
>user space can program it. For something simple as mellanox switch
>asic it fits perfectly well.

Again, this is not mlx-asic-specific. And again, that is a kernel bypass.


>Unless I misunderstand the bigger goal of this discussion and it's
>about programming ezchip devices.

No. For network processors, I believe that BPF is nicely offloadable, no
need to do the excercise for that.


>
>If the goal is to model hw tcam in the linux kernel then just introduce
>tcam bpf map type. It will be dog slow in user space, but it will
>match exactly what is happnening in the HW and user space can make
>sensible trade-offs.

No, you got me completely wrong. This is not about the TCAM. This is
about differences in the 2 words (p4/bpf).
Again, for "p4-ish" devices, you have to translate BPF. And as you
noted, it's an instruction set. Very hard if not impossible to parse in
order to get back the original semantics.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31  9:39             ` Jiri Pirko
@ 2016-10-31 16:53               ` John Fastabend
  2016-10-31 17:12                 ` Jiri Pirko
  2016-11-02  2:29               ` Daniel Borkmann
  1 sibling, 1 reply; 41+ messages in thread
From: John Fastabend @ 2016-10-31 16:53 UTC (permalink / raw)
  To: Jiri Pirko, Alexei Starovoitov
  Cc: Thomas Graf, Jakub Kicinski, netdev, davem, jhs, roopa,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On 16-10-31 02:39 AM, Jiri Pirko wrote:
> Sun, Oct 30, 2016 at 11:39:05PM CET, alexei.starovoitov@gmail.com wrote:
>> On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
>>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>>>> On 10/30/16 at 08:44am, Jiri Pirko wrote:
>>>>> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>>>>>> On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>>>>>>> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>>>>>>>> Hi all.
>>>>>>>>
>>
>> sorry for delay. travelling to KS, so probably missed something in
>> this thread and comments can be totally off...
>>
>> the subject "let's do P4" is imo misleading, since it reads like
>> we don't do P4 at the moment, whereas the opposite is true.
>> Several p4->bpf compilers is a proof.
> 
> We don't do p4 in kernel now, we don't do p4 offloading now. That is
> the reason I started this discussion.
> 

The point here is P4 is a high level language likely we will never "do"
P4 in the kernel nor offload it. P4 translates to eBPF and runs in
kernel just fine. This can be offloaded to some devices but as you
point out is challenging for a class of architecture to offload.

Also simple P4 programs can be offloaded into 'tc' cls_u32 for example
and even cls_flower.

> 
>>
>>> The network world is divided into 2 general types of hw:
>>> 1) network ASICs - network specific silicon, containing things like TCAM
>>>    These ASICs are suitable to be programmed by P4.
>>
>> i think the opposite is the case in case of P4.
>> when hw asic has tcam it's still far far away from being usable with P4
>> which requires fully programmable protocol parser, arbitrary tables and so on.
>> P4 doesn't even define TCAM as a table type. The p4 program can declare
>> a desired algorithm of search in the table and compiler has to figure out
>> what HW resources to use to satisfy such p4 program.
>>
>>> 2) network processors - basically a general purpose CPUs
>>>    These processors are suitable to be programmed by eBPF.
>>
>> I think this statement is also misleading, since it positions
>> p4 and bpf as competitors whereas that's not the case.
>> p4 is the language. bpf is an instruction set.
> 
> I wanted to say that we are having 2 approaches in silicon, 2 different
> paradigms. Sure you can do p4>bpf. But hard to do it the opposite way.
> 
> 
>>
>>> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>>>
>>>                                  |
>>>                                  |               +--> ebpf engine
>>>                                  |               |
>>>                                  |               |
>>>                                  |           compilerB
>>>                                  |               ^
>>>                                  |               |
>>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>>                                  |
>>>                        userspace | kernel
>>>                                  |
>>
>> frankly this diagram smells very much like kernel bypass to me,
> 
> what? That is well defined kernel API, in-kernel sw consumer and offload
> in driver. Same API for both.
> 
> Alex, you have very odd sense about what's bypassing kernel. That kind
> of freaks me out...
> 

I think the issue with offloading a P4-AST will be how much work goes
into mapping this onto any particular hardware instance. And how much
of the P4 language feature set is exposed.

For example I suspect MLX switch has a different pipeline than MLX NIC
and even different variations of the product lines. The same goes for
Intel pipeline in NIC and switch and different products in same line.

If P4-ast describes the exact instance of the hardware its an easy task
the map is 1:1 but isn't exactly portable. Taking an N table onto a M
table pipeline on the other hand is a bit more work and requires various
transformations to occur in the runtime API. I'm guessing the class of
devices we are talking about here can not reconfigure themselves to
match the P4-ast.

In the naive implementation only pipelines that map 1:1 will work. Maybe
this is what Alexei is noticing?

> 
>> since I cannot see how one can put the whole p4 language compiler
>> into the driver, so this last step of p4ast->hw, I presume, will be
>> done by firmware, which will be running full compiler in an embedded cpu
> 
> In case of mlxsw, that compiler would be in driver.
> 
> 
>> on the switch. To me that's precisely the kernel bypass, since we won't
>> have a clue what HW capabilities actually are and won't be able to fine
>> grain control them.
>> Please correct me if I'm wrong.
> 
> You are wrong. By your definition, everything has to be figured out in
> driver and FW does nothing. Otherwise it could do "something else" and
> that would be a bypass? Does not make any sense to me whatsoever.
> 
> 
>>
>>> Plus the thing I cannot imagine in the model you propose is table fillup.
>>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>>> API. This is very similar to the original John's Flow-API. And therefore
>>> a kernel bypass.
>>
>> I think John's flow api is a better way to expose mellanox switch capabilities.
> 
> We are under impression that p4 suits us nicely. But it is not about
> us, it is about finding the common way to do this.
> 

I'll just poke at my FlowAPI question again. For fixed ASICS what is
the Flow-API missing. We have a few proof points that show it is both
sufficient and usable for the handful of use cases we care about.

> 
>> I also think it's not fair to call it 'bypass'. I see nothing in it
>> that justify such 'swear word' ;)
> 
> John's Flow-API was a kernel bypass. Why? It was a API specifically
> designed to directly work with HW tables, without kernel being involved.

I don't think that is a fair definition of HW bypass. The SKIP_SW flag
does exactly that for 'tc' based offloads and it was not rejected.

The _real_ reason that seems to have fallen out of this and other
discussion is the Flow-API didn't provide an in-kernel translation into
an emulated patch. Note we always had a usermode translation to eBPF.
A secondary reason appears to be overhead of adding yet another netlink
family.

> 
> 
>> The goal of flow api was to expose HW features to user space, so that
>> user space can program it. For something simple as mellanox switch
>> asic it fits perfectly well.
> 
> Again, this is not mlx-asic-specific. And again, that is a kernel bypass.
> 
> 
>> Unless I misunderstand the bigger goal of this discussion and it's
>> about programming ezchip devices.
> 
> No. For network processors, I believe that BPF is nicely offloadable, no
> need to do the excercise for that.
> 
> 
>>
>> If the goal is to model hw tcam in the linux kernel then just introduce
>> tcam bpf map type. It will be dog slow in user space, but it will
>> match exactly what is happnening in the HW and user space can make
>> sensible trade-offs.
> 
> No, you got me completely wrong. This is not about the TCAM. This is
> about differences in the 2 words (p4/bpf).
> Again, for "p4-ish" devices, you have to translate BPF. And as you
> noted, it's an instruction set. Very hard if not impossible to parse in
> order to get back the original semantics.
> 

I think in this discussion "p4-ish" devices means devices with multiple
tables in a pipeline? Not devices that have programmable/configurable
pipelines right? And if we get to talking about reconfigurable devices
I believe this should be done out of band as it typically means
reloading some ucode, etc.

.John

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31 16:53               ` John Fastabend
@ 2016-10-31 17:12                 ` Jiri Pirko
  2016-10-31 18:32                   ` Hannes Frederic Sowa
  2016-10-31 19:35                   ` John Fastabend
  0 siblings, 2 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-10-31 17:12 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Mon, Oct 31, 2016 at 05:53:38PM CET, john.fastabend@gmail.com wrote:
>On 16-10-31 02:39 AM, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 11:39:05PM CET, alexei.starovoitov@gmail.com wrote:
>>> On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
>>>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>>>>> On 10/30/16 at 08:44am, Jiri Pirko wrote:
>>>>>> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>>>>>>> On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>>>>>>>> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>>>>>>>>> Hi all.
>>>>>>>>>
>>>
>>> sorry for delay. travelling to KS, so probably missed something in
>>> this thread and comments can be totally off...
>>>
>>> the subject "let's do P4" is imo misleading, since it reads like
>>> we don't do P4 at the moment, whereas the opposite is true.
>>> Several p4->bpf compilers is a proof.
>> 
>> We don't do p4 in kernel now, we don't do p4 offloading now. That is
>> the reason I started this discussion.
>> 
>
>The point here is P4 is a high level language likely we will never "do"
>P4 in the kernel nor offload it. P4 translates to eBPF and runs in
>kernel just fine. This can be offloaded to some devices but as you
>point out is challenging for a class of architecture to offload.
>
>Also simple P4 programs can be offloaded into 'tc' cls_u32 for example
>and even cls_flower.
>
>> 
>>>
>>>> The network world is divided into 2 general types of hw:
>>>> 1) network ASICs - network specific silicon, containing things like TCAM
>>>>    These ASICs are suitable to be programmed by P4.
>>>
>>> i think the opposite is the case in case of P4.
>>> when hw asic has tcam it's still far far away from being usable with P4
>>> which requires fully programmable protocol parser, arbitrary tables and so on.
>>> P4 doesn't even define TCAM as a table type. The p4 program can declare
>>> a desired algorithm of search in the table and compiler has to figure out
>>> what HW resources to use to satisfy such p4 program.
>>>
>>>> 2) network processors - basically a general purpose CPUs
>>>>    These processors are suitable to be programmed by eBPF.
>>>
>>> I think this statement is also misleading, since it positions
>>> p4 and bpf as competitors whereas that's not the case.
>>> p4 is the language. bpf is an instruction set.
>> 
>> I wanted to say that we are having 2 approaches in silicon, 2 different
>> paradigms. Sure you can do p4>bpf. But hard to do it the opposite way.
>> 
>> 
>>>
>>>> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>>>>
>>>>                                  |
>>>>                                  |               +--> ebpf engine
>>>>                                  |               |
>>>>                                  |               |
>>>>                                  |           compilerB
>>>>                                  |               ^
>>>>                                  |               |
>>>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>>>                                  |
>>>>                        userspace | kernel
>>>>                                  |
>>>
>>> frankly this diagram smells very much like kernel bypass to me,
>> 
>> what? That is well defined kernel API, in-kernel sw consumer and offload
>> in driver. Same API for both.
>> 
>> Alex, you have very odd sense about what's bypassing kernel. That kind
>> of freaks me out...
>> 
>
>I think the issue with offloading a P4-AST will be how much work goes
>into mapping this onto any particular hardware instance. And how much
>of the P4 language feature set is exposed.
>
>For example I suspect MLX switch has a different pipeline than MLX NIC
>and even different variations of the product lines. The same goes for
>Intel pipeline in NIC and switch and different products in same line.
>
>If P4-ast describes the exact instance of the hardware its an easy task
>the map is 1:1 but isn't exactly portable. Taking an N table onto a M
>table pipeline on the other hand is a bit more work and requires various
>transformations to occur in the runtime API. I'm guessing the class of
>devices we are talking about here can not reconfigure themselves to
>match the P4-ast.

I believe we can assume that. the p4ast has to be generic as the
original p4source is. It would be a terrible mistake to couple it with
some specific hardware. I only want to use p4ast because it would be easy
parse in kernel, unlike p4source.


>
>In the naive implementation only pipelines that map 1:1 will work. Maybe
>this is what Alexei is noticing?

P4 is ment to program programable hw, not fixed pipeline.


>
>> 
>>> since I cannot see how one can put the whole p4 language compiler
>>> into the driver, so this last step of p4ast->hw, I presume, will be
>>> done by firmware, which will be running full compiler in an embedded cpu
>> 
>> In case of mlxsw, that compiler would be in driver.
>> 
>> 
>>> on the switch. To me that's precisely the kernel bypass, since we won't
>>> have a clue what HW capabilities actually are and won't be able to fine
>>> grain control them.
>>> Please correct me if I'm wrong.
>> 
>> You are wrong. By your definition, everything has to be figured out in
>> driver and FW does nothing. Otherwise it could do "something else" and
>> that would be a bypass? Does not make any sense to me whatsoever.
>> 
>> 
>>>
>>>> Plus the thing I cannot imagine in the model you propose is table fillup.
>>>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>>>> API. This is very similar to the original John's Flow-API. And therefore
>>>> a kernel bypass.
>>>
>>> I think John's flow api is a better way to expose mellanox switch capabilities.
>> 
>> We are under impression that p4 suits us nicely. But it is not about
>> us, it is about finding the common way to do this.
>> 
>
>I'll just poke at my FlowAPI question again. For fixed ASICS what is
>the Flow-API missing. We have a few proof points that show it is both
>sufficient and usable for the handful of use cases we care about.

Yeah, it is most probably fine. Even for flex ASICs to some point. The
question is how it stands comparing to other alternatives, like p4


>
>> 
>>> I also think it's not fair to call it 'bypass'. I see nothing in it
>>> that justify such 'swear word' ;)
>> 
>> John's Flow-API was a kernel bypass. Why? It was a API specifically
>> designed to directly work with HW tables, without kernel being involved.
>
>I don't think that is a fair definition of HW bypass. The SKIP_SW flag
>does exactly that for 'tc' based offloads and it was not rejected.

No, no, no. You still have possibility to do the same thing in kernel,
same functionality, with the same API. That is a big difference.


>
>The _real_ reason that seems to have fallen out of this and other
>discussion is the Flow-API didn't provide an in-kernel translation into
>an emulated patch. Note we always had a usermode translation to eBPF.
>A secondary reason appears to be overhead of adding yet another netlink
>family.

Yeah. Maybe you remember, back then when Flow-API was being discussed,
I suggested to wrap it under TC as cls_xflows and cls_xflowsaction of
some sort and do in-kernel datapath implementation. I believe that after
that, it would be acceptable.


>
>> 
>> 
>>> The goal of flow api was to expose HW features to user space, so that
>>> user space can program it. For something simple as mellanox switch
>>> asic it fits perfectly well.
>> 
>> Again, this is not mlx-asic-specific. And again, that is a kernel bypass.
>> 
>> 
>>> Unless I misunderstand the bigger goal of this discussion and it's
>>> about programming ezchip devices.
>> 
>> No. For network processors, I believe that BPF is nicely offloadable, no
>> need to do the excercise for that.
>> 
>> 
>>>
>>> If the goal is to model hw tcam in the linux kernel then just introduce
>>> tcam bpf map type. It will be dog slow in user space, but it will
>>> match exactly what is happnening in the HW and user space can make
>>> sensible trade-offs.
>> 
>> No, you got me completely wrong. This is not about the TCAM. This is
>> about differences in the 2 words (p4/bpf).
>> Again, for "p4-ish" devices, you have to translate BPF. And as you
>> noted, it's an instruction set. Very hard if not impossible to parse in
>> order to get back the original semantics.
>> 
>
>I think in this discussion "p4-ish" devices means devices with multiple
>tables in a pipeline? Not devices that have programmable/configurable
>pipelines right? And if we get to talking about reconfigurable devices
>I believe this should be done out of band as it typically means
>reloading some ucode, etc.

I'm talking about both. But I think we should focus on reconfigurable
ones, as we probably won't see that much fixed ones in the future.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31 17:12                 ` Jiri Pirko
@ 2016-10-31 18:32                   ` Hannes Frederic Sowa
  2016-10-31 19:35                   ` John Fastabend
  1 sibling, 0 replies; 41+ messages in thread
From: Hannes Frederic Sowa @ 2016-10-31 18:32 UTC (permalink / raw)
  To: Jiri Pirko, John Fastabend
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On 31.10.2016 18:12, Jiri Pirko wrote:
>> >
>> >In the naive implementation only pipelines that map 1:1 will work. Maybe
>> >this is what Alexei is noticing?
> P4 is ment to program programable hw, not fixed pipeline.

Is it realistic to assume that future hardware might be programmed with
a proprietary (FPGA-alike) bitstream where a generic API wouldn't fit
anymore? I could imagine vendors shipping a higher abstracted
VHDL/Verilog compiler in the future and expect the kernel just forward
it to the hardware as-is.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31 17:12                 ` Jiri Pirko
  2016-10-31 18:32                   ` Hannes Frederic Sowa
@ 2016-10-31 19:35                   ` John Fastabend
  2016-11-01  8:46                     ` Jiri Pirko
  1 sibling, 1 reply; 41+ messages in thread
From: John Fastabend @ 2016-10-31 19:35 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

[...]

>>>
>>
>> I think the issue with offloading a P4-AST will be how much work goes
>> into mapping this onto any particular hardware instance. And how much
>> of the P4 language feature set is exposed.
>>
>> For example I suspect MLX switch has a different pipeline than MLX NIC
>> and even different variations of the product lines. The same goes for
>> Intel pipeline in NIC and switch and different products in same line.
>>
>> If P4-ast describes the exact instance of the hardware its an easy task
>> the map is 1:1 but isn't exactly portable. Taking an N table onto a M
>> table pipeline on the other hand is a bit more work and requires various
>> transformations to occur in the runtime API. I'm guessing the class of
>> devices we are talking about here can not reconfigure themselves to
>> match the P4-ast.
> 
> I believe we can assume that. the p4ast has to be generic as the
> original p4source is. It would be a terrible mistake to couple it with
> some specific hardware. I only want to use p4ast because it would be easy
> parse in kernel, unlike p4source.

Sure but in the fixed ASIC cases the universe of P4 programs is much
larger than the handful of ones that can be 'accepted' by the device. So
you really need to have some knowledge of the hardware. However if you
believe (guessing from last bullet) that devices will be configurable
in the future then its more likely that the hardware can 'accept' the
program.

> 
> 
>>
>> In the naive implementation only pipelines that map 1:1 will work. Maybe
>> this is what Alexei is noticing?
> 
> P4 is ment to program programable hw, not fixed pipeline.
> 

I'm guessing there are no upstream drivers at the moment that support
this though right? The rocker universe bits though could leverage this.

> 
>>
>>>
>>>> since I cannot see how one can put the whole p4 language compiler
>>>> into the driver, so this last step of p4ast->hw, I presume, will be
>>>> done by firmware, which will be running full compiler in an embedded cpu
>>>
>>> In case of mlxsw, that compiler would be in driver.
>>>
>>>
>>>> on the switch. To me that's precisely the kernel bypass, since we won't
>>>> have a clue what HW capabilities actually are and won't be able to fine
>>>> grain control them.
>>>> Please correct me if I'm wrong.
>>>
>>> You are wrong. By your definition, everything has to be figured out in
>>> driver and FW does nothing. Otherwise it could do "something else" and
>>> that would be a bypass? Does not make any sense to me whatsoever.
>>>
>>>
>>>>
>>>>> Plus the thing I cannot imagine in the model you propose is table fillup.
>>>>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>>>>> API. This is very similar to the original John's Flow-API. And therefore
>>>>> a kernel bypass.
>>>>
>>>> I think John's flow api is a better way to expose mellanox switch capabilities.
>>>
>>> We are under impression that p4 suits us nicely. But it is not about
>>> us, it is about finding the common way to do this.
>>>
>>
>> I'll just poke at my FlowAPI question again. For fixed ASICS what is
>> the Flow-API missing. We have a few proof points that show it is both
>> sufficient and usable for the handful of use cases we care about.
> 
> Yeah, it is most probably fine. Even for flex ASICs to some point. The
> question is how it stands comparing to other alternatives, like p4
> 

Just to be clear the Flow-API _was_ generated from the initial P4 spec.
The header files and tools used with it were autogenerated ("compiled"
in a loose sense) from the P4 program. The piece I never exposed
was the set_* operations to reconfigure running systems. I'm not sure
how valuable this is in practice though.

Also there is a P4-16 spec that will be released shortly that is more
flexible and also more complex.

> 
>>
>>>
>>>> I also think it's not fair to call it 'bypass'. I see nothing in it
>>>> that justify such 'swear word' ;)
>>>
>>> John's Flow-API was a kernel bypass. Why? It was a API specifically
>>> designed to directly work with HW tables, without kernel being involved.
>>
>> I don't think that is a fair definition of HW bypass. The SKIP_SW flag
>> does exactly that for 'tc' based offloads and it was not rejected.
> 
> No, no, no. You still have possibility to do the same thing in kernel,
> same functionality, with the same API. That is a big difference.
> 
> 
>>
>> The _real_ reason that seems to have fallen out of this and other
>> discussion is the Flow-API didn't provide an in-kernel translation into
>> an emulated patch. Note we always had a usermode translation to eBPF.
>> A secondary reason appears to be overhead of adding yet another netlink
>> family.
> 
> Yeah. Maybe you remember, back then when Flow-API was being discussed,
> I suggested to wrap it under TC as cls_xflows and cls_xflowsaction of
> some sort and do in-kernel datapath implementation. I believe that after
> that, it would be acceptable.
> 

As I understand the thread here that is exactly the proposal here right?
With a discussion around if the structures/etc are sufficient or any
alternative representations exist.

> 
>>
>>>
>>>
>>>> The goal of flow api was to expose HW features to user space, so that
>>>> user space can program it. For something simple as mellanox switch
>>>> asic it fits perfectly well.
>>>
>>> Again, this is not mlx-asic-specific. And again, that is a kernel bypass.
>>>
>>>
>>>> Unless I misunderstand the bigger goal of this discussion and it's
>>>> about programming ezchip devices.
>>>
>>> No. For network processors, I believe that BPF is nicely offloadable, no
>>> need to do the excercise for that.
>>>
>>>
>>>>
>>>> If the goal is to model hw tcam in the linux kernel then just introduce
>>>> tcam bpf map type. It will be dog slow in user space, but it will
>>>> match exactly what is happnening in the HW and user space can make
>>>> sensible trade-offs.
>>>
>>> No, you got me completely wrong. This is not about the TCAM. This is
>>> about differences in the 2 words (p4/bpf).
>>> Again, for "p4-ish" devices, you have to translate BPF. And as you
>>> noted, it's an instruction set. Very hard if not impossible to parse in
>>> order to get back the original semantics.
>>>
>>
>> I think in this discussion "p4-ish" devices means devices with multiple
>> tables in a pipeline? Not devices that have programmable/configurable
>> pipelines right? And if we get to talking about reconfigurable devices
>> I believe this should be done out of band as it typically means
>> reloading some ucode, etc.
> 
> I'm talking about both. But I think we should focus on reconfigurable
> ones, as we probably won't see that much fixed ones in the future.
> 

hmm maybe but the 10/40/100Gbps devices are going to be around for some
time. So we need to ensure these work well.

.John
	

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31 19:35                   ` John Fastabend
@ 2016-11-01  8:46                     ` Jiri Pirko
  2016-11-01 15:13                       ` John Fastabend
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-11-01  8:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Mon, Oct 31, 2016 at 08:35:00PM CET, john.fastabend@gmail.com wrote:
>[...]
>
>>>>
>>>
>>> I think the issue with offloading a P4-AST will be how much work goes
>>> into mapping this onto any particular hardware instance. And how much
>>> of the P4 language feature set is exposed.
>>>
>>> For example I suspect MLX switch has a different pipeline than MLX NIC
>>> and even different variations of the product lines. The same goes for
>>> Intel pipeline in NIC and switch and different products in same line.
>>>
>>> If P4-ast describes the exact instance of the hardware its an easy task
>>> the map is 1:1 but isn't exactly portable. Taking an N table onto a M
>>> table pipeline on the other hand is a bit more work and requires various
>>> transformations to occur in the runtime API. I'm guessing the class of
>>> devices we are talking about here can not reconfigure themselves to
>>> match the P4-ast.
>> 
>> I believe we can assume that. the p4ast has to be generic as the
>> original p4source is. It would be a terrible mistake to couple it with
>> some specific hardware. I only want to use p4ast because it would be easy
>> parse in kernel, unlike p4source.
>
>Sure but in the fixed ASIC cases the universe of P4 programs is much
>larger than the handful of ones that can be 'accepted' by the device. So
>you really need to have some knowledge of the hardware. However if you
>believe (guessing from last bullet) that devices will be configurable
>in the future then its more likely that the hardware can 'accept' the
>program.
>
>> 
>> 
>>>
>>> In the naive implementation only pipelines that map 1:1 will work. Maybe
>>> this is what Alexei is noticing?
>> 
>> P4 is ment to program programable hw, not fixed pipeline.
>> 
>
>I'm guessing there are no upstream drivers at the moment that support
>this though right? The rocker universe bits though could leverage this.

mlxsw. But this is naturaly not implemented yet, as there is no
infrastructure.


>
>> 
>>>
>>>>
>>>>> since I cannot see how one can put the whole p4 language compiler
>>>>> into the driver, so this last step of p4ast->hw, I presume, will be
>>>>> done by firmware, which will be running full compiler in an embedded cpu
>>>>
>>>> In case of mlxsw, that compiler would be in driver.
>>>>
>>>>
>>>>> on the switch. To me that's precisely the kernel bypass, since we won't
>>>>> have a clue what HW capabilities actually are and won't be able to fine
>>>>> grain control them.
>>>>> Please correct me if I'm wrong.
>>>>
>>>> You are wrong. By your definition, everything has to be figured out in
>>>> driver and FW does nothing. Otherwise it could do "something else" and
>>>> that would be a bypass? Does not make any sense to me whatsoever.
>>>>
>>>>
>>>>>
>>>>>> Plus the thing I cannot imagine in the model you propose is table fillup.
>>>>>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>>>>>> API. This is very similar to the original John's Flow-API. And therefore
>>>>>> a kernel bypass.
>>>>>
>>>>> I think John's flow api is a better way to expose mellanox switch capabilities.
>>>>
>>>> We are under impression that p4 suits us nicely. But it is not about
>>>> us, it is about finding the common way to do this.
>>>>
>>>
>>> I'll just poke at my FlowAPI question again. For fixed ASICS what is
>>> the Flow-API missing. We have a few proof points that show it is both
>>> sufficient and usable for the handful of use cases we care about.
>> 
>> Yeah, it is most probably fine. Even for flex ASICs to some point. The
>> question is how it stands comparing to other alternatives, like p4
>> 
>
>Just to be clear the Flow-API _was_ generated from the initial P4 spec.
>The header files and tools used with it were autogenerated ("compiled"
>in a loose sense) from the P4 program. The piece I never exposed
>was the set_* operations to reconfigure running systems. I'm not sure
>how valuable this is in practice though.
>
>Also there is a P4-16 spec that will be released shortly that is more
>flexible and also more complex.

Would it be able to easily extend the Flow-API to include the changes?


>
>> 
>>>
>>>>
>>>>> I also think it's not fair to call it 'bypass'. I see nothing in it
>>>>> that justify such 'swear word' ;)
>>>>
>>>> John's Flow-API was a kernel bypass. Why? It was a API specifically
>>>> designed to directly work with HW tables, without kernel being involved.
>>>
>>> I don't think that is a fair definition of HW bypass. The SKIP_SW flag
>>> does exactly that for 'tc' based offloads and it was not rejected.
>> 
>> No, no, no. You still have possibility to do the same thing in kernel,
>> same functionality, with the same API. That is a big difference.
>> 
>> 
>>>
>>> The _real_ reason that seems to have fallen out of this and other
>>> discussion is the Flow-API didn't provide an in-kernel translation into
>>> an emulated patch. Note we always had a usermode translation to eBPF.
>>> A secondary reason appears to be overhead of adding yet another netlink
>>> family.
>> 
>> Yeah. Maybe you remember, back then when Flow-API was being discussed,
>> I suggested to wrap it under TC as cls_xflows and cls_xflowsaction of
>> some sort and do in-kernel datapath implementation. I believe that after
>> that, it would be acceptable.
>> 
>
>As I understand the thread here that is exactly the proposal here right?
>With a discussion around if the structures/etc are sufficient or any
>alternative representations exist.

Might be the way, yes. But I fear that with other p4 extensions this
might not be easy to align with. Therefore I though about something more
generic, like the p4ast.


>
>> 
>>>
>>>>
>>>>
>>>>> The goal of flow api was to expose HW features to user space, so that
>>>>> user space can program it. For something simple as mellanox switch
>>>>> asic it fits perfectly well.
>>>>
>>>> Again, this is not mlx-asic-specific. And again, that is a kernel bypass.
>>>>
>>>>
>>>>> Unless I misunderstand the bigger goal of this discussion and it's
>>>>> about programming ezchip devices.
>>>>
>>>> No. For network processors, I believe that BPF is nicely offloadable, no
>>>> need to do the excercise for that.
>>>>
>>>>
>>>>>
>>>>> If the goal is to model hw tcam in the linux kernel then just introduce
>>>>> tcam bpf map type. It will be dog slow in user space, but it will
>>>>> match exactly what is happnening in the HW and user space can make
>>>>> sensible trade-offs.
>>>>
>>>> No, you got me completely wrong. This is not about the TCAM. This is
>>>> about differences in the 2 words (p4/bpf).
>>>> Again, for "p4-ish" devices, you have to translate BPF. And as you
>>>> noted, it's an instruction set. Very hard if not impossible to parse in
>>>> order to get back the original semantics.
>>>>
>>>
>>> I think in this discussion "p4-ish" devices means devices with multiple
>>> tables in a pipeline? Not devices that have programmable/configurable
>>> pipelines right? And if we get to talking about reconfigurable devices
>>> I believe this should be done out of band as it typically means
>>> reloading some ucode, etc.
>> 
>> I'm talking about both. But I think we should focus on reconfigurable
>> ones, as we probably won't see that much fixed ones in the future.
>> 
>
>hmm maybe but the 10/40/100Gbps devices are going to be around for some
>time. So we need to ensure these work well.

Yes, but I would like to emphasize, if we are defining new api
the primary focus should be on new devices.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-29  7:53 Let's do P4 Jiri Pirko
  2016-10-29  9:39 ` Thomas Graf
  2016-10-29 14:49 ` Jakub Kicinski
@ 2016-11-01 11:57 ` Jamal Hadi Salim
  2016-11-01 15:03   ` John Fastabend
  2 siblings, 1 reply; 41+ messages in thread
From: Jamal Hadi Salim @ 2016-11-01 11:57 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, tgraf, roopa, john.fastabend, jakub.kicinski,
	simon.horman, ast, daniel, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera


I am in travel mode so havent read the huge blast of
emails (and i am probably taking this email out of
the already discussed topics). I will try to catchup later.

Simple question (same chat I had with Prem at netdev1.2):
What is it that can be expressed by P4 that cant be expressed
with the (userspace) tc grammar? If any i would say the diff
is very small.
Is there something we need to add to kernel tc that will complete
the policy graph needed to express a P4 context?
Essentially if one can express the tc policies with p4 DSL then
that could become another frontend to tc (and a p4 component could
be implemented in classic tc action/classifier or ebpf).

I think trying to express p4 at the coarse granularity it offers
using ebpf is challenging.

cheers,
jamal

On 16-10-29 03:53 AM, Jiri Pirko wrote:
> Hi all.
>
> The network world is divided into 2 general types of hw:
> 1) network ASICs - network specific silicon, containing things like TCAM
>    These ASICs are suitable to be programmed by P4.
> 2) network processors - basically a general purpose CPUs
>    These processors are suitable to be programmed by eBPF.
>
> I believe that by now, the most people came to a conclusion that it is
> very difficult to handle both types by either P4 or eBPF. And since
> eBPF is part of the kernel, I would like to introduce P4 into kernel
> as well. Here's a plan:
>
> 1) Define P4 intermediate representation
>    I cannot imagine loading P4 program (c-like syntax text file) into
>    kernel as is. That means that as the first step, we need find some
>    intermediate representation. I can imagine someting in a form of AST,
>    call it "p4ast". I don't really know how to do this exactly though,
>    it's just an idea.
>
>    In the end there would be a userspace precompiler for this:
>    $ makep4ast example.p4 example.ast
>
> 2) Implement p4ast in-kernel interpreter
>    A kernel module which takes a p4ast and emulates the pipeline.
>    This can be implemented from scratch. Or, p4ast could be compiled
>    to eBPF. I know there are already couple of p4>eBPF compilers.
>    Not sure how feasible it would be to put this compiler in kernel.
>
> 3) Expose the p4ast in-kernel interpreter to userspace
>    As the easiest way I see in to introduce a new TC classifier cls_p4.
>
>    This can work in a very similar way cls_bpf is:
>    $ tc filter add dev eth0 ingress p4 da ast example.ast
>
>    The TC cls_p4 will be also used for runtime table manipulation.
>
> 4) Offload p4ast programs into hardware
>    The same p4ast program representation will be passed down
>    to drivers via existing TC offloading way - ndo_setup_tc.
>    Drivers will then parse it and setup the hardware
>    accordingly. Driver will also have possibility to error out
>    in case it does not support some requested feature.
>
> Thoughts? Ideas?
>
> Thanks,
> 	Jiri
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-01 11:57 ` Jamal Hadi Salim
@ 2016-11-01 15:03   ` John Fastabend
  0 siblings, 0 replies; 41+ messages in thread
From: John Fastabend @ 2016-11-01 15:03 UTC (permalink / raw)
  To: Jamal Hadi Salim, Jiri Pirko, netdev
  Cc: davem, tgraf, roopa, jakub.kicinski, simon.horman, ast, daniel,
	prem, hannes, jbenc, tom, mattyk, idosch, eladr, yotamg, nogahf,
	ogerlitz, linville, andy, f.fainelli, dsa, vivien.didelot,
	andrew, ivecera

On 16-11-01 04:57 AM, Jamal Hadi Salim wrote:
> 
> I am in travel mode so havent read the huge blast of
> emails (and i am probably taking this email out of
> the already discussed topics). I will try to catchup later.
> 
> Simple question (same chat I had with Prem at netdev1.2):
> What is it that can be expressed by P4 that cant be expressed
> with the (userspace) tc grammar? If any i would say the diff
> is very small.

Taking eBPF into account its small if it at all as you note. But,
the real problem is mapping it onto hardware. Pushing eBPF onto
pipeline ASICs is difficult and even when its done it is extremely
fragile as pointed out by folks. cls_u32 works OK IMO although the
mapping between hardware tables and software tables is a bit of an art.
cls_flower has no notion of tables or arbitrary actions and
continuations.

Also P4 is about programming the hardware parse graph, table layout,
etc. and doing this from 'tc' requires drivers to generate very
low level hardware ucode typically or a cpu on the board to process
to the generation from high level commands. At least those are the
only two options I see. I suspect the end result is the reprogramming
of these flexible devices is done out of band via firmware uploading.

> Is there something we need to add to kernel tc that will complete
> the policy graph needed to express a P4 context?
> Essentially if one can express the tc policies with p4 DSL then
> that could become another frontend to tc (and a p4 component could
> be implemented in classic tc action/classifier or ebpf).

per above runtime programming perhaps, configuration of the device
unlikely.

> 
> I think trying to express p4 at the coarse granularity it offers
> using ebpf is challenging.

Nope its actually much easier than "compiling" p4 for hardware IMO.
Mapping P4 onto an instruction set vs mapping it onto some of the
esoteric features of CAM based parsing logic, other non-standard ALU
designs, etc. A lot of the hardware architecture is bent around pushing
32+ ports of 100Gbps through the system which creates some interesting
designs. Caveat being I'm a software guy and hardware folks might have
a different take. I've not found any part of the spec for example that
can not be mapped onto LLVM-IR.

Also there exist a handful of proof points of p4 to ebpf code on the
Internet. We should get a LLVM frontend here shortly.

> 
> cheers,
> jamal
> 
> On 16-10-29 03:53 AM, Jiri Pirko wrote:
>> Hi all.
>>
>> The network world is divided into 2 general types of hw:
>> 1) network ASICs - network specific silicon, containing things like TCAM
>>    These ASICs are suitable to be programmed by P4.
>> 2) network processors - basically a general purpose CPUs
>>    These processors are suitable to be programmed by eBPF.
>>
>> I believe that by now, the most people came to a conclusion that it is
>> very difficult to handle both types by either P4 or eBPF. And since
>> eBPF is part of the kernel, I would like to introduce P4 into kernel
>> as well. Here's a plan:
>>
>> 1) Define P4 intermediate representation
>>    I cannot imagine loading P4 program (c-like syntax text file) into
>>    kernel as is. That means that as the first step, we need find some
>>    intermediate representation. I can imagine someting in a form of AST,
>>    call it "p4ast". I don't really know how to do this exactly though,
>>    it's just an idea.
>>
>>    In the end there would be a userspace precompiler for this:
>>    $ makep4ast example.p4 example.ast
>>
>> 2) Implement p4ast in-kernel interpreter
>>    A kernel module which takes a p4ast and emulates the pipeline.
>>    This can be implemented from scratch. Or, p4ast could be compiled
>>    to eBPF. I know there are already couple of p4>eBPF compilers.
>>    Not sure how feasible it would be to put this compiler in kernel.
>>
>> 3) Expose the p4ast in-kernel interpreter to userspace
>>    As the easiest way I see in to introduce a new TC classifier cls_p4.
>>
>>    This can work in a very similar way cls_bpf is:
>>    $ tc filter add dev eth0 ingress p4 da ast example.ast
>>
>>    The TC cls_p4 will be also used for runtime table manipulation.
>>
>> 4) Offload p4ast programs into hardware
>>    The same p4ast program representation will be passed down
>>    to drivers via existing TC offloading way - ndo_setup_tc.
>>    Drivers will then parse it and setup the hardware
>>    accordingly. Driver will also have possibility to error out
>>    in case it does not support some requested feature.
>>
>> Thoughts? Ideas?
>>
>> Thanks,
>>     Jiri
>>
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-01  8:46                     ` Jiri Pirko
@ 2016-11-01 15:13                       ` John Fastabend
  2016-11-02  8:07                         ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: John Fastabend @ 2016-11-01 15:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

[...]

>>> P4 is ment to program programable hw, not fixed pipeline.
>>>
>>
>> I'm guessing there are no upstream drivers at the moment that support
>> this though right? The rocker universe bits though could leverage this.
> 
> mlxsw. But this is naturaly not implemented yet, as there is no
> infrastructure.

Really? What is re-programmable?

Can the parse graph support arbitrary parse graph?
Can the table topology be reconfigured?
Can new tables be created?
What about "new" actions being defined at configuration time?

Or is this just the normal TCAM configuration of defining key widths and
fields.

> 
> 
>>
>>>
>>>>
>>>>>
>>>>>> since I cannot see how one can put the whole p4 language compiler
>>>>>> into the driver, so this last step of p4ast->hw, I presume, will be
>>>>>> done by firmware, which will be running full compiler in an embedded cpu
>>>>>
>>>>> In case of mlxsw, that compiler would be in driver.
>>>>>
>>>>>
>>>>>> on the switch. To me that's precisely the kernel bypass, since we won't
>>>>>> have a clue what HW capabilities actually are and won't be able to fine
>>>>>> grain control them.
>>>>>> Please correct me if I'm wrong.
>>>>>
>>>>> You are wrong. By your definition, everything has to be figured out in
>>>>> driver and FW does nothing. Otherwise it could do "something else" and
>>>>> that would be a bypass? Does not make any sense to me whatsoever.
>>>>>
>>>>>
>>>>>>
>>>>>>> Plus the thing I cannot imagine in the model you propose is table fillup.
>>>>>>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>>>>>>> API. This is very similar to the original John's Flow-API. And therefore
>>>>>>> a kernel bypass.
>>>>>>
>>>>>> I think John's flow api is a better way to expose mellanox switch capabilities.
>>>>>
>>>>> We are under impression that p4 suits us nicely. But it is not about
>>>>> us, it is about finding the common way to do this.
>>>>>
>>>>
>>>> I'll just poke at my FlowAPI question again. For fixed ASICS what is
>>>> the Flow-API missing. We have a few proof points that show it is both
>>>> sufficient and usable for the handful of use cases we care about.
>>>
>>> Yeah, it is most probably fine. Even for flex ASICs to some point. The
>>> question is how it stands comparing to other alternatives, like p4
>>>
>>
>> Just to be clear the Flow-API _was_ generated from the initial P4 spec.
>> The header files and tools used with it were autogenerated ("compiled"
>> in a loose sense) from the P4 program. The piece I never exposed
>> was the set_* operations to reconfigure running systems. I'm not sure
>> how valuable this is in practice though.
>>
>> Also there is a P4-16 spec that will be released shortly that is more
>> flexible and also more complex.
> 
> Would it be able to easily extend the Flow-API to include the changes?
> 

P4-16 will allow externs, "functions" to execute in the control flow and
possibly inside the parse graph. None of this was considered in the
Flow-API. So none of this is supported.

I still have the question are you trying to push the "programming" of
the device via 'tc' or just the runtime configuration of tables? If it
is just runtime Flow-API is sufficient IMO. If its programming the
device using the complete P4-16 spec than no its not sufficient. But
I don't believe vendors will expose the complete programmability of the
device in the driver, this is going to look more like a fw update than
a runtime change at least on the devices I'm aware of.

> 
>>
>>>
>>>>
>>>>>
>>>>>> I also think it's not fair to call it 'bypass'. I see nothing in it
>>>>>> that justify such 'swear word' ;)
>>>>>
>>>>> John's Flow-API was a kernel bypass. Why? It was a API specifically
>>>>> designed to directly work with HW tables, without kernel being involved.
>>>>
>>>> I don't think that is a fair definition of HW bypass. The SKIP_SW flag
>>>> does exactly that for 'tc' based offloads and it was not rejected.
>>>
>>> No, no, no. You still have possibility to do the same thing in kernel,
>>> same functionality, with the same API. That is a big difference.
>>>
>>>
>>>>
>>>> The _real_ reason that seems to have fallen out of this and other
>>>> discussion is the Flow-API didn't provide an in-kernel translation into
>>>> an emulated patch. Note we always had a usermode translation to eBPF.
>>>> A secondary reason appears to be overhead of adding yet another netlink
>>>> family.
>>>
>>> Yeah. Maybe you remember, back then when Flow-API was being discussed,
>>> I suggested to wrap it under TC as cls_xflows and cls_xflowsaction of
>>> some sort and do in-kernel datapath implementation. I believe that after
>>> that, it would be acceptable.
>>>
>>
>> As I understand the thread here that is exactly the proposal here right?
>> With a discussion around if the structures/etc are sufficient or any
>> alternative representations exist.
> 
> Might be the way, yes. But I fear that with other p4 extensions this
> might not be easy to align with. Therefore I though about something more
> generic, like the p4ast.
> 

Same question as above are we _really_ talking about pushing the entire
programmability of the device via 'tc'. If so we need to have a vendor
say they will support and implement this?

> 
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>> The goal of flow api was to expose HW features to user space, so that
>>>>>> user space can program it. For something simple as mellanox switch
>>>>>> asic it fits perfectly well.
>>>>>
>>>>> Again, this is not mlx-asic-specific. And again, that is a kernel bypass.
>>>>>
>>>>>
>>>>>> Unless I misunderstand the bigger goal of this discussion and it's
>>>>>> about programming ezchip devices.
>>>>>
>>>>> No. For network processors, I believe that BPF is nicely offloadable, no
>>>>> need to do the excercise for that.
>>>>>
>>>>>
>>>>>>
>>>>>> If the goal is to model hw tcam in the linux kernel then just introduce
>>>>>> tcam bpf map type. It will be dog slow in user space, but it will
>>>>>> match exactly what is happnening in the HW and user space can make
>>>>>> sensible trade-offs.
>>>>>
>>>>> No, you got me completely wrong. This is not about the TCAM. This is
>>>>> about differences in the 2 words (p4/bpf).
>>>>> Again, for "p4-ish" devices, you have to translate BPF. And as you
>>>>> noted, it's an instruction set. Very hard if not impossible to parse in
>>>>> order to get back the original semantics.
>>>>>
>>>>
>>>> I think in this discussion "p4-ish" devices means devices with multiple
>>>> tables in a pipeline? Not devices that have programmable/configurable
>>>> pipelines right? And if we get to talking about reconfigurable devices
>>>> I believe this should be done out of band as it typically means
>>>> reloading some ucode, etc.
>>>
>>> I'm talking about both. But I think we should focus on reconfigurable
>>> ones, as we probably won't see that much fixed ones in the future.
>>>
>>
>> hmm maybe but the 10/40/100Gbps devices are going to be around for some
>> time. So we need to ensure these work well.
> 
> Yes, but I would like to emphasize, if we are defining new api
> the primary focus should be on new devices.
> 
> 

What device though. Back to mlxsw question about actually supporting
this stuff.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-10-31  9:39             ` Jiri Pirko
  2016-10-31 16:53               ` John Fastabend
@ 2016-11-02  2:29               ` Daniel Borkmann
  2016-11-02  5:06                 ` Maciej Żenczykowski
  2016-11-02  8:14                 ` Jiri Pirko
  1 sibling, 2 replies; 41+ messages in thread
From: Daniel Borkmann @ 2016-11-02  2:29 UTC (permalink / raw)
  To: Jiri Pirko, Alexei Starovoitov
  Cc: Thomas Graf, John Fastabend, Jakub Kicinski, netdev, davem, jhs,
	roopa, simon.horman, ast, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On 10/31/2016 10:39 AM, Jiri Pirko wrote:
> Sun, Oct 30, 2016 at 11:39:05PM CET, alexei.starovoitov@gmail.com wrote:
>> On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
>>> Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>>>> On 10/30/16 at 08:44am, Jiri Pirko wrote:
>>>>> Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>>>>>> On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>>>>>>> On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>>>>>>>> Hi all.
>>
>> sorry for delay. travelling to KS, so probably missed something in
>> this thread and comments can be totally off...
>>
>> the subject "let's do P4" is imo misleading, since it reads like
>> we don't do P4 at the moment, whereas the opposite is true.
>> Several p4->bpf compilers is a proof.
>
> We don't do p4 in kernel now, we don't do p4 offloading now. That is
> the reason I started this discussion.
>
>>> The network world is divided into 2 general types of hw:
>>> 1) network ASICs - network specific silicon, containing things like TCAM
>>>     These ASICs are suitable to be programmed by P4.
>>
>> i think the opposite is the case in case of P4.
>> when hw asic has tcam it's still far far away from being usable with P4
>> which requires fully programmable protocol parser, arbitrary tables and so on.
>> P4 doesn't even define TCAM as a table type. The p4 program can declare
>> a desired algorithm of search in the table and compiler has to figure out
>> what HW resources to use to satisfy such p4 program.
>>
>>> 2) network processors - basically a general purpose CPUs
>>>     These processors are suitable to be programmed by eBPF.
>>
>> I think this statement is also misleading, since it positions
>> p4 and bpf as competitors whereas that's not the case.
>> p4 is the language. bpf is an instruction set.
>
> I wanted to say that we are having 2 approaches in silicon, 2 different
> paradigms. Sure you can do p4>bpf. But hard to do it the opposite way.
>
>>> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>>>
>>>                                   |
>>>                                   |               +--> ebpf engine
>>>                                   |               |
>>>                                   |               |
>>>                                   |           compilerB
>>>                                   |               ^
>>>                                   |               |
>>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>>                                   |
>>>                         userspace | kernel
>>>                                   |

Sorry for jumping into the middle and the delay (plumbers this week). My
question would be, if the main target is for p4 *offloading* anyway, who
would use this sw fallback path? Mostly for testing purposes?

I'm not sure about compilerB here and the complexity that needs to be
pushed into the kernel along with it. I would assume this would result
in slower code than what the existing P4 -> eBPF front ends for LLVM
would generate since it could perform all kind of optimizations there,
that might not be feasible for doing inside the kernel. Thus, if I'd want
to do that in sw, I'd just use the existing LLVM facilities instead and
go via cls_bpf in that case.

What is your compilerA? Is that part of tc in user space? Maybe linked
against LLVM lib, for example? If you really want some sw path, can't tc
do this transparently from user space instead when it gets a netlink error
that it cannot get offloaded (and thus switch internally to f_bpf's loader)?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-02  2:29               ` Daniel Borkmann
@ 2016-11-02  5:06                 ` Maciej Żenczykowski
  2016-11-02  8:14                 ` Jiri Pirko
  1 sibling, 0 replies; 41+ messages in thread
From: Maciej Żenczykowski @ 2016-11-02  5:06 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jiri Pirko, Alexei Starovoitov, Thomas Graf, John Fastabend,
	Jakub Kicinski, Linux NetDev, David Miller, Jamal Hadi Salim,
	roopa, simon.horman, ast, prem, Hannes Frederic Sowa, Jiri Benc,
	Tom Herbert, mattyk, idosch, eladr, yotamg, nogahf, ogerlitz,
	John W. Linville, Andy Gospodarek, Florian Fainelli

> Sorry for jumping into the middle and the delay (plumbers this week). My
> question would be, if the main target is for p4 *offloading* anyway, who
> would use this sw fallback path? Mostly for testing purposes?
>
> I'm not sure about compilerB here and the complexity that needs to be
> pushed into the kernel along with it. I would assume this would result
> in slower code than what the existing P4 -> eBPF front ends for LLVM
> would generate since it could perform all kind of optimizations there,
> that might not be feasible for doing inside the kernel. Thus, if I'd want
> to do that in sw, I'd just use the existing LLVM facilities instead and
> go via cls_bpf in that case.
>
> What is your compilerA? Is that part of tc in user space? Maybe linked
> against LLVM lib, for example? If you really want some sw path, can't tc
> do this transparently from user space instead when it gets a netlink error
> that it cannot get offloaded (and thus switch internally to f_bpf's loader)?

Since we're jumping in the middle ;-)

Ideally we'd have an interface where some generic like program is
loaded into the kernel,
and the kernel core fetches some sort of generic description of the
hardware capabilities,
translates the program and fits as much of it as it can into the hardware,
possibly all of it, and emulates/executes the rest in software.

ie. if hardware can only match on 5 different 10 byte headers, but we
need to match on 7 different 12 byte headers,
we can still use the hardware to help us dispatch straight into 'check
the last 2 bytes, then the last 2 headers' software emulation code.

or maybe hardware can match, but can't count packets... so we need to
implement counting in sw.

or it can't do all types of encap/decap, so we need to sw encap in
certain cases...

Doing this via extracting such information out of a bpf program seems
pretty hard.

Or maybe I'm overestimating the true difficulty of taking a bpf
program and extracting it into a TCAM...
Maybe if the bpf program has a more 'standard' layout
(ie. a tree doing packet parsing/matching, with 'actions' in the
leaves) then it's not so hard?...

Obviously real hardware has significantly more capabilities then just
a tcam at the front of the pipeline...

I'm afraid I lack the knowledge of what the real capabilities of
current (and future...) hardware are...

But maybe we could come up with some sufficiently generic description
of *what* we want accomplished
instead of the precise specifics of how.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-01 15:13                       ` John Fastabend
@ 2016-11-02  8:07                         ` Jiri Pirko
  2016-11-02 15:18                           ` John Fastabend
  0 siblings, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-11-02  8:07 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Tue, Nov 01, 2016 at 04:13:32PM CET, john.fastabend@gmail.com wrote:
>[...]
>
>>>> P4 is ment to program programable hw, not fixed pipeline.
>>>>
>>>
>>> I'm guessing there are no upstream drivers at the moment that support
>>> this though right? The rocker universe bits though could leverage this.
>> 
>> mlxsw. But this is naturaly not implemented yet, as there is no
>> infrastructure.
>
>Really? What is re-programmable?
>
>Can the parse graph support arbitrary parse graph?
>Can the table topology be reconfigured?
>Can new tables be created?
>What about "new" actions being defined at configuration time?
>
>Or is this just the normal TCAM configuration of defining key widths and
>fields.

At this point TCAM configuration.


>
>> 
>> 
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> since I cannot see how one can put the whole p4 language compiler
>>>>>>> into the driver, so this last step of p4ast->hw, I presume, will be
>>>>>>> done by firmware, which will be running full compiler in an embedded cpu
>>>>>>
>>>>>> In case of mlxsw, that compiler would be in driver.
>>>>>>
>>>>>>
>>>>>>> on the switch. To me that's precisely the kernel bypass, since we won't
>>>>>>> have a clue what HW capabilities actually are and won't be able to fine
>>>>>>> grain control them.
>>>>>>> Please correct me if I'm wrong.
>>>>>>
>>>>>> You are wrong. By your definition, everything has to be figured out in
>>>>>> driver and FW does nothing. Otherwise it could do "something else" and
>>>>>> that would be a bypass? Does not make any sense to me whatsoever.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Plus the thing I cannot imagine in the model you propose is table fillup.
>>>>>>>> For ebpf, you use maps. For p4 you would have to have a separate HW-only
>>>>>>>> API. This is very similar to the original John's Flow-API. And therefore
>>>>>>>> a kernel bypass.
>>>>>>>
>>>>>>> I think John's flow api is a better way to expose mellanox switch capabilities.
>>>>>>
>>>>>> We are under impression that p4 suits us nicely. But it is not about
>>>>>> us, it is about finding the common way to do this.
>>>>>>
>>>>>
>>>>> I'll just poke at my FlowAPI question again. For fixed ASICS what is
>>>>> the Flow-API missing. We have a few proof points that show it is both
>>>>> sufficient and usable for the handful of use cases we care about.
>>>>
>>>> Yeah, it is most probably fine. Even for flex ASICs to some point. The
>>>> question is how it stands comparing to other alternatives, like p4
>>>>
>>>
>>> Just to be clear the Flow-API _was_ generated from the initial P4 spec.
>>> The header files and tools used with it were autogenerated ("compiled"
>>> in a loose sense) from the P4 program. The piece I never exposed
>>> was the set_* operations to reconfigure running systems. I'm not sure
>>> how valuable this is in practice though.
>>>
>>> Also there is a P4-16 spec that will be released shortly that is more
>>> flexible and also more complex.
>> 
>> Would it be able to easily extend the Flow-API to include the changes?
>> 
>
>P4-16 will allow externs, "functions" to execute in the control flow and
>possibly inside the parse graph. None of this was considered in the
>Flow-API. So none of this is supported.
>
>I still have the question are you trying to push the "programming" of
>the device via 'tc' or just the runtime configuration of tables? If it
>is just runtime Flow-API is sufficient IMO. If its programming the
>device using the complete P4-16 spec than no its not sufficient. But

Sure we need both.


>I don't believe vendors will expose the complete programmability of the
>device in the driver, this is going to look more like a fw update than
>a runtime change at least on the devices I'm aware of.

Depends on driver. I think it is fine if driver processed it into come
hw configuration sequence or it simply pushed the program down to fw.
Both usecases are legit.


>
>> 
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> I also think it's not fair to call it 'bypass'. I see nothing in it
>>>>>>> that justify such 'swear word' ;)
>>>>>>
>>>>>> John's Flow-API was a kernel bypass. Why? It was a API specifically
>>>>>> designed to directly work with HW tables, without kernel being involved.
>>>>>
>>>>> I don't think that is a fair definition of HW bypass. The SKIP_SW flag
>>>>> does exactly that for 'tc' based offloads and it was not rejected.
>>>>
>>>> No, no, no. You still have possibility to do the same thing in kernel,
>>>> same functionality, with the same API. That is a big difference.
>>>>
>>>>
>>>>>
>>>>> The _real_ reason that seems to have fallen out of this and other
>>>>> discussion is the Flow-API didn't provide an in-kernel translation into
>>>>> an emulated patch. Note we always had a usermode translation to eBPF.
>>>>> A secondary reason appears to be overhead of adding yet another netlink
>>>>> family.
>>>>
>>>> Yeah. Maybe you remember, back then when Flow-API was being discussed,
>>>> I suggested to wrap it under TC as cls_xflows and cls_xflowsaction of
>>>> some sort and do in-kernel datapath implementation. I believe that after
>>>> that, it would be acceptable.
>>>>
>>>
>>> As I understand the thread here that is exactly the proposal here right?
>>> With a discussion around if the structures/etc are sufficient or any
>>> alternative representations exist.
>> 
>> Might be the way, yes. But I fear that with other p4 extensions this
>> might not be easy to align with. Therefore I though about something more
>> generic, like the p4ast.
>> 
>
>Same question as above are we _really_ talking about pushing the entire
>programmability of the device via 'tc'. If so we need to have a vendor
>say they will support and implement this?

We need some API, and I believe that TC is perfectly suitable for that.
Why do you think it's a problem?



>
>> 
>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> The goal of flow api was to expose HW features to user space, so that
>>>>>>> user space can program it. For something simple as mellanox switch
>>>>>>> asic it fits perfectly well.
>>>>>>
>>>>>> Again, this is not mlx-asic-specific. And again, that is a kernel bypass.
>>>>>>
>>>>>>
>>>>>>> Unless I misunderstand the bigger goal of this discussion and it's
>>>>>>> about programming ezchip devices.
>>>>>>
>>>>>> No. For network processors, I believe that BPF is nicely offloadable, no
>>>>>> need to do the excercise for that.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> If the goal is to model hw tcam in the linux kernel then just introduce
>>>>>>> tcam bpf map type. It will be dog slow in user space, but it will
>>>>>>> match exactly what is happnening in the HW and user space can make
>>>>>>> sensible trade-offs.
>>>>>>
>>>>>> No, you got me completely wrong. This is not about the TCAM. This is
>>>>>> about differences in the 2 words (p4/bpf).
>>>>>> Again, for "p4-ish" devices, you have to translate BPF. And as you
>>>>>> noted, it's an instruction set. Very hard if not impossible to parse in
>>>>>> order to get back the original semantics.
>>>>>>
>>>>>
>>>>> I think in this discussion "p4-ish" devices means devices with multiple
>>>>> tables in a pipeline? Not devices that have programmable/configurable
>>>>> pipelines right? And if we get to talking about reconfigurable devices
>>>>> I believe this should be done out of band as it typically means
>>>>> reloading some ucode, etc.
>>>>
>>>> I'm talking about both. But I think we should focus on reconfigurable
>>>> ones, as we probably won't see that much fixed ones in the future.
>>>>
>>>
>>> hmm maybe but the 10/40/100Gbps devices are going to be around for some
>>> time. So we need to ensure these work well.
>> 
>> Yes, but I would like to emphasize, if we are defining new api
>> the primary focus should be on new devices.
>> 
>> 
>
>What device though. Back to mlxsw question about actually supporting
>this stuff.
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-02  2:29               ` Daniel Borkmann
  2016-11-02  5:06                 ` Maciej Żenczykowski
@ 2016-11-02  8:14                 ` Jiri Pirko
  2016-11-02 15:22                   ` John Fastabend
  1 sibling, 1 reply; 41+ messages in thread
From: Jiri Pirko @ 2016-11-02  8:14 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Thomas Graf, John Fastabend, Jakub Kicinski,
	netdev, davem, jhs, roopa, simon.horman, ast, prem, hannes,
	jbenc, tom, mattyk, idosch, eladr, yotamg, nogahf, ogerlitz,
	linville, andy, f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Wed, Nov 02, 2016 at 03:29:23AM CET, daniel@iogearbox.net wrote:
>On 10/31/2016 10:39 AM, Jiri Pirko wrote:
>> Sun, Oct 30, 2016 at 11:39:05PM CET, alexei.starovoitov@gmail.com wrote:
>> > On Sun, Oct 30, 2016 at 05:38:36PM +0100, Jiri Pirko wrote:
>> > > Sun, Oct 30, 2016 at 11:26:49AM CET, tgraf@suug.ch wrote:
>> > > > On 10/30/16 at 08:44am, Jiri Pirko wrote:
>> > > > > Sat, Oct 29, 2016 at 06:46:21PM CEST, john.fastabend@gmail.com wrote:
>> > > > > > On 16-10-29 07:49 AM, Jakub Kicinski wrote:
>> > > > > > > On Sat, 29 Oct 2016 09:53:28 +0200, Jiri Pirko wrote:
>> > > > > > > > Hi all.
>> > 
>> > sorry for delay. travelling to KS, so probably missed something in
>> > this thread and comments can be totally off...
>> > 
>> > the subject "let's do P4" is imo misleading, since it reads like
>> > we don't do P4 at the moment, whereas the opposite is true.
>> > Several p4->bpf compilers is a proof.
>> 
>> We don't do p4 in kernel now, we don't do p4 offloading now. That is
>> the reason I started this discussion.
>> 
>> > > The network world is divided into 2 general types of hw:
>> > > 1) network ASICs - network specific silicon, containing things like TCAM
>> > >     These ASICs are suitable to be programmed by P4.
>> > 
>> > i think the opposite is the case in case of P4.
>> > when hw asic has tcam it's still far far away from being usable with P4
>> > which requires fully programmable protocol parser, arbitrary tables and so on.
>> > P4 doesn't even define TCAM as a table type. The p4 program can declare
>> > a desired algorithm of search in the table and compiler has to figure out
>> > what HW resources to use to satisfy such p4 program.
>> > 
>> > > 2) network processors - basically a general purpose CPUs
>> > >     These processors are suitable to be programmed by eBPF.
>> > 
>> > I think this statement is also misleading, since it positions
>> > p4 and bpf as competitors whereas that's not the case.
>> > p4 is the language. bpf is an instruction set.
>> 
>> I wanted to say that we are having 2 approaches in silicon, 2 different
>> paradigms. Sure you can do p4>bpf. But hard to do it the opposite way.
>> 
>> > > Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>> > > 
>> > >                                   |
>> > >                                   |               +--> ebpf engine
>> > >                                   |               |
>> > >                                   |               |
>> > >                                   |           compilerB
>> > >                                   |               ^
>> > >                                   |               |
>> > > p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>> > >                                   |
>> > >                         userspace | kernel
>> > >                                   |
>
>Sorry for jumping into the middle and the delay (plumbers this week). My
>question would be, if the main target is for p4 *offloading* anyway, who
>would use this sw fallback path? Mostly for testing purposes?

Development and testing purposes, yes.


>
>I'm not sure about compilerB here and the complexity that needs to be
>pushed into the kernel along with it. I would assume this would result
>in slower code than what the existing P4 -> eBPF front ends for LLVM
>would generate since it could perform all kind of optimizations there,

The complexity would be similar to compilerC. For compilerB,
optimizations does not really matter, as it it for testing mainly.


>that might not be feasible for doing inside the kernel. Thus, if I'd want
>to do that in sw, I'd just use the existing LLVM facilities instead and
>go via cls_bpf in that case.
>
>What is your compilerA? Is that part of tc in user space? Maybe linked

It is something that transforms original p4 source to some intermediate
form, easy to be processed by in-kernel compilers.


>against LLVM lib, for example? If you really want some sw path, can't tc
>do this transparently from user space instead when it gets a netlink error
>that it cannot get offloaded (and thus switch internally to f_bpf's loader)?

In real life, user will most probably use p4 for hw programming, but the
sw fallback will be done in bpf directly. In that case, he would use
cls_bfp SKIP_HW
cls_p4 SKIP_SW

But in order to allow cls_p4 offloading to hw, we need in-kernel
interpreter. That is purpose of compilerB to take agvantage of bpf, but
the in-kernel interpreter could be implemented differently.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-02  8:07                         ` Jiri Pirko
@ 2016-11-02 15:18                           ` John Fastabend
  2016-11-02 15:23                             ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: John Fastabend @ 2016-11-02 15:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

On 16-11-02 01:07 AM, Jiri Pirko wrote:
> Tue, Nov 01, 2016 at 04:13:32PM CET, john.fastabend@gmail.com wrote:
>> [...]
>>
>>>>> P4 is ment to program programable hw, not fixed pipeline.
>>>>>
>>>>
>>>> I'm guessing there are no upstream drivers at the moment that support
>>>> this though right? The rocker universe bits though could leverage this.
>>>
>>> mlxsw. But this is naturaly not implemented yet, as there is no
>>> infrastructure.
>>
>> Really? What is re-programmable?
>>
>> Can the parse graph support arbitrary parse graph?
>> Can the table topology be reconfigured?
>> Can new tables be created?
>> What about "new" actions being defined at configuration time?
>>
>> Or is this just the normal TCAM configuration of defining key widths and
>> fields.
> 
> At this point TCAM configuration.
> 

OK so before we go down the path to enable a full fledged P4 interface
we need a consumer for sure. We shouldn't add all this complexity until
someone steps up to use it. A runtime API is sufficient for TCAM config.

[...]

>>
>> P4-16 will allow externs, "functions" to execute in the control flow and
>> possibly inside the parse graph. None of this was considered in the
>> Flow-API. So none of this is supported.
>>
>> I still have the question are you trying to push the "programming" of
>> the device via 'tc' or just the runtime configuration of tables? If it
>> is just runtime Flow-API is sufficient IMO. If its programming the
>> device using the complete P4-16 spec than no its not sufficient. But
> 
> Sure we need both.
> 

See above.

> 
>> I don't believe vendors will expose the complete programmability of the
>> device in the driver, this is going to look more like a fw update than
>> a runtime change at least on the devices I'm aware of.
> 
> Depends on driver. I think it is fine if driver processed it into come
> hw configuration sequence or it simply pushed the program down to fw.
> Both usecases are legit.
> 

At this point I don't think the entire P4 capabilities will be exposed
as an API but more along the lines of an FPGA bitstream or firmware
update.


[...]

>>
>> Same question as above are we _really_ talking about pushing the entire
>> programmability of the device via 'tc'. If so we need to have a vendor
>> say they will support and implement this?
> 
> We need some API, and I believe that TC is perfectly suitable for that.
> Why do you think it's a problem?
> 

For runtime configuration completely agree. For device configuration
I don't see the advantage of adding an entire device specific compiler
in the driver. The device is a set of CAMs, TCAMs, ALUs, instruction
caches, etc. its not like a typical NIC/switch where you just bang
some registers. Unless its all done in firmware but that creates an
entirely different set of problems like how to update your compiler.

Bottom line we need to have a proof point of a driver in kernel
to see exactly how a P4 configuration would work. Again runtime config
and device topology/capabilities discovery I'm completely on board.

Thanks,
John

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-02  8:14                 ` Jiri Pirko
@ 2016-11-02 15:22                   ` John Fastabend
  2016-11-02 15:27                     ` Jiri Pirko
  0 siblings, 1 reply; 41+ messages in thread
From: John Fastabend @ 2016-11-02 15:22 UTC (permalink / raw)
  To: Jiri Pirko, Daniel Borkmann
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, prem, hannes, jbenc, tom, mattyk,
	idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

[...]

>>>>> Exactly. Following drawing shows p4 pipeline setup for SW and Hw:
>>>>>
>>>>>                                   |
>>>>>                                   |               +--> ebpf engine
>>>>>                                   |               |
>>>>>                                   |               |
>>>>>                                   |           compilerB
>>>>>                                   |               ^
>>>>>                                   |               |
>>>>> p4src --> compilerA --> p4ast --TCNL--> cls_p4 --+-> driver -> compilerC -> HW
>>>>>                                   |
>>>>>                         userspace | kernel
>>>>>                                   |
>>
>> Sorry for jumping into the middle and the delay (plumbers this week). My
>> question would be, if the main target is for p4 *offloading* anyway, who
>> would use this sw fallback path? Mostly for testing purposes?
> 
> Development and testing purposes, yes.
> 
> 
>>
>> I'm not sure about compilerB here and the complexity that needs to be
>> pushed into the kernel along with it. I would assume this would result
>> in slower code than what the existing P4 -> eBPF front ends for LLVM
>> would generate since it could perform all kind of optimizations there,
> 
> The complexity would be similar to compilerC. For compilerB,
> optimizations does not really matter, as it it for testing mainly.
> 
> 
>> that might not be feasible for doing inside the kernel. Thus, if I'd want
>> to do that in sw, I'd just use the existing LLVM facilities instead and
>> go via cls_bpf in that case.
>>
>> What is your compilerA? Is that part of tc in user space? Maybe linked
> 
> It is something that transforms original p4 source to some intermediate
> form, easy to be processed by in-kernel compilers.
> 
> 
>> against LLVM lib, for example? If you really want some sw path, can't tc
>> do this transparently from user space instead when it gets a netlink error
>> that it cannot get offloaded (and thus switch internally to f_bpf's loader)?
> 
> In real life, user will most probably use p4 for hw programming, but the
> sw fallback will be done in bpf directly. In that case, he would use
> cls_bfp SKIP_HW
> cls_p4 SKIP_SW
> 
> But in order to allow cls_p4 offloading to hw, we need in-kernel
> interpreter. That is purpose of compilerB to take agvantage of bpf, but
> the in-kernel interpreter could be implemented differently.
> 

But this is the issue. We openly acknowledge it wont actually be used.
We have multiple user space compilers that generate at least half way
reasonable ebpf code that is being used in real deployments and
works great for testing. This looks like pure overhead to satisfy this
hw/sw parity checkbox and I can't see why anyone would use it or even
maintain it. Looks like a checkbox and I like to avoid useless work that
is likely to bit rot.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-02 15:18                           ` John Fastabend
@ 2016-11-02 15:23                             ` Jiri Pirko
  0 siblings, 0 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-11-02 15:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexei Starovoitov, Thomas Graf, Jakub Kicinski, netdev, davem,
	jhs, roopa, simon.horman, ast, daniel, prem, hannes, jbenc, tom,
	mattyk, idosch, eladr, yotamg, nogahf, ogerlitz, linville, andy,
	f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Wed, Nov 02, 2016 at 04:18:06PM CET, john.fastabend@gmail.com wrote:
>On 16-11-02 01:07 AM, Jiri Pirko wrote:
>> Tue, Nov 01, 2016 at 04:13:32PM CET, john.fastabend@gmail.com wrote:

[...]


>[...]>
>>>
>>> Same question as above are we _really_ talking about pushing the entire
>>> programmability of the device via 'tc'. If so we need to have a vendor
>>> say they will support and implement this?
>> 
>> We need some API, and I believe that TC is perfectly suitable for that.
>> Why do you think it's a problem?
>> 
>
>For runtime configuration completely agree. For device configuration
>I don't see the advantage of adding an entire device specific compiler
>in the driver. The device is a set of CAMs, TCAMs, ALUs, instruction
>caches, etc. its not like a typical NIC/switch where you just bang
>some registers. Unless its all done in firmware but that creates an
>entirely different set of problems like how to update your compiler.
>
>Bottom line we need to have a proof point of a driver in kernel
>to see exactly how a P4 configuration would work. Again runtime config
>and device topology/capabilities discovery I'm completely on board.

I think we need to implement P4 world in rocker. Any volunteer? :)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Let's do P4
  2016-11-02 15:22                   ` John Fastabend
@ 2016-11-02 15:27                     ` Jiri Pirko
  0 siblings, 0 replies; 41+ messages in thread
From: Jiri Pirko @ 2016-11-02 15:27 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Alexei Starovoitov, Thomas Graf, Jakub Kicinski,
	netdev, davem, jhs, roopa, simon.horman, ast, prem, hannes,
	jbenc, tom, mattyk, idosch, eladr, yotamg, nogahf, ogerlitz,
	linville, andy, f.fainelli, dsa, vivien.didelot, andrew, ivecera,
	Maciej Żenczykowski

Wed, Nov 02, 2016 at 04:22:50PM CET, john.fastabend@gmail.com wrote:

[...]

>>>
>>> What is your compilerA? Is that part of tc in user space? Maybe linked
>> 
>> It is something that transforms original p4 source to some intermediate
>> form, easy to be processed by in-kernel compilers.
>> 
>> 
>>> against LLVM lib, for example? If you really want some sw path, can't tc
>>> do this transparently from user space instead when it gets a netlink error
>>> that it cannot get offloaded (and thus switch internally to f_bpf's loader)?
>> 
>> In real life, user will most probably use p4 for hw programming, but the
>> sw fallback will be done in bpf directly. In that case, he would use
>> cls_bfp SKIP_HW
>> cls_p4 SKIP_SW
>> 
>> But in order to allow cls_p4 offloading to hw, we need in-kernel
>> interpreter. That is purpose of compilerB to take agvantage of bpf, but
>> the in-kernel interpreter could be implemented differently.
>> 
>
>But this is the issue. We openly acknowledge it wont actually be used.
>We have multiple user space compilers that generate at least half way
>reasonable ebpf code that is being used in real deployments and
>works great for testing. This looks like pure overhead to satisfy this
>hw/sw parity checkbox and I can't see why anyone would use it or even
>maintain it. Looks like a checkbox and I like to avoid useless work that
>is likely to bit rot.

That's how it works I'm afraid, unless something changed from the last
time we discussed this. Without in-kernel implementation, it's a bypass.

Dave?

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2016-11-02 15:27 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-29  7:53 Let's do P4 Jiri Pirko
2016-10-29  9:39 ` Thomas Graf
2016-10-29 10:10   ` Jiri Pirko
2016-10-29 11:15     ` Thomas Graf
2016-10-29 11:28       ` Jiri Pirko
2016-10-29 12:09         ` Thomas Graf
2016-10-29 13:58           ` Jiri Pirko
2016-10-29 14:54             ` Jakub Kicinski
2016-10-29 14:58               ` Jiri Pirko
2016-10-29 14:49 ` Jakub Kicinski
2016-10-29 14:55   ` Jiri Pirko
2016-10-29 16:46   ` John Fastabend
2016-10-30  7:44     ` Jiri Pirko
2016-10-30 10:26       ` Thomas Graf
2016-10-30 16:38         ` Jiri Pirko
2016-10-30 17:45           ` Jakub Kicinski
2016-10-30 18:01             ` Jiri Pirko
2016-10-30 18:44               ` Jakub Kicinski
2016-10-30 19:56                 ` Jiri Pirko
2016-10-30 21:14                   ` John Fastabend
2016-10-30 22:39           ` Alexei Starovoitov
2016-10-31  6:03             ` Maciej Żenczykowski
2016-10-31  7:47               ` Jiri Pirko
2016-10-31  9:39             ` Jiri Pirko
2016-10-31 16:53               ` John Fastabend
2016-10-31 17:12                 ` Jiri Pirko
2016-10-31 18:32                   ` Hannes Frederic Sowa
2016-10-31 19:35                   ` John Fastabend
2016-11-01  8:46                     ` Jiri Pirko
2016-11-01 15:13                       ` John Fastabend
2016-11-02  8:07                         ` Jiri Pirko
2016-11-02 15:18                           ` John Fastabend
2016-11-02 15:23                             ` Jiri Pirko
2016-11-02  2:29               ` Daniel Borkmann
2016-11-02  5:06                 ` Maciej Żenczykowski
2016-11-02  8:14                 ` Jiri Pirko
2016-11-02 15:22                   ` John Fastabend
2016-11-02 15:27                     ` Jiri Pirko
2016-10-30 20:54       ` John Fastabend
2016-11-01 11:57 ` Jamal Hadi Salim
2016-11-01 15:03   ` John Fastabend

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.