* XDP seeking input from NIC hardware vendors
@ 2016-07-07 10:42 Jesper Dangaard Brouer via iovisor-dev
  2016-07-07 15:18 ` Fastabend, John R
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-07-07 10:42 UTC (permalink / raw)
  To: iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy
  Cc: Jakub Kicinski, netdev-u79uwXL29TY76Z2rM5mHXA, John Fastabend,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha,
	Tariq Toukan


Would it make sense, from a hardware point of view, to split the XDP
eBPF program into two stages?

 Stage-1: Filter (restricted eBPF / no-helper calls)
 Stage-2: Program

Then the HW can choose to offload stage-1 "filter", and keep the
likely more advanced stage-2 on the kernel side.  Do HW vendors see a
benefit of this approach?


The generic problem I'm trying to solve is parsing.  That is, the
first step in every XDP program will be to parse the packet data,
in order to determine whether this is a packet the XDP program should
process.

Actions from stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that
makes sense for the stage-2 program, e.g. the protocol id and/or data offset.
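
For concreteness, a rough sketch of such a stage-1 "filter" in
restricted C might look like the following (purely illustrative: the
S1_* verdicts and the packed cookie layout are assumptions for this
sketch, not an existing API):

/* Illustrative stage-1 "filter": parse only, no helper calls.
 * Verdicts and cookie layout are assumptions for this sketch. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

enum s1_verdict {
	S1_DROP  = 0,	/* like XDP_DROP: early drop */
	S1_PASS  = 1,	/* like XDP_PASS: normal netstack */
	S1_MATCH = 2,	/* hand off to the stage-2 program */
};

/* Returns the verdict in the low byte; on MATCH, the upper bits carry
 * an opaque cookie (here: L4 protocol plus the offset where parsing
 * stopped) for the stage-2 program. */
static __always_inline __u64 stage1_filter(void *data, void *data_end)
{
	struct ethhdr *eth = data;
	struct iphdr *iph = (void *)(eth + 1);

	if ((void *)(eth + 1) > data_end)
		return S1_DROP;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return S1_PASS;		/* not ours: normal netstack */
	if ((void *)(iph + 1) > data_end)
		return S1_DROP;
	if (iph->protocol != IPPROTO_UDP)
		return S1_PASS;

	return S1_MATCH
	       | ((__u64)iph->protocol << 8)
	       | ((__u64)(sizeof(*eth) + sizeof(*iph)) << 16);
}

SEC("xdp")
int stage1_only(struct xdp_md *ctx)
{
	__u64 v = stage1_filter((void *)(long)ctx->data,
				(void *)(long)ctx->data_end);

	/* Stage-2 would run here on S1_MATCH, using the cookie in v >> 8. */
	return (v & 0xff) == S1_DROP ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";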

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: XDP seeking input from NIC hardware vendors
  2016-07-07 10:42 XDP seeking input from NIC hardware vendors Jesper Dangaard Brouer via iovisor-dev
@ 2016-07-07 15:18 ` Fastabend, John R
       [not found]   ` <D6BB30FE66EA894C9F13C9E3CDDF00F564E5FB81-5FK+k9557ZBqS6EAlXoojrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Fastabend, John R @ 2016-07-07 15:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, iovisor-dev
  Cc: Brenden Blanco, Alexei Starovoitov, Rana Shahout, Ari Saha,
	Tariq Toukan, Or Gerlitz, netdev, Simon Horman, Simon Horman,
	Jakub Kicinski, Edward Cree

Hi Jesper,

I have done some previous work on proprietary systems where we used hardware to do the classification/parsing, then passed a cookie to the software, which used the cookie to look up a program to run on the packet. When your programs are structured as a bunch of parsing followed by some actions, this can provide real performance benefits. Also, a lot of existing hardware supports this today, assuming you use headers the hardware "knows" about. It's a natural model for hardware that uses a parser followed by tcam/cam/sram/etc. lookup tables.

If the goal is just to separate XDP traffic from non-XDP traffic, you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs, and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to that netdev. This is how we support multiple networking stacks on one device, by the way; it is called the bifurcated driver. It's not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director, and then you would have a stack of XDP programs: one running in hardware and a set running on the queues in software.

The other interesting thing would be to do more than just packet steering and actually run a more complete XDP program. Netronome supports this, right? The question I have, though, is whether this is a stack of XDP programs, one or more designated for hardware and some running in software, perhaps with some annotation in the program so the hardware JIT knows where to place programs, or whether we expect the JIT itself to try to decide what is best to offload. I think the easiest thing to start with is to annotate the programs.

Also, as far as I know, a lot of hardware can stick extra data to the front or end of a packet, so you could push metadata calculated by the program there in a generic way, without having to extend the XDP-defined metadata structures. Another option is to DMA the metadata to a specified address. With this metadata, the consumer/producer XDP programs have to agree on the format, but no one else does.
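
To illustrate, a consumer of HW-prepended metadata might look roughly
like the sketch below (the struct hw_meta layout is an assumption; it is
only a contract between the producer and this program):

/* Sketch of a consumer of metadata prepended to the frame by HW or by an
 * earlier program.  struct hw_meta is an assumed, agreed-upon layout. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct hw_meta {
	__u32 flow_hash;	/* e.g. the NIC parser's flow hash */
	__u16 l4_offset;	/* where the L4 header starts in the frame */
	__u16 flags;
};

SEC("xdp")
int consume_hw_meta(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct hw_meta *meta = data;

	/* The producer is expected to always prepend the metadata. */
	if ((void *)(meta + 1) > data_end)
		return XDP_ABORTED;

	if (meta->flags & 0x1)		/* producer marked it uninteresting */
		goto pass;

	/* ... act on meta->flow_hash / meta->l4_offset here ... */
	return XDP_DROP;

pass:
	/* Strip the metadata so the stack sees a normal frame. */
	if (bpf_xdp_adjust_head(ctx, (int)sizeof(*meta)))
		return XDP_ABORTED;
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";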

FWIW, I was hoping to get some data to show performance overhead vs. how deep we parse into the packets. I just won't have time to get to it for a while, but that could tell us how much perf gain the hardware could provide.

Thanks,
John

-----Original Message-----
From: Jesper Dangaard Brouer [mailto:brouer@redhat.com] 
Sent: Thursday, July 7, 2016 3:43 AM
To: iovisor-dev@lists.iovisor.org
Cc: brouer@redhat.com; Brenden Blanco <bblanco@plumgrid.com>; Alexei Starovoitov <alexei.starovoitov@gmail.com>; Rana Shahout <ranas@mellanox.com>; Ari Saha <as754m@att.com>; Tariq Toukan <tariqt@mellanox.com>; Or Gerlitz <ogerlitz@mellanox.com>; netdev@vger.kernel.org; Simon Horman <horms@verge.net.au>; Simon Horman <simon.horman@netronome.com>; Jakub Kicinski <jakub.kicinski@netronome.com>; Edward Cree <ecree@solarflare.com>; Fastabend, John R <john.r.fastabend@intel.com>
Subject: XDP seeking input from NIC hardware vendors


Would it make sense, from a hardware point of view, to split the XDP eBPF program into two stages?

 Stage-1: Filter (restricted eBPF / no-helper calls)
 Stage-2: Program

Then the HW can choose to offload stage-1 "filter", and keep the likely more advanced stage-2 on the kernel side.  Do HW vendors see a benefit of this approach?


The generic problem I'm trying to solve is parsing. That is, the first step in every XDP program will be to parse the packet data, in order to determine whether this is a packet the XDP program should process.

Actions from stage-1 "filter" program:
 - DROP (like XDP_DROP, early drop)
 - PASS (like XDP_PASS, normal netstack)
 - MATCH (call stage-2, likely carry-over opaque return code)

The MATCH action should likely carry over an opaque return code that makes sense for the stage-2 program, e.g. the protocol id and/or data offset.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]   ` <D6BB30FE66EA894C9F13C9E3CDDF00F564E5FB81-5FK+k9557ZBqS6EAlXoojrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2016-07-07 16:12     ` Jakub Kicinski via iovisor-dev
  2016-07-07 17:53       ` Tom Herbert via iovisor-dev
  2016-07-08  2:22     ` Alexei Starovoitov via iovisor-dev
  1 sibling, 1 reply; 23+ messages in thread
From: Jakub Kicinski via iovisor-dev @ 2016-07-07 16:12 UTC (permalink / raw)
  To: Fastabend, John R
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Edward Cree,
	Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
> The other interesting thing would be to do more than just packet
> steering but actually run a more complete XDP program. Netronome
> supports this right. The question I have though is this a stacked of
> XDP programs one or more designated for hardware and some running in
> software perhaps with some annotation in the program so the hardware
> JIT knows where to place programs or do we expect the JIT itself to
> try and decide what is best to offload. I think the easiest to start
> with is to annotate the programs.
> 
> Also as far as I know a lot of hardware can stick extra data to the
> front or end of a packet so you could push metadata calculated by the
> program here in a generic way without having to extend XDP defined
> metadata structures. Another option is to DMA the metadata to a
> specified address. With this metadata the consumer/producer XDP
> programs have to agree on the format but no one else.

Yes!

At the XDP summit we were discussing pipelining XDP programs in
general, with different stages of the pipeline potentially using
specific hardware capabilities or even being directly mappable onto
fixed HW functions.

Designating parsing as one of the specialized blocks makes sense in the
long run, probably as the first stage, with recirculation possible.  We
also have some parsing HW we could utilize at some point.  However, I'm
worried that it's too early to impose constraints and APIs.  I agree
that we should first set a standard way to pass metadata across tail
calls to facilitate any form of pipelining, regardless of which parts
of the pipeline the HW is able to offload.
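
As a non-standard stopgap, metadata can already be carried across a
tail call through a single-slot per-CPU array used as scratch space; a
rough sketch (struct stage_meta and the map names are made up for this
example):

/* Rough sketch: pass per-packet metadata to the next tail-called stage
 * via a per-CPU scratch slot.  Layout and names are illustrative only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct stage_meta {
	__u32 l3_proto;		/* filled in by the parsing stage */
	__u32 l4_offset;
};

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct stage_meta);
} scratch SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
	__uint(max_entries, 4);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} stages SEC(".maps");

SEC("xdp")
int stage_parse(struct xdp_md *ctx)
{
	__u32 zero = 0;
	struct stage_meta *m = bpf_map_lookup_elem(&scratch, &zero);

	if (!m)
		return XDP_ABORTED;

	m->l3_proto = 0x0800;	/* IPv4; a real stage would parse this */
	m->l4_offset = 34;

	bpf_tail_call(ctx, &stages, 1);	/* jump to the next pipeline stage */
	return XDP_PASS;		/* reached only if slot 1 is empty */
}

char _license[] SEC("license") = "GPL";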

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
  2016-07-07 16:12     ` Jakub Kicinski via iovisor-dev
@ 2016-07-07 17:53       ` Tom Herbert via iovisor-dev
       [not found]         ` <CALx6S36BADKByJAYQLMXBx1NEDaqn6fdqsCk-OdgNo5vgHrO1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-07-07 17:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski
<jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
> On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
>> The other interesting thing would be to do more than just packet
>> steering but actually run a more complete XDP program. Netronome
>> supports this right. The question I have though is this a stacked of
>> XDP programs one or more designated for hardware and some running in
>> software perhaps with some annotation in the program so the hardware
>> JIT knows where to place programs or do we expect the JIT itself to
>> try and decide what is best to offload. I think the easiest to start
>> with is to annotate the programs.
>>
>> Also as far as I know a lot of hardware can stick extra data to the
>> front or end of a packet so you could push metadata calculated by the
>> program here in a generic way without having to extend XDP defined
>> metadata structures. Another option is to DMA the metadata to a
>> specified address. With this metadata the consumer/producer XDP
>> programs have to agree on the format but no one else.
>
> Yes!
>
> At the XDP summit we were discussing pipe-lining XDP programs in
> general, with different stages of the pipeline potentially using
> specific hardware capabilities or even being directly mappable on
> fixed HW functions.
>
> Designating parsing as one of specialized blocks makes sense in a long
> run, probably at the first stage with recirculation possible.  We also
> have some parsing HW we could utilize at some point.  However, I'm
> worried that it's too early to impose constraints and APIs.  I agree
> that we should first set a standard way to pass metadata across tail
> calls to facilitate any form of pipe lining, regardless of which parts
> of pipeline HW is able to offload.

+1

I don't see any reason why XDP programs can't be turned into a pipeline,
but that is an implementation based on the output of one program being
the input of the next.  While XDP may work with a pipeline, it does not
require it or define one. This makes XDP different from P4 and the
match-action paradigm.

Tom

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]         ` <CALx6S36BADKByJAYQLMXBx1NEDaqn6fdqsCk-OdgNo5vgHrO1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-07-07 21:33           ` John Fastabend via iovisor-dev
  0 siblings, 0 replies; 23+ messages in thread
From: John Fastabend via iovisor-dev @ 2016-07-07 21:33 UTC (permalink / raw)
  To: Tom Herbert, Jakub Kicinski
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On 16-07-07 10:53 AM, Tom Herbert wrote:
> On Thu, Jul 7, 2016 at 9:12 AM, Jakub Kicinski
> <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
>> On Thu, 7 Jul 2016 15:18:11 +0000, Fastabend, John R wrote:
>>> The other interesting thing would be to do more than just packet
>>> steering but actually run a more complete XDP program. Netronome
>>> supports this right. The question I have though is this a stacked of
>>> XDP programs one or more designated for hardware and some running in
>>> software perhaps with some annotation in the program so the hardware
>>> JIT knows where to place programs or do we expect the JIT itself to
>>> try and decide what is best to offload. I think the easiest to start
>>> with is to annotate the programs.
>>>
>>> Also as far as I know a lot of hardware can stick extra data to the
>>> front or end of a packet so you could push metadata calculated by the
>>> program here in a generic way without having to extend XDP defined
>>> metadata structures. Another option is to DMA the metadata to a
>>> specified address. With this metadata the consumer/producer XDP
>>> programs have to agree on the format but no one else.
>>
>> Yes!
>>
>> At the XDP summit we were discussing pipe-lining XDP programs in
>> general, with different stages of the pipeline potentially using
>> specific hardware capabilities or even being directly mappable on
>> fixed HW functions.
>>
>> Designating parsing as one of specialized blocks makes sense in a long
>> run, probably at the first stage with recirculation possible.  We also
>> have some parsing HW we could utilize at some point.  However, I'm
>> worried that it's too early to impose constraints and APIs.  I agree
>> that we should first set a standard way to pass metadata across tail
>> calls to facilitate any form of pipe lining, regardless of which parts
>> of pipeline HW is able to offload.
> 
> +1
> 
> I don't see any reason why XDP programs can be turned into a pipeline,
> but this is implementation based on the output of one program being
> the inout of the next.  While XDP may work with pipeline it does not
> require it or define it. This makes XDP different from P4 and the
> match-action paradigm.
> 
> Tom
> 

Sounds like we all agree. Just a note: XDP is a reasonable target
for P4; in fact, we have a P4-to-eBPF target already working. We may end
up with a set of DSLs running on top of XDP, where P4 is one of them.

.John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]   ` <D6BB30FE66EA894C9F13C9E3CDDF00F564E5FB81-5FK+k9557ZBqS6EAlXoojrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2016-07-07 16:12     ` Jakub Kicinski via iovisor-dev
@ 2016-07-08  2:22     ` Alexei Starovoitov via iovisor-dev
       [not found]       ` <20160708022210.GA12244-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-07-08  2:22 UTC (permalink / raw)
  To: Fastabend, John R
  Cc: Jakub Kicinski, netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Edward Cree,
	Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
> Hi Jesper,
> 
> I have done some previous work on proprietary systems where we used hardware to do the classification/parsing then passed a cookie to the software which used the cookie to lookup a program to run on the packet. When your programs are structured as a bunch of parsing followed by some actions this can provide real performance benefits. Also a lot of existing hardware supports this today assuming you use headers the hardware "knows" about. It's a natural model for hardware that uses a parser followed by tcam/cam/sram/etc lookup tables.

Looking at BPF programs written at PLUMgrid, Facebook and Cisco,
I can assure with full certainty that a parse/action split doesn't exist.
Parsing is always interleaved with lookups and actions.
The CPU spends a tiny fraction of its time doing parsing; lookups are the heaviest part.
Trying to split a single logical program into parsing/after-parse stages
has no practical benefit.

> If the goal is to just separate XDP traffic from non-XDP traffic you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device by the way it is called the bifurcated driver. Its not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director and then you would have a stack of XDP programs. One running in hardware and a set running on the queues in software.

The above sounds like a much better approach than Jesper's/my prog_per_ring stuff.
If we can split the NIC via SR-IOV and have a dedicated netdev via a VF just for XDP, that's a much cleaner approach.
I guess we won't need to do xdp_rxqmask after all.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]       ` <20160708022210.GA12244-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-07-08  4:05         ` John Fastabend via iovisor-dev
       [not found]           ` <577F2689.4010602-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-07-08 13:44         ` Jakub Kicinski via iovisor-dev
  1 sibling, 1 reply; 23+ messages in thread
From: John Fastabend via iovisor-dev @ 2016-07-08  4:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Fastabend, John R
  Cc: Jakub Kicinski, netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Edward Cree,
	Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
> On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
>> Hi Jesper,
>>
>> I have done some previous work on proprietary systems where we
>> used hardware to do the classification/parsing then passed a cookie to the
>> software which used the cookie to lookup a program to run on the packet.
>> When your programs are structured as a bunch of parsing followed by some
>> actions this can provide real performance benefits. Also a lot of
>> existing hardware supports this today assuming you use headers the
>> hardware "knows" about. It's a natural model for hardware that uses a
>> parser followed by tcam/cam/sram/etc lookup tables.

> looking at bpf programs written in plumgrid, facebook and cisco
> with full certainty I can assure that parse/action split doesn't exist.
> Parsing is always interleaved with lookups and actions.
> cpu spends a tiny fraction of time doing parsing. Lookups are the heaviest.

What is heavy about a lookup? Is it the key generation? The key
generation could be provided by the hardware; that is what I was really
alluding to. If your data structures are eBPF maps, though, it's probably
a hash or array table, and the benefit of leveraging hardware would
likely be much greater if/when there are software structures for LPM or
wildcard lookups.

> Trying to split single logical program into parsing/after_parse stages
> has no pracitcal benefit.
> 
>> If the goal is to just separate XDP traffic from non-XDP traffic
>> you could accomplish this with a combination of SR-IOV/macvlan to separate
>> the device queues into multiple netdevs and then run XDP on just one of
>> the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to
>> steer traffic to the netdev. This is how we support multiple networking
>> stacks on one device by the way it is called the bifurcated driver. Its
>> not too far of a stretch to think we could offload some simple XDP
>> programs to program the splitting of traffic instead of
>> cls_u32/flower/flow_director and then you would have a stack of XDP
>> programs. One running in hardware and a set running on the queues in
>> software.
> 
> the above sounds like much better approach then Jesper/mine prog_per_ring stuff.
> If we can split the nic via sriov and have dedicated netdev via VF just for XDP that's way cleaner approach.
> I guess we won't need to do xdp_rxqmask after all.
> 

Right, and this works today, so all it would require is adding the XDP
engine code to the VF drivers, which should be relatively
straightforward if you have the PF driver working.

.John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]           ` <577F2689.4010602-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-07-08  4:28             ` Alexei Starovoitov via iovisor-dev
  0 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-07-08  4:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jakub Kicinski, netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Thu, Jul 07, 2016 at 09:05:29PM -0700, John Fastabend wrote:
> On 16-07-07 07:22 PM, Alexei Starovoitov wrote:
> > On Thu, Jul 07, 2016 at 03:18:11PM +0000, Fastabend, John R wrote:
> >> Hi Jesper,
> >>
> >> I have done some previous work on proprietary systems where we
> >> used hardware to do the classification/parsing then passed a cookie to the
> >> software which used the cookie to lookup a program to run on the packet.
> >> When your programs are structured as a bunch of parsing followed by some
> >> actions this can provide real performance benefits. Also a lot of
> >> existing hardware supports this today assuming you use headers the
> >> hardware "knows" about. It's a natural model for hardware that uses a
> >> parser followed by tcam/cam/sram/etc lookup tables.
> 
> > looking at bpf programs written in plumgrid, facebook and cisco
> > with full certainty I can assure that parse/action split doesn't exist.
> > Parsing is always interleaved with lookups and actions.
> > cpu spends a tiny fraction of time doing parsing. Lookups are the heaviest.
> 
> What is heavy about a lookup? Is it the key generation? The key
> generation can be provided by the hardware is what I was really alluding
> to. If your data structures are ebpf maps though its probably a hash
> or array table and the benefit of leveraging hardware would likely be
> much better if/when there are software structures for LPM or wildcard
> lookups.

There is only a hash map in the SW, and its main cost was doing the jhash
math and the occasional miss in the hash table.
'Key generation' is only copying bytes, so it is mostly free,
just like parsing, which is a few branches that tend to be predicted
by the CPU quite well.
In the case of our L4 load balancer we need to do a consistent hash,
which fixed HW probably won't be able to provide.
Unless the HW is programmable :)
In general, when we developed and benchmarked the programs,
redesigning a program to remove an extra hash lookup gave a performance
improvement, whereas simplifying the parsing logic (like removing vlan
handling or ip options) showed no difference in performance.
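
The pattern described here, with the key built by a few byte copies and
the map lookup carrying most of the cost, looks roughly like the sketch
below (flow_key/flow_table are made-up names; it assumes no IP options
for brevity):

/* Sketch of the typical lookup pattern: key generation is a handful of
 * byte copies; the hash computation and table walk dominate the cost. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/udp.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
	__u32 saddr;
	__u32 daddr;
	__u16 dport;
	__u8  proto;
	__u8  pad;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 65536);
	__type(key, struct flow_key);
	__type(value, __u32);		/* e.g. a backend or action id */
} flow_table SEC(".maps");

SEC("xdp")
int flow_lookup(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = (void *)(eth + 1);
	struct udphdr *udp = (void *)(iph + 1);	/* assumes no IP options */
	struct flow_key key = {};
	__u32 *val;

	if ((void *)(udp + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP) || iph->protocol != IPPROTO_UDP)
		return XDP_PASS;

	/* "Key generation": just copying parsed fields. */
	key.saddr = iph->saddr;
	key.daddr = iph->daddr;
	key.dport = udp->dest;
	key.proto = iph->protocol;

	val = bpf_map_lookup_elem(&flow_table, &key);	/* the heavy part */
	if (val && *val == 0)
		return XDP_DROP;
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";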

> > Trying to split single logical program into parsing/after_parse stages
> > has no pracitcal benefit.
> > 
> >> If the goal is to just separate XDP traffic from non-XDP traffic
> >> you could accomplish this with a combination of SR-IOV/macvlan to separate
> >> the device queues into multiple netdevs and then run XDP on just one of
> >> the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to
> >> steer traffic to the netdev. This is how we support multiple networking
> >> stacks on one device by the way it is called the bifurcated driver. Its
> >> not too far of a stretch to think we could offload some simple XDP
> >> programs to program the splitting of traffic instead of
> >> cls_u32/flower/flow_director and then you would have a stack of XDP
> >> programs. One running in hardware and a set running on the queues in
> >> software.
> > 
> > the above sounds like much better approach then Jesper/mine prog_per_ring stuff.
> > If we can split the nic via sriov and have dedicated netdev via VF just for XDP that's way cleaner approach.
> > I guess we won't need to do xdp_rxqmask after all.
> > 
> 
> Right and this works today so all it would require is adding the XDP
> engine code to the VF drivers. Which should be relatively straight
> forward if you have the PF driver working.

Good point. I think the next step should be to enable XDP in the VF drivers
and measure performance.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]       ` <20160708022210.GA12244-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  2016-07-08  4:05         ` John Fastabend via iovisor-dev
@ 2016-07-08 13:44         ` Jakub Kicinski via iovisor-dev
  2016-07-08 15:19           ` Jesper Dangaard Brouer via iovisor-dev
  1 sibling, 1 reply; 23+ messages in thread
From: Jakub Kicinski via iovisor-dev @ 2016-07-08 13:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
> > If the goal is to just separate XDP traffic from non-XDP traffic you could accomplish this with a combination of SR-IOV/macvlan to separate the device queues into multiple netdevs and then run XDP on just one of the netdevs. Then use flow director (ethtool) or 'tc cls_u32/flower' to steer traffic to the netdev. This is how we support multiple networking stacks on one device by the way it is called the bifurcated driver. Its not too far of a stretch to think we could offload some simple XDP programs to program the splitting of traffic instead of cls_u32/flower/flow_director and then you would have a stack of XDP programs. One running in hardware and a set running on the queues in software.  
> 
> the above sounds like much better approach then Jesper/mine prog_per_ring stuff.
> If we can split the nic via sriov and have dedicated netdev via VF just for XDP that's way cleaner approach.
> I guess we won't need to do xdp_rxqmask after all.

+1

I was thinking about using eBPF to direct to NIC queues, but concluded
that doing a redirect to a VF is cleaner.  Especially if the PF driver
supports VF representatives, we could potentially just use
bpf_redirect(VFR netdev), and the VF doesn't even have to be handled by
the same stack.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
  2016-07-08 13:44         ` Jakub Kicinski via iovisor-dev
@ 2016-07-08 15:19           ` Jesper Dangaard Brouer via iovisor-dev
       [not found]             ` <20160708171943.0e1ce8d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-07-08 15:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha


On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:

> On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>
> > > If the goal is to just separate XDP traffic from non-XDP traffic
> > > you could accomplish this with a combination of SR-IOV/macvlan to
> > > separate the device queues into multiple netdevs and then run XDP
> > > on just one of the netdevs. Then use flow director (ethtool) or
> > > 'tc cls_u32/flower' to steer traffic to the netdev. This is how
> > > we support multiple networking stacks on one device by the way it
> > > is called the bifurcated driver. Its not too far of a stretch to
> > > think we could offload some simple XDP programs to program the
> > > splitting of traffic instead of cls_u32/flower/flow_director and
> > > then you would have a stack of XDP programs. One running in
> > > hardware and a set running on the queues in software.    
> > 
> >
> > the above sounds like much better approach then Jesper/mine
> > prog_per_ring stuff.
> >
> > If we can split the nic via sriov and have dedicated netdev via VF
> > just for XDP that's way cleaner approach. I guess we won't need to
> > do xdp_rxqmask after all.  
> 
> +1
> 
> I was thinking about using eBPF to direct to NIC queues but concluded
> that doing a redirect to a VF is cleaner.  Especially if the PF driver
> supports VF representatives we could potentially just use
> bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
> the same stack.

I actually disagree.

I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
then run a single/specific XDP program on that queue.

Why do I want this?

This is part of solving a very fundamental CS problem (early demux) when
wanting to support zero-copy on RX.  The basic problem is that the NIC
driver needs to map RX pages into the RX ring prior to receiving
packets. Thus, we need HW support to steer packets in order to gain enough
isolation (e.g. between tenant domains) to allow zero-copy.


Depending on the flexibility of the HW filter, the granularity achievable
for isolation (e.g. application specific) can be much finer than
splitting up the entire NIC with SR-IOV, VFs or macvlans.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]             ` <20160708171943.0e1ce8d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-07-08 16:07               ` Jakub Kicinski via iovisor-dev
  2016-07-08 16:45                 ` John Fastabend via iovisor-dev
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski via iovisor-dev @ 2016-07-08 16:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Fri, 8 Jul 2016 17:19:43 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
> > On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
> > > > If the goal is to just separate XDP traffic from non-XDP traffic
> > > > you could accomplish this with a combination of SR-IOV/macvlan to
> > > > separate the device queues into multiple netdevs and then run XDP
> > > > on just one of the netdevs. Then use flow director (ethtool) or
> > > > 'tc cls_u32/flower' to steer traffic to the netdev. This is how
> > > > we support multiple networking stacks on one device by the way it
> > > > is called the bifurcated driver. Its not too far of a stretch to
> > > > think we could offload some simple XDP programs to program the
> > > > splitting of traffic instead of cls_u32/flower/flow_director and
> > > > then you would have a stack of XDP programs. One running in
> > > > hardware and a set running on the queues in software.      
> > > 
> > >
> > > the above sounds like much better approach then Jesper/mine
> > > prog_per_ring stuff.
> > >
> > > If we can split the nic via sriov and have dedicated netdev via VF
> > > just for XDP that's way cleaner approach. I guess we won't need to
> > > do xdp_rxqmask after all.    
> > 
> > +1
> > 
> > I was thinking about using eBPF to direct to NIC queues but concluded
> > that doing a redirect to a VF is cleaner.  Especially if the PF driver
> > supports VF representatives we could potentially just use
> > bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
> > the same stack.  
> 
> I actually disagree.
> 
> I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
> then run a single/specific XDP program on that queue.
> 
> Why to I want this?
> 
> This part of solving a very fundamental CS problem (early demux), when
> wanting to support Zero-copy on RX.  The basic problem that the NIC
> driver need to map RX pages into the RX ring, prior to receiving
> packets. Thus, we need HW support to steer packets, for gaining enough
> isolation (e.g between tenants domains) for allowing zero-copy.
> 
> 
> Based on the flexibility of the HW-filter, the granularity achievable
> for isolation (e.g. application specific) is much more flexible.  Than
> splitting up the entire NIC with SR-IOV, VFs or macvlans.

I think of SR-IOV VFs as a way of grouping queues.  If HW is capable of
directing to a queue, it's usually capable of directing to a VF as well.
And the VF could have all other traffic disabled, so you would get only
packets directed to it by the (BPF) filter, same as you would for the
queue.  Does that make sense for zero-copy apps?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
  2016-07-08 16:07               ` Jakub Kicinski via iovisor-dev
@ 2016-07-08 16:45                 ` John Fastabend via iovisor-dev
       [not found]                   ` <577FD8A5.8020700-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: John Fastabend via iovisor-dev @ 2016-07-08 16:45 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On 16-07-08 09:07 AM, Jakub Kicinski wrote:
> On Fri, 8 Jul 2016 17:19:43 +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 8 Jul 2016 14:44:53 +0100 Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
>>> On Thu, 7 Jul 2016 19:22:12 -0700, Alexei Starovoitov wrote:
>>>>> If the goal is to just separate XDP traffic from non-XDP traffic
>>>>> you could accomplish this with a combination of SR-IOV/macvlan to
>>>>> separate the device queues into multiple netdevs and then run XDP
>>>>> on just one of the netdevs. Then use flow director (ethtool) or
>>>>> 'tc cls_u32/flower' to steer traffic to the netdev. This is how
>>>>> we support multiple networking stacks on one device by the way it
>>>>> is called the bifurcated driver. Its not too far of a stretch to
>>>>> think we could offload some simple XDP programs to program the
>>>>> splitting of traffic instead of cls_u32/flower/flow_director and
>>>>> then you would have a stack of XDP programs. One running in
>>>>> hardware and a set running on the queues in software.      
>>>>
>>>>
>>>> the above sounds like much better approach then Jesper/mine
>>>> prog_per_ring stuff.
>>>>
>>>> If we can split the nic via sriov and have dedicated netdev via VF
>>>> just for XDP that's way cleaner approach. I guess we won't need to
>>>> do xdp_rxqmask after all.    
>>>
>>> +1
>>>
>>> I was thinking about using eBPF to direct to NIC queues but concluded
>>> that doing a redirect to a VF is cleaner.  Especially if the PF driver
>>> supports VF representatives we could potentially just use
>>> bpf_redirect(VFR netdev) and the VF doesn't even have to be handled by
>>> the same stack.  
>>
>> I actually disagree.
>>
>> I _do_ want to use the "filter" part of eBPF to direct to NIC queues, and
>> then run a single/specific XDP program on that queue.
>>
>> Why to I want this?
>>
>> This part of solving a very fundamental CS problem (early demux), when
>> wanting to support Zero-copy on RX.  The basic problem that the NIC
>> driver need to map RX pages into the RX ring, prior to receiving
>> packets. Thus, we need HW support to steer packets, for gaining enough
>> isolation (e.g between tenants domains) for allowing zero-copy.
>>
>>
>> Based on the flexibility of the HW-filter, the granularity achievable
>> for isolation (e.g. application specific) is much more flexible.  Than
>> splitting up the entire NIC with SR-IOV, VFs or macvlans.
> 
> I think of SR-IOV VFs a way of grouping queues.  If HW is capable of
> directing to a queue it's usually capable of directing to a VF as well.
> And the VF could have all other traffic disabled so you would get only
> packets directed to it by the (BPF) filter - same as you would for the
> queue.  Does that make sense for zero copy apps?
> 

The only distinction between VFs and queue groupings on my side is that VFs
provide RSS, whereas queue groupings have to be selected explicitly.
In a programmable NIC world the distinction might be lost if an "RSS"
program can be loaded into the NIC to select queues, but for existing
hardware the distinction is there.

If you demux using an eBPF program or via a filter model like
flow_director or cls_{u32|flower}, I think we can support both, and this
just depends on the programmability of the hardware. Note that flow_director
and cls_{u32|flower} steering to VFs is already in place.

The question I have is whether the "filter" part of the eBPF program should
be a separate program from the XDP program, loaded using specific
semantics (e.g. a "load_hardware_demux" ndo op), at the risk of building
an ever-growing set of "ndo" ops. If you are running multiple XDP
programs on the same NIC hardware then I think this actually makes
sense; otherwise, how would the hardware, and even the software, find the
"demux" logic? In this model there is a "demux" program that selects
a queue/VF and a program that runs on the netdev queues.

Any thoughts?

.John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                   ` <577FD8A5.8020700-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-07-08 17:51                     ` Jakub Kicinski via iovisor-dev
  2016-07-09 11:27                       ` Jesper Dangaard Brouer via iovisor-dev
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski via iovisor-dev @ 2016-07-08 17:51 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
> The only distinction between VFs and queue groupings on my side is VFs
> provide RSS where as queue groupings have to be selected explicitly.
> In a programmable NIC world the distinction might be lost if a "RSS"
> program can be loaded into the NIC to select queues but for existing
> hardware the distinction is there.

To do BPF RSS we need a way to select the queue, which I think is all
Jesper wanted.  So we will have to tackle queue selection at some
point.  The main obstacle with it for me is to define what queue
selection means when the program is not offloaded to HW...  Implementing
queue selection on the HW side is trivial.

> If you demux using a eBPF program or via a filter model like
> flow_director or cls_{u32|flower} I think we can support both. And this
> just depends on the programmability of the hardware. Note flow_director
> and cls_{u32|flower} steering to VFs is already in place.

Yes, for steering to VFs we could potentially reuse a lot of existing
infrastructure.

> The question I have is should the "filter" part of the eBPF program
> be a separate program from the XDP program and loaded using specific
> semantics (e.g. "load_hardware_demux" ndo op) at the risk of building
> a ever growing set of "ndo" ops. If you are running multiple XDP
> programs on the same NIC hardware then I think this actually makes
> sense otherwise how would the hardware and even software find the
> "demux" logic. In this model there is a "demux" program that selects
> a queue/VF and a program that runs on the netdev queues.

I don't think we should enforce the separation here.  What we may want
to do before forwarding to the VF can be much more complicated than
pure demux/filtering (a simple example: popping a VLAN/tunnel header).
The VF representative model works well here as a fallback: if the program
could not be offloaded, it will be run on the host and "trombone" packets
via the VFR into the VF.

If we have a chain of BPF programs, we can order them in increasing
level of complexity/features required, and then the HW could transparently
offload the first parts (the easier ones), leaving the more complex
processing on the host.

This should probably be paired with some sort of "skip-sw" flag to let
user space enforce the HW offload on the fast path part.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
  2016-07-08 17:51                     ` Jakub Kicinski via iovisor-dev
@ 2016-07-09 11:27                       ` Jesper Dangaard Brouer via iovisor-dev
  2016-07-12  2:24                         ` Alexei Starovoitov
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-07-09 11:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, John Fastabend,
	Fastabend, John R, Edward Cree, Simon Horman, Rana Shahout,
	Or Gerlitz, Ari Saha

On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:

> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
> > The only distinction between VFs and queue groupings on my side is VFs
> > provide RSS where as queue groupings have to be selected explicitly.
> > In a programmable NIC world the distinction might be lost if a "RSS"
> > program can be loaded into the NIC to select queues but for existing
> > hardware the distinction is there.  
> 
> To do BPF RSS we need a way to select the queue which I think is all
> Jesper wanted.  So we will have to tackle the queue selection at some
> point.  The main obstacle with it for me is to define what queue
> selection means when program is not offloaded to HW...  Implementing
> queue selection on HW side is trivial.

Yes, I do see the problem of fallback when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part in
the eBPF program, as a software fallback could still apply this filter in
SW and just mark the packets as not-zero-copy-safe.  But when HW
offloading is not possible, packets can be delivered on every RX
queue, and SW would need to handle that, which is hard to keep transparent.


> > If you demux using a eBPF program or via a filter model like
> > flow_director or cls_{u32|flower} I think we can support both. And this
> > just depends on the programmability of the hardware. Note flow_director
> > and cls_{u32|flower} steering to VFs is already in place.  

Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: set up ntuple filters and (if
Alexei allows it) assign an application-specific XDP eBPF program to a
specific RX queue.

 ethtool -K eth2 ntuple on
 ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42 and
promise/guarantee that it will consume all packets.  The
backing page-pool can then allow zero-copy RX (and enable scrubbing when
refilling the pool).
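
The per-queue program for the example above could then be as simple as
the sketch below (the per-queue attach itself is the hypothetical part
of this proposal; the body uses only the regular XDP API and,
importantly, never returns XDP_PASS):

/* Sketch of a program meant to own RX queue 42 after the ntuple filter
 * above: it consumes every packet (never XDP_PASS), so the queue's
 * page-pool can stay zero-copy.  Per-queue attachment is hypothetical. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} rx_count SEC(".maps");

SEC("xdp")
int dns_queue_prog(struct xdp_md *ctx)
{
	__u32 zero = 0;
	__u64 *cnt = bpf_map_lookup_elem(&rx_count, &zero);

	if (cnt)
		(*cnt)++;

	/* Only UDP/53 for 192.168.254.1 is steered here by the HW filter.
	 * A real program would craft a reply and XDP_TX it; either way it
	 * never hands packets to the normal stack. */
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";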


> Yes, for steering to VFs we could potentially reuse a lot of existing
> infrastructure.
> 
> > The question I have is should the "filter" part of the eBPF program
> > be a separate program from the XDP program and loaded using specific
> > semantics (e.g. "load_hardware_demux" ndo op) at the risk of building
> > a ever growing set of "ndo" ops. If you are running multiple XDP
> > programs on the same NIC hardware then I think this actually makes
> > sense otherwise how would the hardware and even software find the
> > "demux" logic. In this model there is a "demux" program that selects
> > a queue/VF and a program that runs on the netdev queues.  
> 
> I don't think we should enforce the separation here.  What we may want
> to do before forwarding to the VF can be much more complicated than
> pure demux/filtering (simple eg - pop VLAN/tunnel).  VF representative
> model works well here as fallback - if program could not be offloaded
> it will be run on the host and "trombone" packets via VFR into the VF.

That is an interesting idea.

> If we have a chain of BPF programs we can order them in increasing
> level of complexity/features required and then HW could transparently
> offload the first parts - the easier ones - leaving more complex
> processing on the host.

I'll try to keep out of the discussion of how to structure the BPF
program, as it is outside my "area".
 
> This should probably be paired with some sort of "skip-sw" flag to let
> user space enforce the HW offload on the fast path part.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
  2016-07-09 11:27                       ` Jesper Dangaard Brouer via iovisor-dev
@ 2016-07-12  2:24                         ` Alexei Starovoitov
       [not found]                           ` <20160712022423.GA47757-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Alexei Starovoitov @ 2016-07-12  2:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jakub Kicinski, John Fastabend, Fastabend, John R, iovisor-dev,
	Brenden Blanco, Rana Shahout, Ari Saha, Tariq Toukan, Or Gerlitz,
	netdev, Simon Horman, Simon Horman, Edward Cree

On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 8 Jul 2016 18:51:07 +0100
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> > On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
> > > The only distinction between VFs and queue groupings on my side is VFs
> > > provide RSS where as queue groupings have to be selected explicitly.
> > > In a programmable NIC world the distinction might be lost if a "RSS"
> > > program can be loaded into the NIC to select queues but for existing
> > > hardware the distinction is there.  
> > 
> > To do BPF RSS we need a way to select the queue which I think is all
> > Jesper wanted.  So we will have to tackle the queue selection at some
> > point.  The main obstacle with it for me is to define what queue
> > selection means when program is not offloaded to HW...  Implementing
> > queue selection on HW side is trivial.
> 
> Yes, I do see the problem of fallback, when the programs "filter" demux
> cannot be offloaded to hardware.
> 
> First I though it was a good idea to keep the "demux-filter" part of
> the eBPF program, as software fallback can still apply this filter in
> SW, and just mark the packets as not-zero-copy-safe.  But when HW
> offloading is not possible, then packets can be delivered every RX
> queue, and SW would need to handle that, which hard to keep transparent.
> 
> 
> > > If you demux using a eBPF program or via a filter model like
> > > flow_director or cls_{u32|flower} I think we can support both. And this
> > > just depends on the programmability of the hardware. Note flow_director
> > > and cls_{u32|flower} steering to VFs is already in place.  
> 
> Maybe we should keep HW demuxing as a separate setup step.
> 
> Today I can almost do what I want: by setting up ntuple filters, and (if
> Alexei allows it) assign an application specific XDP eBPF program to a
> specific RX queue.
> 
>  ethtool -K eth2 ntuple on
>  ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42
> 
> Then the XDP program can be attached to RX queue 42, and
> promise/guarantee that it will consume all packet.  And then the
> backing page-pool can allow zero-copy RX (and enable scrubbing when
> refilling pool).

So such an ntuple rule will send udp4 traffic for a specific IP and port
into a queue, and then it will somehow get zero-copied to a VM?
. It looks like a lot of other pieces around zero-copy and qemu need to be
implemented (or at least architected) for this scheme to be conceivable.
. And when all that happens, what is the VM going to do with this very
specific traffic? The VM won't have any TCP or even ping?

Network virtualization traffic is typically encapsulated,
so if XDP is used to steer the traffic, the program would need
to figure out the VM id based on headers, strip the tunnel, and apply
policy before forwarding the packet further. Clearly a HW ntuple filter
is not going to suffice.

If there is no network virtualization and the VMs are operating in a
flat network, then there is no policy, no IP filter, no VM migration;
only a MAC per VM, and SR-IOV handles this case just fine.
When HW becomes more programmable, we'll be able to load an XDP program
into HW that handles the tunnel and policy and forwards into a VF; then
SR-IOV will become actually usable for cloud providers.
HW XDP into a VF is more interesting than into a queue, since there is
more than one queue/interrupt per VF and a network-heavy VM can actually
consume a large amount of traffic.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                           ` <20160712022423.GA47757-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
@ 2016-07-12 19:13                             ` John Fastabend via iovisor-dev
       [not found]                               ` <5785413D.4050901-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: John Fastabend via iovisor-dev @ 2016-07-12 19:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer
  Cc: Jakub Kicinski, netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
> On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 8 Jul 2016 18:51:07 +0100
>> Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
>>
>>> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
>>>> The only distinction between VFs and queue groupings on my side is VFs
>>>> provide RSS where as queue groupings have to be selected explicitly.
>>>> In a programmable NIC world the distinction might be lost if a "RSS"
>>>> program can be loaded into the NIC to select queues but for existing
>>>> hardware the distinction is there.  
>>>
>>> To do BPF RSS we need a way to select the queue which I think is all
>>> Jesper wanted.  So we will have to tackle the queue selection at some
>>> point.  The main obstacle with it for me is to define what queue
>>> selection means when program is not offloaded to HW...  Implementing
>>> queue selection on HW side is trivial.
>>
>> Yes, I do see the problem of fallback, when the programs "filter" demux
>> cannot be offloaded to hardware.
>>
>> First I though it was a good idea to keep the "demux-filter" part of
>> the eBPF program, as software fallback can still apply this filter in
>> SW, and just mark the packets as not-zero-copy-safe.  But when HW
>> offloading is not possible, then packets can be delivered every RX
>> queue, and SW would need to handle that, which hard to keep transparent.
>>
>>
>>>> If you demux using a eBPF program or via a filter model like
>>>> flow_director or cls_{u32|flower} I think we can support both. And this
>>>> just depends on the programmability of the hardware. Note flow_director
>>>> and cls_{u32|flower} steering to VFs is already in place.  
>>
>> Maybe we should keep HW demuxing as a separate setup step.
>>
>> Today I can almost do what I want: by setting up ntuple filters, and (if
>> Alexei allows it) assign an application specific XDP eBPF program to a
>> specific RX queue.
>>
>>  ethtool -K eth2 ntuple on
>>  ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42
>>
>> Then the XDP program can be attached to RX queue 42, and
>> promise/guarantee that it will consume all packet.  And then the
>> backing page-pool can allow zero-copy RX (and enable scrubbing when
>> refilling pool).
> 
> so such ntuple rule will send udp4 traffic for specific ip and port
> into a queue then it will somehow gets zero-copied to vm?
> . looks like a lot of other pieces about zero-copy and qemu need to be
> implemented (or at least architected) for this scheme to be conceivable
> . and when all that happens what vm is going to do with this very specific
> traffic? vm won't have any tcp or even ping?

I have perhaps a different motivation for having queue steering in 'tc
cls-u32' and eventually XDP. The general idea is that I have thousands of
queues and I can bind applications to the queues. When I know an
application is bound to a queue I can enable per-queue busy polling (to
be implemented), set specific interrupt rates on the queue
(implementation will be posted soon), bind the queue to the correct
CPU, etc.

ntuple works OK for this now, but XDP provides more flexibility and
also lets us add additional policy on the queue beyond simple
queue steering.

I'm not convinced, though, that the demux queue selection should be part
of the XDP program itself, just because it has no software analog; to me
it sits in front of the set of XDP programs. But I think I could perhaps
be convinced that it does if there is some reasonable way to do it. I guess
the single-program method would result in an XDP program that reads like:

  if (rx_queue == x)
       do_foo
  if (rx_queue == y)
       do_bar

A hardware JIT may be able to sort that out. Or use per-queue sections.
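
A rough single-program sketch of that idea, assuming the XDP context
exposes the receive queue index (struct xdp_md did later grow an
rx_queue_index field), with do_foo()/do_bar() standing in for the
per-queue logic:

/* Sketch only: do_foo()/do_bar() are placeholders for per-queue policy.
 * Assumes the XDP context exposes the RX queue (rx_queue_index). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

static __always_inline int do_foo(struct xdp_md *ctx)
{
	return XDP_TX;			/* placeholder action for queue 42 */
}

static __always_inline int do_bar(struct xdp_md *ctx)
{
	return XDP_DROP;		/* placeholder action for queue 43 */
}

SEC("xdp")
int per_queue_demux(struct xdp_md *ctx)
{
	switch (ctx->rx_queue_index) {
	case 42:
		return do_foo(ctx);
	case 43:
		return do_bar(ctx);
	default:
		return XDP_PASS;	/* queues with no program: normal path */
	}
}

char _license[] SEC("license") = "GPL";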

> 
> the network virtualization traffic is typically encapsulated,
> so if xdp is used to do steer the traffic, the program would need
> to figure out vm id based on headers, strip tunnel, apply policy before
> forwarding the packet further. Clearly hw ntuple is not going to suffice.
>
> If there is no networking virtualization and VMs are operating in the
> flat network, then there is no policy, no ip filter, no vm migration.
> Only mac per vm and sriov handles this case just fine.
> When hw becomes more programmable we'll be able to load xdp program
> into hw that does tunnel, policy and forwards into vf then sriov will
> become actually usable for cloud providers.

Yep :)

> hw xdp into vf is more interesting than into a queue, since there is
> more than one queue/interrupt per vf and network heavy vm can actually
> consume large amount of traffic.
> 

Another use case I have is to make a really high-performance AF_PACKET
interface. So if there were a way to, say, bind a queue to an AF_PACKET
ring and run a policy XDP program before hitting the AF_PACKET
descriptor bit, that would be really interesting, because it would solve
some of my need for poll-mode drivers in userspace.

.John

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                               ` <5785413D.4050901-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-07-12 19:49                                 ` Jakub Kicinski via iovisor-dev
  2016-07-12 20:32                                 ` Jesper Dangaard Brouer via iovisor-dev
  1 sibling, 0 replies; 23+ messages in thread
From: Jakub Kicinski via iovisor-dev @ 2016-07-12 19:49 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Tue, 12 Jul 2016 12:13:01 -0700, John Fastabend wrote:
> On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
> > On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:  
> >> On Fri, 8 Jul 2016 18:51:07 +0100
> >> Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
> >>  
> >>> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:  
> >>>> The only distinction between VFs and queue groupings on my side is VFs
> >>>> provide RSS where as queue groupings have to be selected explicitly.
> >>>> In a programmable NIC world the distinction might be lost if a "RSS"
> >>>> program can be loaded into the NIC to select queues but for existing
> >>>> hardware the distinction is there.    
> >>>
> >>> To do BPF RSS we need a way to select the queue which I think is all
> >>> Jesper wanted.  So we will have to tackle the queue selection at some
> >>> point.  The main obstacle with it for me is to define what queue
> >>> selection means when program is not offloaded to HW...  Implementing
> >>> queue selection on HW side is trivial.  
> >>
> >> Yes, I do see the problem of fallback, when the programs "filter" demux
> >> cannot be offloaded to hardware.
> >>
> >> First I though it was a good idea to keep the "demux-filter" part of
> >> the eBPF program, as software fallback can still apply this filter in
> >> SW, and just mark the packets as not-zero-copy-safe.  But when HW
> >> offloading is not possible, then packets can be delivered every RX
> >> queue, and SW would need to handle that, which hard to keep transparent.
> >>
> >>  
> >>>> If you demux using a eBPF program or via a filter model like
> >>>> flow_director or cls_{u32|flower} I think we can support both. And this
> >>>> just depends on the programmability of the hardware. Note flow_director
> >>>> and cls_{u32|flower} steering to VFs is already in place.    
> >>
> >> Maybe we should keep HW demuxing as a separate setup step.
> >>
> >> Today I can almost do what I want: by setting up ntuple filters, and (if
> >> Alexei allows it) assign an application specific XDP eBPF program to a
> >> specific RX queue.
> >>
> >>  ethtool -K eth2 ntuple on
> >>  ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42
> >>
> >> Then the XDP program can be attached to RX queue 42, and
> >> promise/guarantee that it will consume all packet.  And then the
> >> backing page-pool can allow zero-copy RX (and enable scrubbing when
> >> refilling pool).  
> > 
> > so such ntuple rule will send udp4 traffic for specific ip and port
> > into a queue then it will somehow gets zero-copied to vm?
> > . looks like a lot of other pieces about zero-copy and qemu need to be
> > implemented (or at least architected) for this scheme to be conceivable
> > . and when all that happens what vm is going to do with this very specific
> > traffic? vm won't have any tcp or even ping?  
> 
> I have perhaps a different motivation to have queue steering in 'tc
> cls-u32' and eventually xdp. The general idea is I have thousands of
> queues and I can bind applications to the queues. When I know an
> application is bound to a queue I can enable per queue busy polling (to
> be implemented), set specific interrupt rates on the queue
> (implementation will be posted soon), bind the queue to the correct
> cpu, etc.
> 
> ntuple works OK for this now but xdp provides more flexibility and
> also lets us add additional policy on the queue other than simply
> queue steering.
> 
> I'm not convinced though that the demux queue selection should be part
> of the XDP program itself just because it has no software analog to me
> it sits in front of the set of XDP programs. 

Yes, although if we expect XDP to be a target of offloading efforts,
putting the demux here doesn't seem like an entirely bad idea.  We
could say the demux is just an API that more capable drivers/HW can
implement.
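
A minimal sketch of what such a demux API could look like as a driver
op follows; every name in it is hypothetical and exists purely to
illustrate the shape of the interface, not to propose actual kernel
structures.

  /* Hypothetical sketch only; none of these types or callbacks exist. */
  struct xdp_demux_rule {
      __be32 dst_ip;          /* match fields the HW can classify on */
      __be16 dst_port;
      __u32  rx_queue;        /* queue (and XDP program) to steer to */
  };

  struct xdp_demux_ops {
      int (*add_rule)(struct net_device *dev,
                      const struct xdp_demux_rule *rule);
      int (*del_rule)(struct net_device *dev,
                      const struct xdp_demux_rule *rule);
  };

Drivers without such classification hardware would simply not implement
the ops, and the demux would fall back to a single program per device.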

> But I think I could perhaps
> be convinced it does if there is some reasonable way to do it. I guess
> the single program method would result in an XDP program that read like
> 
>   if (rx_queue == x)
>        do_foo
>   if (rx_queue == y)
>        do_bar
> 
> A hardware jit may be able to sort that out.

+1  

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                               ` <5785413D.4050901-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-07-12 19:49                                 ` Jakub Kicinski via iovisor-dev
@ 2016-07-12 20:32                                 ` Jesper Dangaard Brouer via iovisor-dev
       [not found]                                   ` <20160712223231.202cd122-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-07-12 20:32 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jakub Kicinski, netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz, Ari Saha

On Tue, 12 Jul 2016 12:13:01 -0700
John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On 16-07-11 07:24 PM, Alexei Starovoitov wrote:
> > On Sat, Jul 09, 2016 at 01:27:26PM +0200, Jesper Dangaard Brouer wrote:  
> >> On Fri, 8 Jul 2016 18:51:07 +0100
> >> Jakub Kicinski <jakub.kicinski-wFxRvT7yatFl57MIdRCFDg@public.gmane.org> wrote:
> >>  
> >>> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:  
> >>>> The only distinction between VFs and queue groupings on my side is VFs
> >>>> provide RSS where as queue groupings have to be selected explicitly.
> >>>> In a programmable NIC world the distinction might be lost if a "RSS"
> >>>> program can be loaded into the NIC to select queues but for existing
> >>>> hardware the distinction is there.    
> >>>
> >>> To do BPF RSS we need a way to select the queue which I think is all
> >>> Jesper wanted.  So we will have to tackle the queue selection at some
> >>> point.  The main obstacle with it for me is to define what queue
> >>> selection means when program is not offloaded to HW...  Implementing
> >>> queue selection on HW side is trivial.  
> >>
> >> Yes, I do see the problem of fallback, when the programs "filter" demux
> >> cannot be offloaded to hardware.
> >>
> >> First I though it was a good idea to keep the "demux-filter" part of
> >> the eBPF program, as software fallback can still apply this filter in
> >> SW, and just mark the packets as not-zero-copy-safe.  But when HW
> >> offloading is not possible, then packets can be delivered every RX
> >> queue, and SW would need to handle that, which hard to keep transparent.
> >>
> >>  
> >>>> If you demux using a eBPF program or via a filter model like
> >>>> flow_director or cls_{u32|flower} I think we can support both. And this
> >>>> just depends on the programmability of the hardware. Note flow_director
> >>>> and cls_{u32|flower} steering to VFs is already in place.    
> >>
> >> Maybe we should keep HW demuxing as a separate setup step.
> >>
> >> Today I can almost do what I want: by setting up ntuple filters, and (if
> >> Alexei allows it) assign an application specific XDP eBPF program to a
> >> specific RX queue.
> >>
> >>  ethtool -K eth2 ntuple on
> >>  ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42
> >>
> >> Then the XDP program can be attached to RX queue 42, and
> >> promise/guarantee that it will consume all packet.  And then the
> >> backing page-pool can allow zero-copy RX (and enable scrubbing when
> >> refilling pool).  
> > 
> > so such ntuple rule will send udp4 traffic for specific ip and port
> > into a queue then it will somehow gets zero-copied to vm?
> > . looks like a lot of other pieces about zero-copy and qemu need to be
> > implemented (or at least architected) for this scheme to be conceivable
> > . and when all that happens what vm is going to do with this very specific
> > traffic? vm won't have any tcp or even ping?  
> 
> I have perhaps a different motivation to have queue steering in 'tc
> cls-u32' and eventually xdp. The general idea is I have thousands of
> queues and I can bind applications to the queues. When I know an
> application is bound to a queue I can enable per queue busy polling (to
> be implemented), set specific interrupt rates on the queue
> (implementation will be posted soon), bind the queue to the correct
> cpu, etc.

+1 on binding applications to queues.

This is basically what our customers are requesting. They have one or
two applications that need DPDK speeds.  But they don't like dedicating
an entire NIC per application (like DPDK requires).

The basic idea is actually more fundamental.  It reminds me of Van
Jacobson's netchannels[1], where he talks about "Channelize" (slides 24+).
Creating a full "application" channel allows for a lock-free single-producer
single-consumer (SPSC) queue directly into the application.

[1] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
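
As a reference point for the "channelize" idea, a minimal user-space
SPSC ring could look like the sketch below.  It is illustrative only
and not tied to any kernel, driver, or XDP API; the producer would be
the RX path and the consumer the application.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  #define RING_SIZE 256                    /* must be a power of two */

  struct spsc_ring {
      _Atomic size_t head;                 /* written only by producer */
      _Atomic size_t tail;                 /* written only by consumer */
      void *slots[RING_SIZE];
  };

  /* Producer side (e.g. the RX path): false means the ring is full. */
  static bool spsc_push(struct spsc_ring *r, void *pkt)
  {
      size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
      size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

      if (head - tail == RING_SIZE)
          return false;
      r->slots[head & (RING_SIZE - 1)] = pkt;
      atomic_store_explicit(&r->head, head + 1, memory_order_release);
      return true;
  }

  /* Consumer side (the application): NULL means the ring is empty. */
  static void *spsc_pop(struct spsc_ring *r)
  {
      size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
      size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
      void *pkt;

      if (head == tail)
          return NULL;
      pkt = r->slots[tail & (RING_SIZE - 1)];
      atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
      return pkt;
  }

No locks are needed because each index has exactly one writer, which is
what makes the per-application channel cheap.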


> ntuple works OK for this now but xdp provides more flexibility and
> also lets us add additional policy on the queue other than simply
> queue steering.
> 
> I'm not convinced though that the demux queue selection should be part
> of the XDP program itself just because it has no software analog to me
> it sits in front of the set of XDP programs. But I think I could perhaps
> be convinced it does if there is some reasonable way to do it. I guess
> the single program method would result in an XDP program that read like
> 
>   if (rx_queue == x)
>        do_foo
>   if (rx_queue == y)
>        do_bar

Yes, that is also why I wanted an XDP program per RX queue.  But the
"channelize" concept is more important.

 
> A hardware jit may be able to sort that out. Or use per queue
> sections.
> 
> > 
> > the network virtualization traffic is typically encapsulated,
> > so if xdp is used to do steer the traffic, the program would need
> > to figure out vm id based on headers, strip tunnel, apply policy
> > before forwarding the packet further. Clearly hw ntuple is not
> > going to suffice.
> >
> > If there is no networking virtualization and VMs are operating in
> > the flat network, then there is no policy, no ip filter, no vm
> > migration. Only mac per vm and sriov handles this case just fine.
> > When hw becomes more programmable we'll be able to load xdp program
> > into hw that does tunnel, policy and forwards into vf then sriov
> > will become actually usable for cloud providers.  
> 
> Yep :)
> 
> > hw xdp into vf is more interesting than into a queue, since there is
> > more than one queue/interrupt per vf and network heavy vm can
> > actually consume large amount of traffic.
> >   
> 
> Another use case I have is to make a really high performance AF_PACKET
> interface. So if there was a way to say bind a queue to an AF_PACKET
> ring and run a policy XDP program before hitting the AF_PACKET
> descriptor bit that would be really interesting because it would solve
> some of my need for poll mode drivers in userspace.

+1 yes, a super fast AF_PACKET is also on my wish/todo list for XDP.
It would basically allow for implementing DPDK or netmap on top of XDP
(at least the RX side) without needing to run a NIC driver in userspace.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                                   ` <20160712223231.202cd122-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2016-07-26 13:31                                     ` Thomas Monjalon via iovisor-dev
  2016-07-26 16:08                                       ` [iovisor-dev] " Tom Herbert
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Monjalon via iovisor-dev @ 2016-07-26 13:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: Jakub Kicinski, adrien.mazarguil-pdR9zngts4EAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	iovisor-dev-9jONkmmOlFHEE9lA1F8Ukti2O/JbrIOy, Fastabend, John R,
	Rana Shahout, Simon Horman, Edward Cree, Or Gerlitz, Ari Saha

Hi,

About RX filtering, there is an ongoing effort in DPDK to write an API
which could leverage most of the hardware capabilities of any NIC:
	https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
	http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
I understand that XDP does not aim to support every hardware feature,
though it may be an interesting approach to check.
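
As a rough illustration of that proposal, a rule steering DNS traffic
(UDP dst-port 53) to RX queue 42, mirroring the earlier ethtool ntuple
example in this thread, might look roughly like the sketch below.  It is
written in the spirit of the linked RFC; exact type and field names may
differ from the draft and from whatever is finally merged.

  #include <rte_flow.h>
  #include <rte_byteorder.h>

  static struct rte_flow *steer_dns_to_queue(uint16_t port_id, uint16_t queue)
  {
      struct rte_flow_attr attr = { .ingress = 1 };

      struct rte_flow_item_udp udp_spec = {
          .hdr = { .dst_port = rte_cpu_to_be_16(53) },
      };
      struct rte_flow_item_udp udp_mask = {
          .hdr = { .dst_port = rte_cpu_to_be_16(0xffff) },
      };
      struct rte_flow_item pattern[] = {
          { .type = RTE_FLOW_ITEM_TYPE_ETH },
          { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
          { .type = RTE_FLOW_ITEM_TYPE_UDP,
            .spec = &udp_spec, .mask = &udp_mask },
          { .type = RTE_FLOW_ITEM_TYPE_END },
      };

      struct rte_flow_action_queue to_queue = { .index = queue };
      struct rte_flow_action actions[] = {
          { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &to_queue },
          { .type = RTE_FLOW_ACTION_TYPE_END },
      };
      struct rte_flow_error error;

      return rte_flow_create(port_id, &attr, pattern, actions, &error);
  }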

2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
> On Tue, 12 Jul 2016 12:13:01 -0700
> John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > 
> > Another use case I have is to make a really high performance AF_PACKET
> > interface. So if there was a way to say bind a queue to an AF_PACKET
> > ring and run a policy XDP program before hitting the AF_PACKET
> > descriptor bit that would be really interesting because it would solve
> > some of my need for poll mode drivers in userspace.

Have you started this work?
Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?

> +1 yes, a super fast AF_PACKET is also on my wish/todo list for XDP.
> It would basically allow for implementing DPDK or netmap on top of XDP
> (as least the RX side) without needing to run a NIC driver in userspace.

Why would TX not be possible through AF_PACKET?

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [iovisor-dev] XDP seeking input from NIC hardware vendors
  2016-07-26 13:31                                     ` Thomas Monjalon via iovisor-dev
@ 2016-07-26 16:08                                       ` Tom Herbert
       [not found]                                         ` <CALx6S35XjCsG5EmiYBpbGk9NckQbe4VbNSGLqV7h+d16PgNGKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Herbert @ 2016-07-26 16:08 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Jesper Dangaard Brouer, John Fastabend,
	Tom Herbert via iovisor-dev, Jakub Kicinski, netdev, Fastabend,
	John R, Edward Cree, Simon Horman, Rana Shahout, Or Gerlitz,
	Ari Saha, adrien.mazarguil

On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon
<thomas.monjalon@6wind.com> wrote:
> Hi,
>
> About RX filtering, there is an ongoing effort in DPDK to write an API
> which could leverage most of the hardware capabilities of any NICs:
>         https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
>         http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
> I understand that XDP does not target to support every hardware features,
> though it may be an interesting approach to check.
>
Thomas,

A major goal of XDP is to leverage and in fact encourage innovation in
hardware features. But we are asking that vendors design the APIs
with the community in mind. For instance, if XDP supports crypto
offload it should have one API that different companies can implement;
we don't want every vendor coming up with their own.

> 2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
>> On Tue, 12 Jul 2016 12:13:01 -0700
>> John Fastabend <john.fastabend@gmail.com> wrote:
>> >
>> > Another use case I have is to make a really high performance AF_PACKET
>> > interface. So if there was a way to say bind a queue to an AF_PACKET
>> > ring and run a policy XDP program before hitting the AF_PACKET
>> > descriptor bit that would be really interesting because it would solve
>> > some of my need for poll mode drivers in userspace.
>
> Have you started this work?
> Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?
>
I don't understand combining AF_PACKET with DPDK. They should be
mutually exclusive. XDP over DPDK does make sense.

Tom

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                                         ` <CALx6S35XjCsG5EmiYBpbGk9NckQbe4VbNSGLqV7h+d16PgNGKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-07-26 17:53                                           ` John Fastabend via iovisor-dev
       [not found]                                             ` <5797A381.90406-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 23+ messages in thread
From: John Fastabend via iovisor-dev @ 2016-07-26 17:53 UTC (permalink / raw)
  To: Tom Herbert, Thomas Monjalon
  Cc: Jakub Kicinski, adrien.mazarguil-pdR9zngts4EAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tom Herbert via iovisor-dev,
	Fastabend, John R, Rana Shahout, Simon Horman, Edward Cree,
	Or Gerlitz, Ari Saha

On 16-07-26 09:08 AM, Tom Herbert wrote:
> On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon
> <thomas.monjalon-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> wrote:
>> Hi,
>>
>> About RX filtering, there is an ongoing effort in DPDK to write an API
>> which could leverage most of the hardware capabilities of any NICs:
>>         https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
>>         http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
>> I understand that XDP does not target to support every hardware features,
>> though it may be an interesting approach to check.
>>
> Thomas,
> 
> A major goal of XDP is to leverage and in fact encourage innovation in
> hardware features. But, we are asking that vendors design the APIs
> with the community in mind. For instance, if XDP supports crypto
> offload it should have one API that different companies, we don't want
> every vendor coming up with their own.

The work in those threads is to create a single API for users of DPDK
to interact with their hardware. The equivalent interface in the Linux
kernel is ntuple filters from ethtool; the effort here is to make a
usable interface to manage this from an application and also expose
all the hardware features. Ethtool does a fairly poor job on both
fronts IMO.

If we evolve the mechanism to run per-rx-queue xdp programs, this
interface could easily be used to forward packets to specific rx
queues and run targeted xdp programs.

Integrating this functionality into running XDP programs as ebpf code
seems a bit challenging to me because there is no software equivalent.
Once the XDP ebpf program is running, the pkt has already landed on the rx
queue. To me the mechanism to bind XDP programs to rx queues and steer
specific flows (e.g. match a flow label and forward to a queue) needs
to be part of the runtime environment not part of the main ebpf program
itself. The runtime environment could use the above linked API. I know
we debated earlier including this in the ebpf program itself but that
really doesn't seem feasible to me. Whether the steering is expressed
as an ebpf program or an API like the one above seems like a reasonable
discussion. Perhaps a section could be used to describe the per program
filter for example which would be different from an API approach used
in the proposal above or the JIT could translate it into the above
API for devices without instruction based hardware.

Step 0 should be to show a set of compelling use cases that want to run
per-queue programs; then we can talk about the runtime.


> 
>> 2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
>>> On Tue, 12 Jul 2016 12:13:01 -0700
>>> John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>
>>>> Another use case I have is to make a really high performance AF_PACKET
>>>> interface. So if there was a way to say bind a queue to an AF_PACKET
>>>> ring and run a policy XDP program before hitting the AF_PACKET
>>>> descriptor bit that would be really interesting because it would solve
>>>> some of my need for poll mode drivers in userspace.
>>
>> Have you started this work?
>> Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?
>>
> I don't understand why the AF_PACKET with DPDK. They should be
> mutually exclusive. XDP over DPDK does make sense.
> 

Because DPDK is more than just a poll mode driver that binds to a
device. AF_PACKET could be used as a replacement for a specific poll
mode driver, but the application could still use the other libraries
provided by DPDK to build ACLs, for example, or do deep packet inspection.


> Tom
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                                             ` <5797A381.90406-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2016-07-26 18:42                                               ` Jesper Dangaard Brouer via iovisor-dev
  2016-07-26 18:58                                               ` Tom Herbert via iovisor-dev
  1 sibling, 0 replies; 23+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-07-26 18:42 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jakub Kicinski, adrien.mazarguil-pdR9zngts4EAvxtiuMwx3w,
	Tom Herbert, Tom Herbert via iovisor-dev, Fastabend, John R,
	Rana Shahout, Simon Horman, netdev-u79uwXL29TY76Z2rM5mHXA,
	Edward Cree, Or Gerlitz, Ari Saha

On Tue, 26 Jul 2016 10:53:05 -0700
John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On 16-07-26 09:08 AM, Tom Herbert wrote:
> > On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon
> > <thomas.monjalon-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> wrote:  
> >> Hi,
> >>
> >> About RX filtering, there is an ongoing effort in DPDK to write an API
> >> which could leverage most of the hardware capabilities of any NICs:
> >>         https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
> >>         http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
> >> I understand that XDP does not target to support every hardware features,
> >> though it may be an interesting approach to check.
> >>  
> > Thomas,
> > 
> > A major goal of XDP is to leverage and in fact encourage innovation in
> > hardware features. But, we are asking that vendors design the APIs
> > with the community in mind. For instance, if XDP supports crypto
> > offload it should have one API that different companies, we don't want
> > every vendor coming up with their own.  
> 
> The work in those threads is to create a single API for users of DPDK
> to interact with their hardware. The equivalent interface in Linux
> kernel is ntuple filters from ethtool the effort here is to make a
> usable interface to manage this from an application and also expose
> all the hardware features. Ethtool does a fairly poor job on both
> fronts IMO.

Yes, the ethtool + ntuple approach is unfortunately not very easy to use :-(


> If we evolve the mechanism to run per rx queue xdp programs this
> interface could easily be used to forward packets to specific rx
> queues and run targeted xdp programs.
> 
> Integrating this functionality into running XDP programs as ebpf code
> seems a bit challenging to me because there is no software equivalent.
> Once XDP ebpf program is running the pkt has already landed on the rx
> queue. To me the mechanism to bind XDP programs to rx queues and steer
> specific flows (e.g. match a flow label and forward to a queue) needs
> to be part of the runtime environment not part of the main ebpf program
> itself. 

I agree, based on the discussion in this thread. Will admit that my
initial idea of adding this filter to the eBPF/XDP program was not such
a good idea.

> The runtime environment could use the above linked API. I know
> we debated earlier including this in the ebpf program itself but that
> really doesn't seem feasible to me. Whether the steering is expresses
> as an ebpf program or an API like above seems like a reasonable
> discussion. Perhaps a section could be used to describe the per program
> filter for example which would be different from an API approach used
> in the proposal above or the JIT could translate it into the above
> API for devices without instruction based hardware.

It seems like someone actually put some thought into this, in the link
you sent... quite interesting, thanks:
 https://rawgit.com/6WIND/rte_flow/master/rte_flow.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: XDP seeking input from NIC hardware vendors
       [not found]                                             ` <5797A381.90406-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2016-07-26 18:42                                               ` Jesper Dangaard Brouer via iovisor-dev
@ 2016-07-26 18:58                                               ` Tom Herbert via iovisor-dev
  1 sibling, 0 replies; 23+ messages in thread
From: Tom Herbert via iovisor-dev @ 2016-07-26 18:58 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jakub Kicinski, adrien.mazarguil-pdR9zngts4EAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, Tom Herbert via iovisor-dev,
	Fastabend, John R, Rana Shahout, Simon Horman, Edward Cree,
	Or Gerlitz, Ari Saha

On Tue, Jul 26, 2016 at 10:53 AM, John Fastabend
<john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 16-07-26 09:08 AM, Tom Herbert wrote:
>> On Tue, Jul 26, 2016 at 6:31 AM, Thomas Monjalon
>> <thomas.monjalon-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> wrote:
>>> Hi,
>>>
>>> About RX filtering, there is an ongoing effort in DPDK to write an API
>>> which could leverage most of the hardware capabilities of any NICs:
>>>         https://rawgit.com/6WIND/rte_flow/master/rte_flow.html
>>>         http://thread.gmane.org/gmane.comp.networking.dpdk.devel/43352
>>> I understand that XDP does not target to support every hardware features,
>>> though it may be an interesting approach to check.
>>>
>> Thomas,
>>
>> A major goal of XDP is to leverage and in fact encourage innovation in
>> hardware features. But, we are asking that vendors design the APIs
>> with the community in mind. For instance, if XDP supports crypto
>> offload it should have one API that different companies, we don't want
>> every vendor coming up with their own.
>
> The work in those threads is to create a single API for users of DPDK
> to interact with their hardware. The equivalent interface in Linux
> kernel is ntuple filters from ethtool the effort here is to make a
> usable interface to manage this from an application and also expose
> all the hardware features. Ethtool does a fairly poor job on both
> fronts IMO.
>
> If we evolve the mechanism to run per rx queue xdp programs this
> interface could easily be used to forward packets to specific rx
> queues and run targeted xdp programs.
>
> Integrating this functionality into running XDP programs as ebpf code
> seems a bit challenging to me because there is no software equivalent.
> Once XDP ebpf program is running the pkt has already landed on the rx
> queue. To me the mechanism to bind XDP programs to rx queues and steer
> specific flows (e.g. match a flow label and forward to a queue) needs
> to be part of the runtime environment not part of the main ebpf program
> itself. The runtime environment could use the above linked API. I know
> we debated earlier including this in the ebpf program itself but that
> really doesn't seem feasible to me. Whether the steering is expresses
> as an ebpf program or an API like above seems like a reasonable
> discussion. Perhaps a section could be used to describe the per program
> filter for example which would be different from an API approach used
> in the proposal above or the JIT could translate it into the above
> API for devices without instruction based hardware.
>
I think you're conflating two different mechanisms. If the device has
the capability to differentiate packets and steer them to different
queues, then XDP should be able to make use of that by running
different programs appropriate for each queue. Packet steering is the
domain of HW; for that we have ntuple filtering now, but there is no
reason to believe that XDP won't be used for that also. A per-queue
program in the host is just configuration, i.e. bind this program to
that queue. Even if the first is not ready, I don't see why the second
is so complex; it should just be a matter of per-queue configuration,
for which there is already a lot of infrastructure.

> Step 0 should be to show a set of compelling use cases that want to run
> per queue programs then we can talk about the runtime.

Imagine we are able to split VM traffic and non-VM traffic to
different RX queues. The programs we would want to run would be very
different in those two cases.

Tom

>
>
>>
>>> 2016-07-12 22:32, Jesper Dangaard Brouer via iovisor-dev:
>>>> On Tue, 12 Jul 2016 12:13:01 -0700
>>>> John Fastabend <john.fastabend-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>>>
>>>>> Another use case I have is to make a really high performance AF_PACKET
>>>>> interface. So if there was a way to say bind a queue to an AF_PACKET
>>>>> ring and run a policy XDP program before hitting the AF_PACKET
>>>>> descriptor bit that would be really interesting because it would solve
>>>>> some of my need for poll mode drivers in userspace.
>>>
>>> Have you started this work?
>>> Do you have an idea of how RX would perform through XDP + AF_PACKET + DPDK?
>>>
>> I don't understand why the AF_PACKET with DPDK. They should be
>> mutually exclusive. XDP over DPDK does make sense.
>>
>
> Because DPDK is more than just a poll mode driver that binds to a
> device. AF_Packet could be used as a replacement for a specific poll
> mode driver but the application could still use the other libraries
> provided by DPDK to build ACLs for example or deep packet inspection.
>
>
>> Tom
>>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-07-26 18:58 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-07 10:42 XDP seeking input from NIC hardware vendors Jesper Dangaard Brouer via iovisor-dev
2016-07-07 15:18 ` Fastabend, John R
     [not found]   ` <D6BB30FE66EA894C9F13C9E3CDDF00F564E5FB81-5FK+k9557ZBqS6EAlXoojrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2016-07-07 16:12     ` Jakub Kicinski via iovisor-dev
2016-07-07 17:53       ` Tom Herbert via iovisor-dev
     [not found]         ` <CALx6S36BADKByJAYQLMXBx1NEDaqn6fdqsCk-OdgNo5vgHrO1Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-07-07 21:33           ` John Fastabend via iovisor-dev
2016-07-08  2:22     ` Alexei Starovoitov via iovisor-dev
     [not found]       ` <20160708022210.GA12244-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-07-08  4:05         ` John Fastabend via iovisor-dev
     [not found]           ` <577F2689.4010602-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-07-08  4:28             ` Alexei Starovoitov via iovisor-dev
2016-07-08 13:44         ` Jakub Kicinski via iovisor-dev
2016-07-08 15:19           ` Jesper Dangaard Brouer via iovisor-dev
     [not found]             ` <20160708171943.0e1ce8d7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-07-08 16:07               ` Jakub Kicinski via iovisor-dev
2016-07-08 16:45                 ` John Fastabend via iovisor-dev
     [not found]                   ` <577FD8A5.8020700-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-07-08 17:51                     ` Jakub Kicinski via iovisor-dev
2016-07-09 11:27                       ` Jesper Dangaard Brouer via iovisor-dev
2016-07-12  2:24                         ` Alexei Starovoitov
     [not found]                           ` <20160712022423.GA47757-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-07-12 19:13                             ` John Fastabend via iovisor-dev
     [not found]                               ` <5785413D.4050901-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-07-12 19:49                                 ` Jakub Kicinski via iovisor-dev
2016-07-12 20:32                                 ` Jesper Dangaard Brouer via iovisor-dev
     [not found]                                   ` <20160712223231.202cd122-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-07-26 13:31                                     ` Thomas Monjalon via iovisor-dev
2016-07-26 16:08                                       ` [iovisor-dev] " Tom Herbert
     [not found]                                         ` <CALx6S35XjCsG5EmiYBpbGk9NckQbe4VbNSGLqV7h+d16PgNGKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-07-26 17:53                                           ` John Fastabend via iovisor-dev
     [not found]                                             ` <5797A381.90406-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-07-26 18:42                                               ` Jesper Dangaard Brouer via iovisor-dev
2016-07-26 18:58                                               ` Tom Herbert via iovisor-dev
