All of lore.kernel.org
 help / color / mirror / Atom feed
* Focusing the XDP project
@ 2017-02-20 10:13 Jesper Dangaard Brouer
  2017-02-20 20:09 ` Alexander Duyck
  0 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-20 10:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexander Duyck, John Fastabend, David Miller
  Cc: brouer, Saeed Mahameed, Tom Herbert, netdev, Brenden Blanco


First thing to bring in order for the XDP project:

  RX batching is missing.

I don't want to discuss packet page-sizes or multi-port forwarding,
before we have established the most fundamental principal that all
other solution use; RX batching.

Without building in RX batching, from the beginning/now, the XDP
architecture have lost.  As adding features and capabilities, will
just lead us back to the exact same performance problems as before!


Today we already have the 64 packets NAPI budget, but we are not
taking advantage of this. For XDP as long as eBPF always return
XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
(as code fits within the icache) and see huge perf boosts.

The initial principal of bulking/batching packets to amortize per
packet costs.  The next step is just as important: Lookup table sizes
(FIB) kills performance again. The solution is implementing a smart
table lookup scheme that prefetch hash table key-cells and afterwards
prefetch data-cells, based on the RX batch of packets.  Notice VPP
revolves around similar tricks, and why it beats DPDK, and why it
scales with 1Millon routes.

I hope I've made it very clear where the focus for XDP should be.
This involves implementing what I call RX-stages in the drivers. While
doing that we can figure out the most optimal data structure for
packet batching.
 I know Saeed is already working on RX-stages for mlx5, and I've tested
the initial version of his patch, and the results are excellent.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 10:13 Focusing the XDP project Jesper Dangaard Brouer
@ 2017-02-20 20:09 ` Alexander Duyck
  2017-02-20 22:57   ` Saeed Mahameed
  2017-02-20 23:39   ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 20+ messages in thread
From: Alexander Duyck @ 2017-02-20 20:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, John Fastabend, David Miller, Saeed Mahameed,
	Tom Herbert, netdev, Brenden Blanco

On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> First thing to bring in order for the XDP project:
>
>   RX batching is missing.
>
> I don't want to discuss packet page-sizes or multi-port forwarding,
> before we have established the most fundamental principal that all
> other solution use; RX batching.

That is all well and good, but some of us would like to discuss other
items as it has a direct impact on our driver implementation and
future driver design.  Rx batching really seems tangential to the
whole XDP discussion anyway unless you are talking about rewriting the
core BPF code and kernel API itself to process multiple frames at a
time.

That said, if something seems like it would break the concept you have
for Rx batching please bring it up.  What I would like to see is well
defined APIs and a usable interface so that I can present XDP to
management and they will see the use of it and be willing to let me
dedicate developer heads to enabling it on our drivers.

> Without building in RX batching, from the beginning/now, the XDP
> architecture have lost.  As adding features and capabilities, will
> just lead us back to the exact same performance problems as before!

I would argue you have much bigger issues to deal with.  Here is a short list:
1.  The Tx code is mostly just a toy.  We need support for more
functional use cases.
2.  1 page per packet is costly, and blocks use on the intel drivers,
mlx4 (after Eric's patches), and 64K page architectures.
3.  Should we support scatter-gather to support 9K jumbo frames
instead of allocating order 2 pages?

Focusing on Rx batching seems like bike shedding more than anything
else.  I would much rather be focused on what the API definitions
should be for the drivers and the BPF code rather than focus on the
inner workings of the drivers themselves.  Then at that point we can
start looking at expanding this out to other drivers and coming up
with good test cases to test the functionality.  We really need the
interfaces clearly defines so that we can then look at having those
pulled into the distros so we have some sort of ABI we can work with
in customer environments.

Dropping frames is all well and good, but only so useful.  With the
addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
pages so we could now do encap/decap type workloads.  If we can add
support for routing pages between interfaces that gets us close to
being able to OVS style demos.  At that point we can then start
comparing ourselves to DPDK and FD.io and seeing what we can do to
improve performance.

> Today we already have the 64 packets NAPI budget, but we are not
> taking advantage of this. For XDP as long as eBPF always return
> XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
> (as code fits within the icache) and see huge perf boosts.

This makes a lot of assumptions.  First, the budget is up to 64, it
isn't always 64.  Second, you say we are "falsely" seeing icache
improvements, and I would argue that it isn't false as we are
intentionally bypassing most of the stack to perform the drop early.
That was kind of the point of all this.  Finally, this completely
discounts GRO/LRO which would take care of aggregating the frames and
reducing much of this overhead for TCP flows being received over the
interface.

> The initial principal of bulking/batching packets to amortize per
> packet costs.  The next step is just as important: Lookup table sizes
> (FIB) kills performance again. The solution is implementing a smart
> table lookup scheme that prefetch hash table key-cells and afterwards
> prefetch data-cells, based on the RX batch of packets.  Notice VPP
> revolves around similar tricks, and why it beats DPDK, and why it
> scales with 1Millon routes.

This is where things go completely sideways in your argument.  If you
want to implement some sort of custom FIB lookup library for XDP be my
guest.  If you are talking about hacking on the kernel I would
question how this is even related to XDP?  The lookup that is in the
kernel is designed to provide the best possible lookup under a number
of different conditions.  It is a "jack of all trades, master of none"
type of implementation.

Also, why should we be focused on FIB?  Seems like this is getting
back into routing territory and what I am looking for is uses well
beyond just routing.

> I hope I've made it very clear where the focus for XDP should be.
> This involves implementing what I call RX-stages in the drivers. While
> doing that we can figure out the most optimal data structure for
> packet batching.

Yes Jesper, your point of view is clear.  This is the same agenda you
have been pushing for the last several years.  I just don't see how
this can be made a priority now for a project where it isn't even
necessarily related.  In order for any of this to work the stack needs
support for bulk Rx, and we still seem pretty far from that happening.

>  I know Saeed is already working on RX-stages for mlx5, and I've tested
> the initial version of his patch, and the results are excellent.

That is great!  I look forward to seeing it when they push it to net-next.

By the way, after looking over the mlx5 driver it seems like there is
a bug in the logic.  From what I can tell it is using build_skb to
build frames around the page, but it doesn't bother to take care of
handling the mappings correctly.  So mlx5 can end up with data
corruption when the pages are unmapped.  My advice would be to look at
updating the driver to do something like what I did in ixgbe to make
use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
invalidate any updates made when adding headers or shared info.

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 20:09 ` Alexander Duyck
@ 2017-02-20 22:57   ` Saeed Mahameed
  2017-02-20 23:40     ` Alexander Duyck
  2017-02-21 16:35     ` Tom Herbert
  2017-02-20 23:39   ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 20+ messages in thread
From: Saeed Mahameed @ 2017-02-20 22:57 UTC (permalink / raw)
  To: Alexander Duyck, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, John Fastabend, David Miller, Tom Herbert,
	netdev, Brenden Blanco



On 02/20/2017 10:09 PM, Alexander Duyck wrote:
> On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> First thing to bring in order for the XDP project:
>>
>>   RX batching is missing.
>>
>> I don't want to discuss packet page-sizes or multi-port forwarding,
>> before we have established the most fundamental principal that all
>> other solution use; RX batching.
> 
> That is all well and good, but some of us would like to discuss other
> items as it has a direct impact on our driver implementation and
> future driver design.  Rx batching really seems tangential to the
> whole XDP discussion anyway unless you are talking about rewriting the
> core BPF code and kernel API itself to process multiple frames at a
> time.
> 
> That said, if something seems like it would break the concept you have
> for Rx batching please bring it up.  What I would like to see is well
> defined APIs and a usable interface so that I can present XDP to
> management and they will see the use of it and be willing to let me
> dedicate developer heads to enabling it on our drivers.
> 
>> Without building in RX batching, from the beginning/now, the XDP
>> architecture have lost.  As adding features and capabilities, will
>> just lead us back to the exact same performance problems as before!
> 
> I would argue you have much bigger issues to deal with.  Here is a short list:
> 1.  The Tx code is mostly just a toy.  We need support for more
> functional use cases.
> 2.  1 page per packet is costly, and blocks use on the intel drivers,
> mlx4 (after Eric's patches), and 64K page architectures.
> 3.  Should we support scatter-gather to support 9K jumbo frames
> instead of allocating order 2 pages?
> 
> Focusing on Rx batching seems like bike shedding more than anything
> else.  I would much rather be focused on what the API definitions
> should be for the drivers and the BPF code rather than focus on the
> inner workings of the drivers themselves.  Then at that point we can
> start looking at expanding this out to other drivers and coming up
> with good test cases to test the functionality.  We really need the
> interfaces clearly defines so that we can then look at having those
> pulled into the distros so we have some sort of ABI we can work with
> in customer environments.
> 
> Dropping frames is all well and good, but only so useful.  With the
> addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
> pages so we could now do encap/decap type workloads.  If we can add
> support for routing pages between interfaces that gets us close to
> being able to OVS style demos.  At that point we can then start
> comparing ourselves to DPDK and FD.io and seeing what we can do to
> improve performance.
> 

Well, although I think Jesper is a little bit exaggerating ;) I guess he has a point 
and i am on his side on this discussion. you see, if we define the APIs and ABIs now
and they turn out to be a bottleneck for the whole XDP arch performance, at that 
point it will be too late to compare XDP to DPDK and other kernel bypass solutions.

What we need to do is to bring XDP to a state where it performs at least the same as other
kernel bypass solutions. I know that the DPDK team here at mellanox spent years working
on DPDK performance, squeezing every bit out of the code/dcache/icache/cpu you name it..
We simply need to do the same for XDP to prove it worthy and can deliver the required
rates. Only then, when we have the performance baseline numbers, we can start expanding XDP features
and defining new use cases and a uniform API, while making sure the performance is kept at it max.

Yes, there is a down side to this, that currently most of the optimizations and implementations we can do 
are inside the device driver and they are driver dependent, but once we have a clear image
on how things should work, we can pause and think on how to generalize the approaches 
to all device drivers. 

>> Today we already have the 64 packets NAPI budget, but we are not
>> taking advantage of this. For XDP as long as eBPF always return
>> XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
>> (as code fits within the icache) and see huge perf boosts.
> 
> This makes a lot of assumptions.  First, the budget is up to 64, it
> isn't always 64.  Second, you say we are "falsely" seeing icache
> improvements, and I would argue that it isn't false as we are
> intentionally bypassing most of the stack to perform the drop early.
> That was kind of the point of all this.  Finally, this completely
> discounts GRO/LRO which would take care of aggregating the frames and
> reducing much of this overhead for TCP flows being received over the
> interface.
> 
>> The initial principal of bulking/batching packets to amortize per
>> packet costs.  The next step is just as important: Lookup table sizes
>> (FIB) kills performance again. The solution is implementing a smart
>> table lookup scheme that prefetch hash table key-cells and afterwards
>> prefetch data-cells, based on the RX batch of packets.  Notice VPP
>> revolves around similar tricks, and why it beats DPDK, and why it
>> scales with 1Millon routes.
> 
> This is where things go completely sideways in your argument.  If you
> want to implement some sort of custom FIB lookup library for XDP be my
> guest.  If you are talking about hacking on the kernel I would
> question how this is even related to XDP?  The lookup that is in the
> kernel is designed to provide the best possible lookup under a number
> of different conditions.  It is a "jack of all trades, master of none"
> type of implementation.
> 
> Also, why should we be focused on FIB?  Seems like this is getting
> back into routing territory and what I am looking for is uses well
> beyond just routing.
> 
>> I hope I've made it very clear where the focus for XDP should be.
>> This involves implementing what I call RX-stages in the drivers. While
>> doing that we can figure out the most optimal data structure for
>> packet batching.
> 
> Yes Jesper, your point of view is clear.  This is the same agenda you
> have been pushing for the last several years.  I just don't see how
> this can be made a priority now for a project where it isn't even
> necessarily related.  In order for any of this to work the stack needs
> support for bulk Rx, and we still seem pretty far from that happening.
> 
>>  I know Saeed is already working on RX-stages for mlx5, and I've tested
>> the initial version of his patch, and the results are excellent.
> 
> That is great!  I look forward to seeing it when they push it to net-next.
> 
> By the way, after looking over the mlx5 driver it seems like there is
> a bug in the logic.  From what I can tell it is using build_skb to
> build frames around the page, but it doesn't bother to take care of
> handling the mappings correctly.  So mlx5 can end up with data
> corruption when the pages are unmapped.  My advice would be to look at
> updating the driver to do something like what I did in ixgbe to make
> use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
> invalidate any updates made when adding headers or shared info.
> 

hmmm, are you talking about the mlx5 rx page cache ? will take a look at the ixgbe code for sure
but we didn't experience any issue of the sort, can you shed more light on the issue ?

Thanks,
-Saeed.

> - Alex
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 20:09 ` Alexander Duyck
  2017-02-20 22:57   ` Saeed Mahameed
@ 2017-02-20 23:39   ` Jesper Dangaard Brouer
  2017-02-21  0:39     ` Alexander Duyck
  1 sibling, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-20 23:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexei Starovoitov, John Fastabend, David Miller, Saeed Mahameed,
	Tom Herbert, netdev, Brenden Blanco, brouer

On Mon, 20 Feb 2017 12:09:30 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > First thing to bring in order for the XDP project:
> >
> >   RX batching is missing.
> >
> > I don't want to discuss packet page-sizes or multi-port forwarding,
> > before we have established the most fundamental principal that all
> > other solution use; RX batching.  
> 
> That is all well and good, but some of us would like to discuss other
> items as it has a direct impact on our driver implementation and
> future driver design.  Rx batching really seems tangential to the
> whole XDP discussion anyway unless you are talking about rewriting the
> core BPF code and kernel API itself to process multiple frames at a
> time.

If I could change the BPF XDP program to take/process multiple frames
at a time, I would, but this is likely too late ABI wise?  As the BPF
programs are so small, we can simply simulate "bulking" by calling the
BPF prog in a loop (this is sort of already happening with XDP_DROP and
XDP_TX as the code path/size is so small) and it is good-enough.


> That said, if something seems like it would break the concept you have
> for Rx batching please bring it up.  What I would like to see is well
> defined APIs and a usable interface so that I can present XDP to
> management and they will see the use of it and be willing to let me
> dedicate developer heads to enabling it on our drivers.

What I'm afraid of is that you/we start to define APIs for multi-port
XDP forwarding, without supporting batching/bundling, because RX
batching layer is not ready yet.

 
> > Without building in RX batching, from the beginning/now, the XDP
> > architecture have lost.  As adding features and capabilities, will
> > just lead us back to the exact same performance problems as before!  

 
> I would argue you have much bigger issues to deal with.  Here is a short list:
>
> 1. The Tx code is mostly just a toy.  We need support for more
>    functional use cases.

XDP_TX do have real-life usage.

Multi-port TX or forwarding need to be designed right. I would like to
see that we think further than ifindex'es.   Can a simple vport mapping
table, that maps vport to ifindex, also be used for mapping a vport to
a socket?


> 2.  1 page per packet is costly, and blocks use on the intel drivers,
> mlx4 (after Eric's patches), and 64K page architectures.

XDP have opened the door to allow us to change the memory model for the
drivers.  This is needed big time.  The major performance bottleneck
for networking lies in memory management overhead.  Memory management
is a key for all the bypass solutions.

The Intel drivers are blocked, because you have solved the problem
partly (for short lived packets). The solution is brilliant, no doubt,
but IMHO it have its limitations. And I don't get why you insist XDP
should inherit these limitations. The page recycling is tied hard to
the size of the RX ring, that is a problem.  As both Eric and Tariq
realized they needed increase the RX ring size to 4096 to get TCP
performance.  Sharing the page makes in unpredictable when the
cache-line of struct page, will get cache-line-refcnt bounced, which is
a problem for the XDP performance target.


> 3.  Should we support scatter-gather to support 9K jumbo frames
> instead of allocating order 2 pages?

>From the start, we have chosen that enabling result in disabling some
features.  We explicitly choose not to support jumbo frames.
XDP buffers need to be keep as simple as possible.

 
> Focusing on Rx batching seems like bike shedding more than anything
> else.  I would much rather be focused on what the API definitions
> should be for the drivers and the BPF code rather than focus on the
> inner workings of the drivers themselves.  Then at that point we can
> start looking at expanding this out to other drivers and coming up
> with good test cases to test the functionality.  We really need the
> interfaces clearly defines so that we can then look at having those
> pulled into the distros so we have some sort of ABI we can work with
> in customer environments.
> 
> Dropping frames is all well and good, but only so useful.  With the
> addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
> pages so we could now do encap/decap type workloads.  If we can add
> support for routing pages between interfaces that gets us close to
> being able to OVS style demos.  At that point we can then start
> comparing ourselves to DPDK and FD.io and seeing what we can do to
> improve performance.

I just hope you/we design interfaces with bundling in mind, as the
tricks FD.io uses requires that...  We can likely compete with DPDK
speeds, for toy examples, but for more realistic use-case with larger
code and large tables, FD.io/VPP can beat us, if we don't think in
bundling.

 
> > Today we already have the 64 packets NAPI budget, but we are not
> > taking advantage of this. For XDP as long as eBPF always return
> > XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
> > (as code fits within the icache) and see huge perf boosts.  
> 
> This makes a lot of assumptions.  First, the budget is up to 64, it
> isn't always 64.  Second, you say we are "falsely" seeing icache
> improvements, and I would argue that it isn't false as we are
> intentionally bypassing most of the stack to perform the drop early.

By falsely I mean, our toy examples always return XDP_DROP, they don't
test the case where some packets also go to the netstack.  Once a
packet travel to the netstack, it will have flushed the eBPF icache
once it returns, and the eBPF code is reloaded (that was the point).


> That was kind of the point of all this.  Finally, this completely
> discounts GRO/LRO which would take care of aggregating the frames and
> reducing much of this overhead for TCP flows being received over the
> interface.

XDP does not hurt GRO, but bundling packets for GRO processing still
work. XDP does requires to disable HW LRO.  It is a trade off, if you
want LRO don't load a XDP program.


> > The initial principal of bulking/batching packets to amortize per
> > packet costs.  The next step is just as important: Lookup table sizes
> > (FIB) kills performance again. The solution is implementing a smart
> > table lookup scheme that prefetch hash table key-cells and afterwards
> > prefetch data-cells, based on the RX batch of packets.  Notice VPP
> > revolves around similar tricks, and why it beats DPDK, and why it
> > scales with 1Millon routes.  
> 
> This is where things go completely sideways in your argument.  If you
> want to implement some sort of custom FIB lookup library for XDP be my
> guest.  If you are talking about hacking on the kernel I would
> question how this is even related to XDP?  The lookup that is in the
> kernel is designed to provide the best possible lookup under a number
> of different conditions.  It is a "jack of all trades, master of none"
> type of implementation.

I do see your point, that RX bundling can also help the normal
netstack.  You did some amazing work in optimizing the FIB-trie lookup
(thanks for that!).  I guess, we could implement FIB lookup with
prefetching latency hiding based on having a bundle of packets.


> Also, why should we be focused on FIB?  Seems like this is getting
> back into routing territory and what I am looking for is uses well
> beyond just routing.

I'm not focused on FIB at all... I just pointed out that I would to get
an XDP forward arch that support sending packet round via "vports" for
maximum flexibility.


> > I hope I've made it very clear where the focus for XDP should be.
> > This involves implementing what I call RX-stages in the drivers. While
> > doing that we can figure out the most optimal data structure for
> > packet batching.  
> 
> Yes Jesper, your point of view is clear.  This is the same agenda you
> have been pushing for the last several years.  I just don't see how
> this can be made a priority now for a project where it isn't even
> necessarily related.  In order for any of this to work the stack needs
> support for bulk Rx, and we still seem pretty far from that happening.

I'm thinking we can implement this kind of bundling for the XDP use-case
first, as most of this can be contained within the driver.  After we
get experience with what works, we can take the next step and also
support bulking towards the netstack.

 
> >  I know Saeed is already working on RX-stages for mlx5, and I've tested
> > the initial version of his patch, and the results are excellent.  
> 
> That is great!  I look forward to seeing it when they push it to net-next.
> 
> By the way, after looking over the mlx5 driver it seems like there is
> a bug in the logic.  From what I can tell it is using build_skb to
> build frames around the page, but it doesn't bother to take care of
> handling the mappings correctly.  So mlx5 can end up with data
> corruption when the pages are unmapped.  My advice would be to look at
> updating the driver to do something like what I did in ixgbe to make
> use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
> invalidate any updates made when adding headers or shared info.

I guess, Saeed would need to look at this!

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 22:57   ` Saeed Mahameed
@ 2017-02-20 23:40     ` Alexander Duyck
  2017-02-21 23:08       ` Saeed Mahameed
  2017-02-21 16:35     ` Tom Herbert
  1 sibling, 1 reply; 20+ messages in thread
From: Alexander Duyck @ 2017-02-20 23:40 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Jesper Dangaard Brouer, Alexei Starovoitov, John Fastabend,
	David Miller, Tom Herbert, netdev, Brenden Blanco

On Mon, Feb 20, 2017 at 2:57 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>
> On 02/20/2017 10:09 PM, Alexander Duyck wrote:
>> On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>>
>>> First thing to bring in order for the XDP project:
>>>
>>>   RX batching is missing.
>>>
>>> I don't want to discuss packet page-sizes or multi-port forwarding,
>>> before we have established the most fundamental principal that all
>>> other solution use; RX batching.
>>
>> That is all well and good, but some of us would like to discuss other
>> items as it has a direct impact on our driver implementation and
>> future driver design.  Rx batching really seems tangential to the
>> whole XDP discussion anyway unless you are talking about rewriting the
>> core BPF code and kernel API itself to process multiple frames at a
>> time.
>>
>> That said, if something seems like it would break the concept you have
>> for Rx batching please bring it up.  What I would like to see is well
>> defined APIs and a usable interface so that I can present XDP to
>> management and they will see the use of it and be willing to let me
>> dedicate developer heads to enabling it on our drivers.
>>
>>> Without building in RX batching, from the beginning/now, the XDP
>>> architecture have lost.  As adding features and capabilities, will
>>> just lead us back to the exact same performance problems as before!
>>
>> I would argue you have much bigger issues to deal with.  Here is a short list:
>> 1.  The Tx code is mostly just a toy.  We need support for more
>> functional use cases.
>> 2.  1 page per packet is costly, and blocks use on the intel drivers,
>> mlx4 (after Eric's patches), and 64K page architectures.
>> 3.  Should we support scatter-gather to support 9K jumbo frames
>> instead of allocating order 2 pages?
>>
>> Focusing on Rx batching seems like bike shedding more than anything
>> else.  I would much rather be focused on what the API definitions
>> should be for the drivers and the BPF code rather than focus on the
>> inner workings of the drivers themselves.  Then at that point we can
>> start looking at expanding this out to other drivers and coming up
>> with good test cases to test the functionality.  We really need the
>> interfaces clearly defines so that we can then look at having those
>> pulled into the distros so we have some sort of ABI we can work with
>> in customer environments.
>>
>> Dropping frames is all well and good, but only so useful.  With the
>> addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
>> pages so we could now do encap/decap type workloads.  If we can add
>> support for routing pages between interfaces that gets us close to
>> being able to OVS style demos.  At that point we can then start
>> comparing ourselves to DPDK and FD.io and seeing what we can do to
>> improve performance.
>>
>
> Well, although I think Jesper is a little bit exaggerating ;) I guess he has a point
> and i am on his side on this discussion. you see, if we define the APIs and ABIs now
> and they turn out to be a bottleneck for the whole XDP arch performance, at that
> point it will be too late to compare XDP to DPDK and other kernel bypass solutions.

Yes, but at the same time we cannot hold due to decision paralysis.
We should be moving forward, not holding waiting on things that may or
may not get done.

> What we need to do is to bring XDP to a state where it performs at least the same as other
> kernel bypass solutions. I know that the DPDK team here at mellanox spent years working
> on DPDK performance, squeezing every bit out of the code/dcache/icache/cpu you name it..
> We simply need to do the same for XDP to prove it worthy and can deliver the required
> rates. Only then, when we have the performance baseline numbers, we can start expanding XDP features
> and defining new use cases and a uniform API, while making sure the performance is kept at it max.

The problem is performance without features is useless.  I can make a
driver that received and drops all packets that goes really fast, but
it isn't too terribly useful and nobody will use it.  I don't want us
locking in on one use case and spending all of our time optimizing for
that when there is a good chance that nobody cares.  For example the
FIB argument Jesper was making is likely completely useless to most
people who will want to use XDP.  While there are some that may want a
router implemented in XDP it is much more likely that they will want
to do VM to VM switching via something more like OVS.

My argument is that we need to figure out what features we need, then
we can focus on performance.  I would much rather deliver a feature
and then improve the performance, than show the performance and not be
able to meet that after adding a feature.  It is all a matter of
setting expectations.

> Yes, there is a down side to this, that currently most of the optimizations and implementations we can do
> are inside the device driver and they are driver dependent, but once we have a clear image
> on how things should work, we can pause and think on how to generalize the approaches
> to all device drivers.

I'm fine with the optimizations being in the device driver, however
feature implementations are another matter.  Historically once
something is in a driver it takes a long time if ever for it to be
generalized out of the driver.  More often than not the driver vendors
prefer to leave their code as-is for a competitive advantage.
Historically the way we deal with this as a community is that if an
interface is likely to be used by more than one device it has to start
out generalized.

>>> Today we already have the 64 packets NAPI budget, but we are not
>>> taking advantage of this. For XDP as long as eBPF always return
>>> XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
>>> (as code fits within the icache) and see huge perf boosts.
>>
>> This makes a lot of assumptions.  First, the budget is up to 64, it
>> isn't always 64.  Second, you say we are "falsely" seeing icache
>> improvements, and I would argue that it isn't false as we are
>> intentionally bypassing most of the stack to perform the drop early.
>> That was kind of the point of all this.  Finally, this completely
>> discounts GRO/LRO which would take care of aggregating the frames and
>> reducing much of this overhead for TCP flows being received over the
>> interface.
>>
>>> The initial principal of bulking/batching packets to amortize per
>>> packet costs.  The next step is just as important: Lookup table sizes
>>> (FIB) kills performance again. The solution is implementing a smart
>>> table lookup scheme that prefetch hash table key-cells and afterwards
>>> prefetch data-cells, based on the RX batch of packets.  Notice VPP
>>> revolves around similar tricks, and why it beats DPDK, and why it
>>> scales with 1Millon routes.
>>
>> This is where things go completely sideways in your argument.  If you
>> want to implement some sort of custom FIB lookup library for XDP be my
>> guest.  If you are talking about hacking on the kernel I would
>> question how this is even related to XDP?  The lookup that is in the
>> kernel is designed to provide the best possible lookup under a number
>> of different conditions.  It is a "jack of all trades, master of none"
>> type of implementation.
>>
>> Also, why should we be focused on FIB?  Seems like this is getting
>> back into routing territory and what I am looking for is uses well
>> beyond just routing.
>>
>>> I hope I've made it very clear where the focus for XDP should be.
>>> This involves implementing what I call RX-stages in the drivers. While
>>> doing that we can figure out the most optimal data structure for
>>> packet batching.
>>
>> Yes Jesper, your point of view is clear.  This is the same agenda you
>> have been pushing for the last several years.  I just don't see how
>> this can be made a priority now for a project where it isn't even
>> necessarily related.  In order for any of this to work the stack needs
>> support for bulk Rx, and we still seem pretty far from that happening.
>>
>>>  I know Saeed is already working on RX-stages for mlx5, and I've tested
>>> the initial version of his patch, and the results are excellent.
>>
>> That is great!  I look forward to seeing it when they push it to net-next.
>>
>> By the way, after looking over the mlx5 driver it seems like there is
>> a bug in the logic.  From what I can tell it is using build_skb to
>> build frames around the page, but it doesn't bother to take care of
>> handling the mappings correctly.  So mlx5 can end up with data
>> corruption when the pages are unmapped.  My advice would be to look at
>> updating the driver to do something like what I did in ixgbe to make
>> use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
>> invalidate any updates made when adding headers or shared info.
>>
>
> hmmm, are you talking about the mlx5 rx page cache ? will take a look at the ixgbe code for sure
> but we didn't experience any issue of the sort, can you shed more light on the issue ?
>
> Thanks,
> -Saeed.

Basically the issue is that there are some architectures where
dma_unmap_page will invalidate the page and cause any data written to
it from the CPU side to be invalidated.  On x86 the only way to
recreate this is to use the kernel parameter "swiotlb=force".
Basically when a page was mapped you couldn't unmap it without running
the risk of invalidating any data you had written to it.  I added a
DMA attribute called DMA_ATTR_SKIP_CPU_SYNC which is meant to prevent
that from taking place on unmap.  It also ends up being a performance
gain on architectures that do this since it avoids looping through
cache lines invalidating them on unmap.

Hope that helps.

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 23:39   ` Jesper Dangaard Brouer
@ 2017-02-21  0:39     ` Alexander Duyck
  0 siblings, 0 replies; 20+ messages in thread
From: Alexander Duyck @ 2017-02-21  0:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, John Fastabend, David Miller, Saeed Mahameed,
	Tom Herbert, netdev, Brenden Blanco

On Mon, Feb 20, 2017 at 3:39 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 20 Feb 2017 12:09:30 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> >
>> > First thing to bring in order for the XDP project:
>> >
>> >   RX batching is missing.
>> >
>> > I don't want to discuss packet page-sizes or multi-port forwarding,
>> > before we have established the most fundamental principal that all
>> > other solution use; RX batching.
>>
>> That is all well and good, but some of us would like to discuss other
>> items as it has a direct impact on our driver implementation and
>> future driver design.  Rx batching really seems tangential to the
>> whole XDP discussion anyway unless you are talking about rewriting the
>> core BPF code and kernel API itself to process multiple frames at a
>> time.
>
> If I could change the BPF XDP program to take/process multiple frames
> at a time, I would, but this is likely too late ABI wise?  As the BPF
> programs are so small, we can simply simulate "bulking" by calling the
> BPF prog in a loop (this is sort of already happening with XDP_DROP and
> XDP_TX as the code path/size is so small) and it is good-enough.
>
>
>> That said, if something seems like it would break the concept you have
>> for Rx batching please bring it up.  What I would like to see is well
>> defined APIs and a usable interface so that I can present XDP to
>> management and they will see the use of it and be willing to let me
>> dedicate developer heads to enabling it on our drivers.
>
> What I'm afraid of is that you/we start to define APIs for multi-port
> XDP forwarding, without supporting batching/bundling, because RX
> batching layer is not ready yet.
>
>
>> > Without building in RX batching, from the beginning/now, the XDP
>> > architecture have lost.  As adding features and capabilities, will
>> > just lead us back to the exact same performance problems as before!
>
>
>> I would argue you have much bigger issues to deal with.  Here is a short list:
>>
>> 1. The Tx code is mostly just a toy.  We need support for more
>>    functional use cases.
>
> XDP_TX do have real-life usage.

Benchmarks don't count.  I don't see how echoing a frame back out on
the port it came in on will have much of an effect other than
confusing a switch.  You need to be able to change values in the
frame, I suppose you can do it on mlx5 but technically you are
violating the DMA API if you do.

> Multi-port TX or forwarding need to be designed right. I would like to
> see that we think further than ifindex'es.   Can a simple vport mapping
> table, that maps vport to ifindex, also be used for mapping a vport to
> a socket?

I'd say you might be getting ahead of yourself.  For a Tx we should be
using a device, for a socket that is another matter entirely.  I would
consider them two very different things and not eligible for using the
same interface.

>> 2.  1 page per packet is costly, and blocks use on the intel drivers,
>> mlx4 (after Eric's patches), and 64K page architectures.
>
> XDP have opened the door to allow us to change the memory model for the
> drivers.  This is needed big time.  The major performance bottleneck
> for networking lies in memory management overhead.  Memory management
> is a key for all the bypass solutions.

Changing the memory model isn't that big of a deal.  Honestly the
Intel drivers took care of this long ago with our page recycling
setup.  I think the DMA API bits I just added push this one step
further by allowing us to avoid the skb->head allocations.  The only
real issue we have to deal with is the sk_buff struct itself since
that thing is an oversized beast.

Also last I knew the bulk allocation stuff still showed little boost
for anything other than the routing use case.  I just don't want to
spend our time working on optimizing code for a use case that ends up
penalizing us when users try to make use of the kernel for actual work
like handling TCP sockets.

> The Intel drivers are blocked, because you have solved the problem
> partly (for short lived packets). The solution is brilliant, no doubt,
> but IMHO it have its limitations. And I don't get why you insist XDP
> should inherit these limitations. The page recycling is tied hard to
> the size of the RX ring, that is a problem.  As both Eric and Tariq
> realized they needed increase the RX ring size to 4096 to get TCP
> performance.  Sharing the page makes in unpredictable when the
> cache-line of struct page, will get cache-line-refcnt bounced, which is
> a problem for the XDP performance target.

I'm not insisting XDP be locked down to supporting the Intel approach.
All I am saying is don't lock us into having to use your approach.
>From what I can tell there is no hard requirement so as long as that
is the case don't try to enforce it as though there is.  Once your
page allocation API is in the kernel we can then revisit this and see
if we want to do it differently.

As far as the cache size, the upper limit using the Intel approach is
the size of the ring.  Historically we have seen very good reuse with
this at just the 512 descriptor size.  One of the reasons I
implemented it the way I did is because on PowerPC and x86 systems the
IOMMU can come into play on every map/unmap call and we took a heavy
penalty for that.  After implementing the page recycling the IOMMU
penalty on Rx was nearly completely wiped out.  At this point the only
real issue in regards to the IOMMU and mapping/unmapping is related to
Tx for us.

>> 3.  Should we support scatter-gather to support 9K jumbo frames
>> instead of allocating order 2 pages?
>
> From the start, we have chosen that enabling result in disabling some
> features.  We explicitly choose not to support jumbo frames.
> XDP buffers need to be keep as simple as possible.

That's fine if we never want this to be anything more than a proof of
concept.  However we should consider that some systems can't afford to
allocate 16K of memory per descriptor for receive.

>> Focusing on Rx batching seems like bike shedding more than anything
>> else.  I would much rather be focused on what the API definitions
>> should be for the drivers and the BPF code rather than focus on the
>> inner workings of the drivers themselves.  Then at that point we can
>> start looking at expanding this out to other drivers and coming up
>> with good test cases to test the functionality.  We really need the
>> interfaces clearly defines so that we can then look at having those
>> pulled into the distros so we have some sort of ABI we can work with
>> in customer environments.
>>
>> Dropping frames is all well and good, but only so useful.  With the
>> addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
>> pages so we could now do encap/decap type workloads.  If we can add
>> support for routing pages between interfaces that gets us close to
>> being able to OVS style demos.  At that point we can then start
>> comparing ourselves to DPDK and FD.io and seeing what we can do to
>> improve performance.
>
> I just hope you/we design interfaces with bundling in mind, as the
> tricks FD.io uses requires that...  We can likely compete with DPDK
> speeds, for toy examples, but for more realistic use-case with larger
> code and large tables, FD.io/VPP can beat us, if we don't think in
> bundling.

Think about this the other way.  What if we go and implement almost no
features and show all this great performance.  Then when we finally
get around to implementing features we lose all the performance
because we didn't take into account the needs of the features.  I
don't want to just write marketing code, I want code that actually
does something useful.

>> > Today we already have the 64 packets NAPI budget, but we are not
>> > taking advantage of this. For XDP as long as eBPF always return
>> > XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
>> > (as code fits within the icache) and see huge perf boosts.
>>
>> This makes a lot of assumptions.  First, the budget is up to 64, it
>> isn't always 64.  Second, you say we are "falsely" seeing icache
>> improvements, and I would argue that it isn't false as we are
>> intentionally bypassing most of the stack to perform the drop early.
>
> By falsely I mean, our toy examples always return XDP_DROP, they don't
> test the case where some packets also go to the netstack.  Once a
> packet travel to the netstack, it will have flushed the eBPF icache
> once it returns, and the eBPF code is reloaded (that was the point).

I fail to see how that is a problem of XDP.  I think you are confusing
XDP and your driver refactor ideas.  We could implement bulk Rx and
never touch a line of XDP code.

With that said, I get your concerns for XDP_TX and needing to bulk the
traffic somehow to get xmit_more type performance, but don't conflate
that with Rx bulking.

>> That was kind of the point of all this.  Finally, this completely
>> discounts GRO/LRO which would take care of aggregating the frames and
>> reducing much of this overhead for TCP flows being received over the
>> interface.
>
> XDP does not hurt GRO, but bundling packets for GRO processing still
> work. XDP does requires to disable HW LRO.  It is a trade off, if you
> want LRO don't load a XDP program.

Right, but the reason for having to disable LRO is the same reason why
you can't really handle jumbo frames very well.  I think we need to
look at implementing scatter-gather as a feature to address that.

>> > The initial principal of bulking/batching packets to amortize per
>> > packet costs.  The next step is just as important: Lookup table sizes
>> > (FIB) kills performance again. The solution is implementing a smart
>> > table lookup scheme that prefetch hash table key-cells and afterwards
>> > prefetch data-cells, based on the RX batch of packets.  Notice VPP
>> > revolves around similar tricks, and why it beats DPDK, and why it
>> > scales with 1Millon routes.
>>
>> This is where things go completely sideways in your argument.  If you
>> want to implement some sort of custom FIB lookup library for XDP be my
>> guest.  If you are talking about hacking on the kernel I would
>> question how this is even related to XDP?  The lookup that is in the
>> kernel is designed to provide the best possible lookup under a number
>> of different conditions.  It is a "jack of all trades, master of none"
>> type of implementation.
>
> I do see your point, that RX bundling can also help the normal
> netstack.  You did some amazing work in optimizing the FIB-trie lookup
> (thanks for that!).  I guess, we could implement FIB lookup with
> prefetching latency hiding based on having a bundle of packets.

Really this would be low priority in my opinion.  There are a few
users out there that might care like Project Calico users but for now
I'm not planning on reworking the FIB code any time soon.

>> Also, why should we be focused on FIB?  Seems like this is getting
>> back into routing territory and what I am looking for is uses well
>> beyond just routing.
>
> I'm not focused on FIB at all... I just pointed out that I would to get
> an XDP forward arch that support sending packet round via "vports" for
> maximum flexibility.

Okay.

>> > I hope I've made it very clear where the focus for XDP should be.
>> > This involves implementing what I call RX-stages in the drivers. While
>> > doing that we can figure out the most optimal data structure for
>> > packet batching.
>>
>> Yes Jesper, your point of view is clear.  This is the same agenda you
>> have been pushing for the last several years.  I just don't see how
>> this can be made a priority now for a project where it isn't even
>> necessarily related.  In order for any of this to work the stack needs
>> support for bulk Rx, and we still seem pretty far from that happening.
>
> I'm thinking we can implement this kind of bundling for the XDP use-case
> first, as most of this can be contained within the driver.  After we
> get experience with what works, we can take the next step and also
> support bulking towards the netstack.

Like you said, for the Tx and drop use cases you essentially already
have the effect of bulking.  You would be better off focusing on the
network stack and how to get bulk frames out of the driver and into
there before you start trying to make XDP do the bulking for you.

>> >  I know Saeed is already working on RX-stages for mlx5, and I've tested
>> > the initial version of his patch, and the results are excellent.
>>
>> That is great!  I look forward to seeing it when they push it to net-next.
>>
>> By the way, after looking over the mlx5 driver it seems like there is
>> a bug in the logic.  From what I can tell it is using build_skb to
>> build frames around the page, but it doesn't bother to take care of
>> handling the mappings correctly.  So mlx5 can end up with data
>> corruption when the pages are unmapped.  My advice would be to look at
>> updating the driver to do something like what I did in ixgbe to make
>> use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
>> invalidate any updates made when adding headers or shared info.
>
> I guess, Saeed would need to look at this!

Already gave him a few pointers where to look.

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 22:57   ` Saeed Mahameed
  2017-02-20 23:40     ` Alexander Duyck
@ 2017-02-21 16:35     ` Tom Herbert
  2017-02-21 16:46       ` David Miller
  2017-02-21 22:29       ` Saeed Mahameed
  1 sibling, 2 replies; 20+ messages in thread
From: Tom Herbert @ 2017-02-21 16:35 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexander Duyck, Jesper Dangaard Brouer, Alexei Starovoitov,
	John Fastabend, David Miller, netdev, Brenden Blanco

On Mon, Feb 20, 2017 at 2:57 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>
> On 02/20/2017 10:09 PM, Alexander Duyck wrote:
>> On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>>
>>> First thing to bring in order for the XDP project:
>>>
>>>   RX batching is missing.
>>>
>>> I don't want to discuss packet page-sizes or multi-port forwarding,
>>> before we have established the most fundamental principal that all
>>> other solution use; RX batching.
>>
>> That is all well and good, but some of us would like to discuss other
>> items as it has a direct impact on our driver implementation and
>> future driver design.  Rx batching really seems tangential to the
>> whole XDP discussion anyway unless you are talking about rewriting the
>> core BPF code and kernel API itself to process multiple frames at a
>> time.
>>
>> That said, if something seems like it would break the concept you have
>> for Rx batching please bring it up.  What I would like to see is well
>> defined APIs and a usable interface so that I can present XDP to
>> management and they will see the use of it and be willing to let me
>> dedicate developer heads to enabling it on our drivers.
>>
>>> Without building in RX batching, from the beginning/now, the XDP
>>> architecture have lost.  As adding features and capabilities, will
>>> just lead us back to the exact same performance problems as before!
>>
>> I would argue you have much bigger issues to deal with.  Here is a short list:
>> 1.  The Tx code is mostly just a toy.  We need support for more
>> functional use cases.
>> 2.  1 page per packet is costly, and blocks use on the intel drivers,
>> mlx4 (after Eric's patches), and 64K page architectures.
>> 3.  Should we support scatter-gather to support 9K jumbo frames
>> instead of allocating order 2 pages?
>>
>> Focusing on Rx batching seems like bike shedding more than anything
>> else.  I would much rather be focused on what the API definitions
>> should be for the drivers and the BPF code rather than focus on the
>> inner workings of the drivers themselves.  Then at that point we can
>> start looking at expanding this out to other drivers and coming up
>> with good test cases to test the functionality.  We really need the
>> interfaces clearly defines so that we can then look at having those
>> pulled into the distros so we have some sort of ABI we can work with
>> in customer environments.
>>
>> Dropping frames is all well and good, but only so useful.  With the
>> addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
>> pages so we could now do encap/decap type workloads.  If we can add
>> support for routing pages between interfaces that gets us close to
>> being able to OVS style demos.  At that point we can then start
>> comparing ourselves to DPDK and FD.io and seeing what we can do to
>> improve performance.
>>
>
> Well, although I think Jesper is a little bit exaggerating ;) I guess he has a point
> and i am on his side on this discussion. you see, if we define the APIs and ABIs now
> and they turn out to be a bottleneck for the whole XDP arch performance, at that
> point it will be too late to compare XDP to DPDK and other kernel bypass solutions.
>
> What we need to do is to bring XDP to a state where it performs at least the same as other
> kernel bypass solutions. I know that the DPDK team here at mellanox spent years working
> on DPDK performance, squeezing every bit out of the code/dcache/icache/cpu you name it..
> We simply need to do the same for XDP to prove it worthy and can deliver the required
> rates. Only then, when we have the performance baseline numbers, we can start expanding XDP features
> and defining new use cases and a uniform API, while making sure the performance is kept at it max.
>
> Yes, there is a down side to this, that currently most of the optimizations and implementations we can do
> are inside the device driver and they are driver dependent, but once we have a clear image
> on how things should work, we can pause and think on how to generalize the approaches
> to all device drivers.
>
I don't agree with this approach. We only have a handful of drivers
that support XDP and already it is obvious that XDP is invasive in the
critical path and has created maintainence issues. XDP is lacking a
general API which means that drivers have to do more redundant
operations, and when it comes time to set such an API (as my patch set
is trying to deal) we need to retrofit it and deal with this
complexity in each driver. I agree that super great XDP performance is
a goal, but it's not the only goal-- we still need to provide stable,
maintainable, good performance drivers for everyone.

We already have good APIs for similar datapath functionality (GRO,
GSO, xmit_more, etc.), and I don't see why XDP is so special that we
can't come up with a reasonable API for batching and implement it. A
good API for XDP will move as much of the logic and decision making
out of the drivers as possible and should be amenable to the typical
driver processing flow. If the API becomes the bottleneck then we fix
the API, if we really can't fix it and we need another 1000 LOC in the
critical datapath of drivers then maybe we do that but only by making
an explicit tradeoff between XDP performance and complexity.

Tom

>>> Today we already have the 64 packets NAPI budget, but we are not
>>> taking advantage of this. For XDP as long as eBPF always return
>>> XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
>>> (as code fits within the icache) and see huge perf boosts.
>>
>> This makes a lot of assumptions.  First, the budget is up to 64, it
>> isn't always 64.  Second, you say we are "falsely" seeing icache
>> improvements, and I would argue that it isn't false as we are
>> intentionally bypassing most of the stack to perform the drop early.
>> That was kind of the point of all this.  Finally, this completely
>> discounts GRO/LRO which would take care of aggregating the frames and
>> reducing much of this overhead for TCP flows being received over the
>> interface.
>>
>>> The initial principal of bulking/batching packets to amortize per
>>> packet costs.  The next step is just as important: Lookup table sizes
>>> (FIB) kills performance again. The solution is implementing a smart
>>> table lookup scheme that prefetch hash table key-cells and afterwards
>>> prefetch data-cells, based on the RX batch of packets.  Notice VPP
>>> revolves around similar tricks, and why it beats DPDK, and why it
>>> scales with 1Millon routes.
>>
>> This is where things go completely sideways in your argument.  If you
>> want to implement some sort of custom FIB lookup library for XDP be my
>> guest.  If you are talking about hacking on the kernel I would
>> question how this is even related to XDP?  The lookup that is in the
>> kernel is designed to provide the best possible lookup under a number
>> of different conditions.  It is a "jack of all trades, master of none"
>> type of implementation.
>>
>> Also, why should we be focused on FIB?  Seems like this is getting
>> back into routing territory and what I am looking for is uses well
>> beyond just routing.
>>
>>> I hope I've made it very clear where the focus for XDP should be.
>>> This involves implementing what I call RX-stages in the drivers. While
>>> doing that we can figure out the most optimal data structure for
>>> packet batching.
>>
>> Yes Jesper, your point of view is clear.  This is the same agenda you
>> have been pushing for the last several years.  I just don't see how
>> this can be made a priority now for a project where it isn't even
>> necessarily related.  In order for any of this to work the stack needs
>> support for bulk Rx, and we still seem pretty far from that happening.
>>
>>>  I know Saeed is already working on RX-stages for mlx5, and I've tested
>>> the initial version of his patch, and the results are excellent.
>>
>> That is great!  I look forward to seeing it when they push it to net-next.
>>
>> By the way, after looking over the mlx5 driver it seems like there is
>> a bug in the logic.  From what I can tell it is using build_skb to
>> build frames around the page, but it doesn't bother to take care of
>> handling the mappings correctly.  So mlx5 can end up with data
>> corruption when the pages are unmapped.  My advice would be to look at
>> updating the driver to do something like what I did in ixgbe to make
>> use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
>> invalidate any updates made when adding headers or shared info.
>>
>
> hmmm, are you talking about the mlx5 rx page cache ? will take a look at the ixgbe code for sure
> but we didn't experience any issue of the sort, can you shed more light on the issue ?
>
> Thanks,
> -Saeed.
>
>> - Alex
>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 16:35     ` Tom Herbert
@ 2017-02-21 16:46       ` David Miller
  2017-02-21 17:40         ` Tom Herbert
  2017-02-21 22:29       ` Saeed Mahameed
  1 sibling, 1 reply; 20+ messages in thread
From: David Miller @ 2017-02-21 16:46 UTC (permalink / raw)
  To: tom
  Cc: saeedm, alexander.duyck, brouer, alexei.starovoitov,
	john.fastabend, netdev, bblanco

From: Tom Herbert <tom@herbertland.com>
Date: Tue, 21 Feb 2017 08:35:41 -0800

> We already have good APIs for similar datapath functionality (GRO,
> GSO, xmit_more, etc.), and I don't see why XDP is so special that we
> can't come up with a reasonable API for batching and implement it.

What you are missing is that it wasn't always this way.

The initial TSO support was a hodge-podge of weird driver APIs and
simple heuristics thrown all over the place.  It was ugly but worked
and allowed us to experiment.  We had to adjust driver internals
a lot until on the way towards getting things how they are today.

This is the natural course of things, so please don't suggest that XDP
shouldn't evolve in the same way.

I think we really need to be fast and loose right now and only try to
constrict and perfect the API after some of this initial activity has
died down.

Yes it's more work for the brave drivers that add XDP support, but
unfortunately that's how we figure out what's really needed and works
in the long term.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 16:46       ` David Miller
@ 2017-02-21 17:40         ` Tom Herbert
  2017-02-21 18:11           ` David Miller
  2017-02-21 22:40           ` Saeed Mahameed
  0 siblings, 2 replies; 20+ messages in thread
From: Tom Herbert @ 2017-02-21 17:40 UTC (permalink / raw)
  To: David Miller
  Cc: Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Alexei Starovoitov, john fastabend,
	Linux Kernel Network Developers, Brenden Blanco

On Tue, Feb 21, 2017 at 8:46 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Tue, 21 Feb 2017 08:35:41 -0800
>
>> We already have good APIs for similar datapath functionality (GRO,
>> GSO, xmit_more, etc.), and I don't see why XDP is so special that we
>> can't come up with a reasonable API for batching and implement it.
>
> What you are missing is that it wasn't always this way.
>
> The initial TSO support was a hodge-podge of weird driver APIs and
> simple heuristics thrown all over the place.  It was ugly but worked
> and allowed us to experiment.  We had to adjust driver internals
> a lot until on the way towards getting things how they are today.
>
> This is the natural course of things, so please don't suggest that XDP
> shouldn't evolve in the same way.
>
> I think we really need to be fast and loose right now and only try to
> constrict and perfect the API after some of this initial activity has
> died down.
>
> Yes it's more work for the brave drivers that add XDP support, but
> unfortunately that's how we figure out what's really needed and works
> in the long term.

I'd be more supportive of this line of thinking if we (e.g. FB) didn't
have to spend the majority time over the past few months trying to
deal with all the complexity being thrown into drivers for all these
new features such as XDP. Case in point, Mellanox drivers are
completely non-modular and have a horrible directory structure. They
tried to fix, this but the patch set was rejected because it would
break people trying to do backports. That's a fair argument, but the
lesson I gather from that is that we should put more time in up front
thinking about how to structure code the right way instead of just
throwing it in and trying to deal with the consequences later.

Tom

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 17:40         ` Tom Herbert
@ 2017-02-21 18:11           ` David Miller
  2017-02-21 18:23             ` Tom Herbert
  2017-02-21 22:40           ` Saeed Mahameed
  1 sibling, 1 reply; 20+ messages in thread
From: David Miller @ 2017-02-21 18:11 UTC (permalink / raw)
  To: tom
  Cc: saeedm, alexander.duyck, brouer, alexei.starovoitov,
	john.fastabend, netdev, bblanco

From: Tom Herbert <tom@herbertland.com>
Date: Tue, 21 Feb 2017 09:40:17 -0800

> I'd be more supportive of this line of thinking if we (e.g. FB) didn't
> have to spend the majority time over the past few months trying to
> deal with all the complexity being thrown into drivers for all these
> new features such as XDP. Case in point, Mellanox drivers are
> completely non-modular and have a horrible directory structure. They
> tried to fix, this but the patch set was rejected because it would
> break people trying to do backports. That's a fair argument, but the
> lesson I gather from that is that we should put more time in up front
> thinking about how to structure code the right way instead of just
> throwing it in and trying to deal with the consequences later.

Hey aren't you guys suffering from this because you're stuck on an
older kernel for one reason or another?  Don't we constantly give
the Android guys a hard time about this? ;-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 18:11           ` David Miller
@ 2017-02-21 18:23             ` Tom Herbert
  0 siblings, 0 replies; 20+ messages in thread
From: Tom Herbert @ 2017-02-21 18:23 UTC (permalink / raw)
  To: David Miller
  Cc: Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Alexei Starovoitov, john fastabend,
	Linux Kernel Network Developers, Brenden Blanco

On Tue, Feb 21, 2017 at 10:11 AM, David Miller <davem@davemloft.net> wrote:
> From: Tom Herbert <tom@herbertland.com>
> Date: Tue, 21 Feb 2017 09:40:17 -0800
>
>> I'd be more supportive of this line of thinking if we (e.g. FB) didn't
>> have to spend the majority time over the past few months trying to
>> deal with all the complexity being thrown into drivers for all these
>> new features such as XDP. Case in point, Mellanox drivers are
>> completely non-modular and have a horrible directory structure. They
>> tried to fix, this but the patch set was rejected because it would
>> break people trying to do backports. That's a fair argument, but the
>> lesson I gather from that is that we should put more time in up front
>> thinking about how to structure code the right way instead of just
>> throwing it in and trying to deal with the consequences later.
>
> Hey aren't you guys suffering from this because you're stuck on an
> older kernel for one reason or another?  Don't we constantly give
> the Android guys a hard time about this? ;-)

Ha, isn't "everyone is stuck on an older kernel for one reason or
another" a metaphor for life? ;-) Backports/rebases/maintainence/bug
fixing/testing are the unglamorous realities of kernel engineers
trying to deploy in production!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 16:35     ` Tom Herbert
  2017-02-21 16:46       ` David Miller
@ 2017-02-21 22:29       ` Saeed Mahameed
  2017-02-21 22:54         ` Tom Herbert
  1 sibling, 1 reply; 20+ messages in thread
From: Saeed Mahameed @ 2017-02-21 22:29 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Alexei Starovoitov, John Fastabend, David Miller, netdev,
	Brenden Blanco

On Tue, Feb 21, 2017 at 6:35 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Feb 20, 2017 at 2:57 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>> Well, although I think Jesper is a little bit exaggerating ;) I guess he has a point
>> and i am on his side on this discussion. you see, if we define the APIs and ABIs now
>> and they turn out to be a bottleneck for the whole XDP arch performance, at that
>> point it will be too late to compare XDP to DPDK and other kernel bypass solutions.
>>
>> What we need to do is to bring XDP to a state where it performs at least the same as other
>> kernel bypass solutions. I know that the DPDK team here at mellanox spent years working
>> on DPDK performance, squeezing every bit out of the code/dcache/icache/cpu you name it..
>> We simply need to do the same for XDP to prove it worthy and can deliver the required
>> rates. Only then, when we have the performance baseline numbers, we can start expanding XDP features
>> and defining new use cases and a uniform API, while making sure the performance is kept at it max.
>>
>> Yes, there is a down side to this, that currently most of the optimizations and implementations we can do
>> are inside the device driver and they are driver dependent, but once we have a clear image
>> on how things should work, we can pause and think on how to generalize the approaches
>> to all device drivers.
>>
> I don't agree with this approach. We only have a handful of drivers
> that support XDP and already it is obvious that XDP is invasive in the
> critical path and has created maintainence issues. XDP is lacking a
> general API which means that drivers have to do more redundant
> operations, and when it comes time to set such an API (as my patch set
> is trying to deal) we need to retrofit it and deal with this

For control path and XDP program hooks management i completely support
your work,
but as Dave puts it, we need to have some freedom at least in the
first stages in the interaction between driver RX path and XDP
programs packet flow, as the flow might change a couple of times until
we settle down on an optimal approach.

> complexity in each driver. I agree that super great XDP performance is
> a goal, but it's not the only goal-- we still need to provide stable,
> maintainable, good performance drivers for everyone.
>

The only complexity XDP is adding to the drivers is the constrains on
RX memory management and memory model, calling the XDP program itself
and handling the  action is really a simple thing once you have the
correct memory model.

Who knows! maybe someday XDP will define one unified RX API for all
drivers and it even will handle normal stack delivery it self :).

for the long long term I dream of a driver passing page fragments +
"on the side offloads (if any)" to the stack instead of fat SKBs, and
in return it will get the same page back to be recycled into RX buffer
or a replacement new one.
good performance should really come from the stack/XDP/upper layers,
not form the device drivers.

but for the short term we will need to continue experimenting with
what we have and optimize it as much as possible with no constrains.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 17:40         ` Tom Herbert
  2017-02-21 18:11           ` David Miller
@ 2017-02-21 22:40           ` Saeed Mahameed
  2017-02-21 23:04             ` Tom Herbert
  1 sibling, 1 reply; 20+ messages in thread
From: Saeed Mahameed @ 2017-02-21 22:40 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Saeed Mahameed, Alexander Duyck,
	Jesper Dangaard Brouer, Alexei Starovoitov, john fastabend,
	Linux Kernel Network Developers, Brenden Blanco

On Tue, Feb 21, 2017 at 7:40 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Feb 21, 2017 at 8:46 AM, David Miller <davem@davemloft.net> wrote:
>> From: Tom Herbert <tom@herbertland.com>
>> Date: Tue, 21 Feb 2017 08:35:41 -0800
>>
>>> We already have good APIs for similar datapath functionality (GRO,
>>> GSO, xmit_more, etc.), and I don't see why XDP is so special that we
>>> can't come up with a reasonable API for batching and implement it.
>>
>> What you are missing is that it wasn't always this way.
>>
>> The initial TSO support was a hodge-podge of weird driver APIs and
>> simple heuristics thrown all over the place.  It was ugly but worked
>> and allowed us to experiment.  We had to adjust driver internals
>> a lot until on the way towards getting things how they are today.
>>
>> This is the natural course of things, so please don't suggest that XDP
>> shouldn't evolve in the same way.
>>
>> I think we really need to be fast and loose right now and only try to
>> constrict and perfect the API after some of this initial activity has
>> died down.
>>
>> Yes it's more work for the brave drivers that add XDP support, but
>> unfortunately that's how we figure out what's really needed and works
>> in the long term.
>
> I'd be more supportive of this line of thinking if we (e.g. FB) didn't
> have to spend the majority time over the past few months trying to
> deal with all the complexity being thrown into drivers for all these
> new features such as XDP. Case in point, Mellanox drivers are
> completely non-modular and have a horrible directory structure. They
> tried to fix, this but the patch set was rejected because it would
> break people trying to do backports. That's a fair argument, but the
> lesson I gather from that is that we should put more time in up front
> thinking about how to structure code the right way instead of just
> throwing it in and trying to deal with the consequences later.
>

I also gathered the same lesson :) I will do my best to separate
between regular RX path and XDP-enabled RX path as much as possible in
my next RX staging/bulking patches as the current mlx5 design allows
me to do so with a little code duplication.

> Tom

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 22:29       ` Saeed Mahameed
@ 2017-02-21 22:54         ` Tom Herbert
  2017-02-22  9:43           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 20+ messages in thread
From: Tom Herbert @ 2017-02-21 22:54 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, Alexander Duyck, Jesper Dangaard Brouer,
	Alexei Starovoitov, John Fastabend, David Miller, netdev,
	Brenden Blanco

On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Tue, Feb 21, 2017 at 6:35 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Mon, Feb 20, 2017 at 2:57 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
>>>
>>> Well, although I think Jesper is a little bit exaggerating ;) I guess he has a point
>>> and i am on his side on this discussion. you see, if we define the APIs and ABIs now
>>> and they turn out to be a bottleneck for the whole XDP arch performance, at that
>>> point it will be too late to compare XDP to DPDK and other kernel bypass solutions.
>>>
>>> What we need to do is to bring XDP to a state where it performs at least the same as other
>>> kernel bypass solutions. I know that the DPDK team here at mellanox spent years working
>>> on DPDK performance, squeezing every bit out of the code/dcache/icache/cpu you name it..
>>> We simply need to do the same for XDP to prove it worthy and can deliver the required
>>> rates. Only then, when we have the performance baseline numbers, we can start expanding XDP features
>>> and defining new use cases and a uniform API, while making sure the performance is kept at it max.
>>>
>>> Yes, there is a down side to this, that currently most of the optimizations and implementations we can do
>>> are inside the device driver and they are driver dependent, but once we have a clear image
>>> on how things should work, we can pause and think on how to generalize the approaches
>>> to all device drivers.
>>>
>> I don't agree with this approach. We only have a handful of drivers
>> that support XDP and already it is obvious that XDP is invasive in the
>> critical path and has created maintainence issues. XDP is lacking a
>> general API which means that drivers have to do more redundant
>> operations, and when it comes time to set such an API (as my patch set
>> is trying to deal) we need to retrofit it and deal with this
>
> For control path and XDP program hooks management i completely support
> your work,
> but as Dave puts it, we need to have some freedom at least in the
> first stages in the interaction between driver RX path and XDP
> programs packet flow, as the flow might change a couple of times until
> we settle down on an optimal approach.
>
>> complexity in each driver. I agree that super great XDP performance is
>> a goal, but it's not the only goal-- we still need to provide stable,
>> maintainable, good performance drivers for everyone.
>>
>
> The only complexity XDP is adding to the drivers is the constrains on
> RX memory management and memory model, calling the XDP program itself
> and handling the  action is really a simple thing once you have the
> correct memory model.
>
> Who knows! maybe someday XDP will define one unified RX API for all
> drivers and it even will handle normal stack delivery it self :).
>
That's exactly the point and what we need for TXDP. I'm missing why
doing this is such rocket science other than the fact that all these
drivers are vastly different and changing the existing API is
unpleasant. The only functional complexity I see in creating a generic
batching interface is handling return codes asynchronously. This is
entirely feasible though...

> for the long long term I dream of a driver passing page fragments +
> "on the side offloads (if any)" to the stack instead of fat SKBs, and
> in return it will get the same page back to be recycled into RX buffer
> or a replacement new one.
> good performance should really come from the stack/XDP/upper layers,
> not form the device drivers.
>
> but for the short term we will need to continue experimenting with
> what we have and optimize it as much as possible with no constrains.

I'm all for experimentation, opposed to make a mess of drivers any
more than they already are.

Tom

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 22:40           ` Saeed Mahameed
@ 2017-02-21 23:04             ` Tom Herbert
  0 siblings, 0 replies; 20+ messages in thread
From: Tom Herbert @ 2017-02-21 23:04 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David Miller, Saeed Mahameed, Alexander Duyck,
	Jesper Dangaard Brouer, Alexei Starovoitov, john fastabend,
	Linux Kernel Network Developers, Brenden Blanco

On Tue, Feb 21, 2017 at 2:40 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Tue, Feb 21, 2017 at 7:40 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Feb 21, 2017 at 8:46 AM, David Miller <davem@davemloft.net> wrote:
>>> From: Tom Herbert <tom@herbertland.com>
>>> Date: Tue, 21 Feb 2017 08:35:41 -0800
>>>
>>>> We already have good APIs for similar datapath functionality (GRO,
>>>> GSO, xmit_more, etc.), and I don't see why XDP is so special that we
>>>> can't come up with a reasonable API for batching and implement it.
>>>
>>> What you are missing is that it wasn't always this way.
>>>
>>> The initial TSO support was a hodge-podge of weird driver APIs and
>>> simple heuristics thrown all over the place.  It was ugly but worked
>>> and allowed us to experiment.  We had to adjust driver internals
>>> a lot until on the way towards getting things how they are today.
>>>
>>> This is the natural course of things, so please don't suggest that XDP
>>> shouldn't evolve in the same way.
>>>
>>> I think we really need to be fast and loose right now and only try to
>>> constrict and perfect the API after some of this initial activity has
>>> died down.
>>>
>>> Yes it's more work for the brave drivers that add XDP support, but
>>> unfortunately that's how we figure out what's really needed and works
>>> in the long term.
>>
>> I'd be more supportive of this line of thinking if we (e.g. FB) didn't
>> have to spend the majority time over the past few months trying to
>> deal with all the complexity being thrown into drivers for all these
>> new features such as XDP. Case in point, Mellanox drivers are
>> completely non-modular and have a horrible directory structure. They
>> tried to fix, this but the patch set was rejected because it would
>> break people trying to do backports. That's a fair argument, but the
>> lesson I gather from that is that we should put more time in up front
>> thinking about how to structure code the right way instead of just
>> throwing it in and trying to deal with the consequences later.
>>
>
> I also gathered the same lesson :) I will do my best to separate
> between regular RX path and XDP-enabled RX path as much as possible in
> my next RX staging/bulking patches as the current mlx5 design allows
> me to do so with a little code duplication.
>
Separating the data paths is a major part of the problem. That means
we now have parallel paths in the system which kind of do the same
thing but in vastly different ways each of which has its own bugs and
idiosyncrasies. This makes things harder to debug and maintain (e.g.
dealing with the the striding vs. non-striding split in mlx5 was
particularly painful). Please work to unify the data path and minimize
changes in it by pushing complexity into the stack to get the required
functionality.

Tom

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-20 23:40     ` Alexander Duyck
@ 2017-02-21 23:08       ` Saeed Mahameed
  0 siblings, 0 replies; 20+ messages in thread
From: Saeed Mahameed @ 2017-02-21 23:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Saeed Mahameed, Jesper Dangaard Brouer, Alexei Starovoitov,
	John Fastabend, David Miller, Tom Herbert, netdev,
	Brenden Blanco

On Tue, Feb 21, 2017 at 1:40 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Feb 20, 2017 at 2:57 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>>
>> On 02/20/2017 10:09 PM, Alexander Duyck wrote:
>>> On Mon, Feb 20, 2017 at 2:13 AM, Jesper Dangaard Brouer
>>> <brouer@redhat.com> wrote:
>>>>
>>>> First thing to bring in order for the XDP project:
>>>>
>>>>   RX batching is missing.
>>>>
>>>> I don't want to discuss packet page-sizes or multi-port forwarding,
>>>> before we have established the most fundamental principal that all
>>>> other solution use; RX batching.
>>>
>>> That is all well and good, but some of us would like to discuss other
>>> items as it has a direct impact on our driver implementation and
>>> future driver design.  Rx batching really seems tangential to the
>>> whole XDP discussion anyway unless you are talking about rewriting the
>>> core BPF code and kernel API itself to process multiple frames at a
>>> time.
>>>
>>> That said, if something seems like it would break the concept you have
>>> for Rx batching please bring it up.  What I would like to see is well
>>> defined APIs and a usable interface so that I can present XDP to
>>> management and they will see the use of it and be willing to let me
>>> dedicate developer heads to enabling it on our drivers.
>>>
>>>> Without building in RX batching, from the beginning/now, the XDP
>>>> architecture have lost.  As adding features and capabilities, will
>>>> just lead us back to the exact same performance problems as before!
>>>
>>> I would argue you have much bigger issues to deal with.  Here is a short list:
>>> 1.  The Tx code is mostly just a toy.  We need support for more
>>> functional use cases.
>>> 2.  1 page per packet is costly, and blocks use on the intel drivers,
>>> mlx4 (after Eric's patches), and 64K page architectures.
>>> 3.  Should we support scatter-gather to support 9K jumbo frames
>>> instead of allocating order 2 pages?
>>>
>>> Focusing on Rx batching seems like bike shedding more than anything
>>> else.  I would much rather be focused on what the API definitions
>>> should be for the drivers and the BPF code rather than focus on the
>>> inner workings of the drivers themselves.  Then at that point we can
>>> start looking at expanding this out to other drivers and coming up
>>> with good test cases to test the functionality.  We really need the
>>> interfaces clearly defines so that we can then look at having those
>>> pulled into the distros so we have some sort of ABI we can work with
>>> in customer environments.
>>>
>>> Dropping frames is all well and good, but only so useful.  With the
>>> addition of DMA_ATTR_SKIP_CPU_SYNC we should be able to do writable
>>> pages so we could now do encap/decap type workloads.  If we can add
>>> support for routing pages between interfaces that gets us close to
>>> being able to OVS style demos.  At that point we can then start
>>> comparing ourselves to DPDK and FD.io and seeing what we can do to
>>> improve performance.
>>>
>>
>> Well, although I think Jesper is a little bit exaggerating ;) I guess he has a point
>> and i am on his side on this discussion. you see, if we define the APIs and ABIs now
>> and they turn out to be a bottleneck for the whole XDP arch performance, at that
>> point it will be too late to compare XDP to DPDK and other kernel bypass solutions.
>
> Yes, but at the same time we cannot hold due to decision paralysis.
> We should be moving forward, not holding waiting on things that may or
> may not get done.

I am not saying we should wait, i am saying we should work in all
frontiers, but keep in mind that the whole idea of XDP is max
performance with minimal kernel/stack overhead.

>
>> What we need to do is to bring XDP to a state where it performs at least the same as other
>> kernel bypass solutions. I know that the DPDK team here at mellanox spent years working
>> on DPDK performance, squeezing every bit out of the code/dcache/icache/cpu you name it..
>> We simply need to do the same for XDP to prove it worthy and can deliver the required
>> rates. Only then, when we have the performance baseline numbers, we can start expanding XDP features
>> and defining new use cases and a uniform API, while making sure the performance is kept at it max.
>
> The problem is performance without features is useless.  I can make a

XDP without performance is useless :).

> driver that received and drops all packets that goes really fast, but
> it isn't too terribly useful and nobody will use it.  I don't want us
> locking in on one use case and spending all of our time optimizing for
> that when there is a good chance that nobody cares.  For example the
> FIB argument Jesper was making is likely completely useless to most
> people who will want to use XDP.  While there are some that may want a
> router implemented in XDP it is much more likely that they will want
> to do VM to VM switching via something more like OVS.
>
> My argument is that we need to figure out what features we need, then
> we can focus on performance.  I would much rather deliver a feature
> and then improve the performance, than show the performance and not be
> able to meet that after adding a feature.  It is all a matter of
> setting expectations.
>

I think the use cases and the features list are already clear, XDP
should be simple, fast and flexible to implement most of the use cases
you mentioned above (firewall, routing, injecting, inspecting,
encap,decap) you name it :), and it should be up to the program the
user defines, really.
It shouldn't be different form the current kernel solutions and even
the stack itself (at least this is what i think of XDP), others might
disagree.

i know we are not quite there yet, but remember if XDP is not as fast
as other competitors, what is the point of doing it at all ?


>> Yes, there is a down side to this, that currently most of the optimizations and implementations we can do
>> are inside the device driver and they are driver dependent, but once we have a clear image
>> on how things should work, we can pause and think on how to generalize the approaches
>> to all device drivers.
>
> I'm fine with the optimizations being in the device driver, however
> feature implementations are another matter.  Historically once
> something is in a driver it takes a long time if ever for it to be
> generalized out of the driver.  More often than not the driver vendors
> prefer to leave their code as-is for a competitive advantage.
> Historically the way we deal with this as a community is that if an
> interface is likely to be used by more than one device it has to start
> out generalized.

point taken, but How do you generalized something that isn't fully cooked yet ?

>
>>>> Today we already have the 64 packets NAPI budget, but we are not
>>>> taking advantage of this. For XDP as long as eBPF always return
>>>> XDP_DROP or XDP_TX, then we (falsely) experience the effect of bulking
>>>> (as code fits within the icache) and see huge perf boosts.
>>>
>>> This makes a lot of assumptions.  First, the budget is up to 64, it
>>> isn't always 64.  Second, you say we are "falsely" seeing icache
>>> improvements, and I would argue that it isn't false as we are
>>> intentionally bypassing most of the stack to perform the drop early.
>>> That was kind of the point of all this.  Finally, this completely
>>> discounts GRO/LRO which would take care of aggregating the frames and
>>> reducing much of this overhead for TCP flows being received over the
>>> interface.
>>>
>>>> The initial principal of bulking/batching packets to amortize per
>>>> packet costs.  The next step is just as important: Lookup table sizes
>>>> (FIB) kills performance again. The solution is implementing a smart
>>>> table lookup scheme that prefetch hash table key-cells and afterwards
>>>> prefetch data-cells, based on the RX batch of packets.  Notice VPP
>>>> revolves around similar tricks, and why it beats DPDK, and why it
>>>> scales with 1Millon routes.
>>>
>>> This is where things go completely sideways in your argument.  If you
>>> want to implement some sort of custom FIB lookup library for XDP be my
>>> guest.  If you are talking about hacking on the kernel I would
>>> question how this is even related to XDP?  The lookup that is in the
>>> kernel is designed to provide the best possible lookup under a number
>>> of different conditions.  It is a "jack of all trades, master of none"
>>> type of implementation.
>>>
>>> Also, why should we be focused on FIB?  Seems like this is getting
>>> back into routing territory and what I am looking for is uses well
>>> beyond just routing.
>>>
>>>> I hope I've made it very clear where the focus for XDP should be.
>>>> This involves implementing what I call RX-stages in the drivers. While
>>>> doing that we can figure out the most optimal data structure for
>>>> packet batching.
>>>
>>> Yes Jesper, your point of view is clear.  This is the same agenda you
>>> have been pushing for the last several years.  I just don't see how
>>> this can be made a priority now for a project where it isn't even
>>> necessarily related.  In order for any of this to work the stack needs
>>> support for bulk Rx, and we still seem pretty far from that happening.
>>>
>>>>  I know Saeed is already working on RX-stages for mlx5, and I've tested
>>>> the initial version of his patch, and the results are excellent.
>>>
>>> That is great!  I look forward to seeing it when they push it to net-next.
>>>
>>> By the way, after looking over the mlx5 driver it seems like there is
>>> a bug in the logic.  From what I can tell it is using build_skb to
>>> build frames around the page, but it doesn't bother to take care of
>>> handling the mappings correctly.  So mlx5 can end up with data
>>> corruption when the pages are unmapped.  My advice would be to look at
>>> updating the driver to do something like what I did in ixgbe to make
>>> use of the DMA_ATTR_SKIP_CPU_SYNC DMA attribute so that it won't
>>> invalidate any updates made when adding headers or shared info.
>>>
>>
>> hmmm, are you talking about the mlx5 rx page cache ? will take a look at the ixgbe code for sure
>> but we didn't experience any issue of the sort, can you shed more light on the issue ?
>>
>> Thanks,
>> -Saeed.
>
> Basically the issue is that there are some architectures where
> dma_unmap_page will invalidate the page and cause any data written to
> it from the CPU side to be invalidated.  On x86 the only way to
> recreate this is to use the kernel parameter "swiotlb=force".
> Basically when a page was mapped you couldn't unmap it without running
> the risk of invalidating any data you had written to it.  I added a
> DMA attribute called DMA_ATTR_SKIP_CPU_SYNC which is meant to prevent
> that from taking place on unmap.  It also ends up being a performance
> gain on architectures that do this since it avoids looping through
> cache lines invalidating them on unmap.
>
> Hope that helps.
>

Thanks!! will take a look tomorrow.

> - Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-21 22:54         ` Tom Herbert
@ 2017-02-22  9:43           ` Jesper Dangaard Brouer
  2017-02-22 17:22             ` Tom Herbert
  0 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-22  9:43 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, Alexander Duyck,
	Alexei Starovoitov, John Fastabend, David Miller, netdev,
	Brenden Blanco, brouer


On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
[...]
> > The only complexity XDP is adding to the drivers is the constrains on
> > RX memory management and memory model, calling the XDP program itself
> > and handling the  action is really a simple thing once you have the
> > correct memory model.

Exactly, that is why I've been looking at introducing a generic
facility for a memory model for drivers.  This should help simply
drivers.  Due to performance needs this need to be a very thin API layer
on top of the page allocator. (That's why I'm working with Mel Gorman
to get more close integration with the page allocator e.g. a bulking
facility).

> > Who knows! maybe someday XDP will define one unified RX API for all
> > drivers and it even will handle normal stack delivery it self :).
> >  
> That's exactly the point and what we need for TXDP. I'm missing why
> doing this is such rocket science other than the fact that all these
> drivers are vastly different and changing the existing API is
> unpleasant. The only functional complexity I see in creating a generic
> batching interface is handling return codes asynchronously. This is
> entirely feasible though...

I'll be happy as long as we get a batching interface, then we can
incrementally do the optimizations later.

In the future, I do hope (like Saeed) this RX API will evolve into
delivering (a bulk of) raw-packet-pages into the netstack, this should
simplify drivers, and we can keep the complexity and SKB allocations
out of the drivers.
To start with, we can play with doing this delivering (a bulk of)
raw-packet-pages into Tom's TXDP engine/system?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-22  9:43           ` Jesper Dangaard Brouer
@ 2017-02-22 17:22             ` Tom Herbert
  2017-02-22 21:43               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 20+ messages in thread
From: Tom Herbert @ 2017-02-22 17:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Saeed Mahameed, Alexander Duyck,
	Alexei Starovoitov, John Fastabend, David Miller, netdev,
	Brenden Blanco

On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> [...]
>> > The only complexity XDP is adding to the drivers is the constrains on
>> > RX memory management and memory model, calling the XDP program itself
>> > and handling the  action is really a simple thing once you have the
>> > correct memory model.
>
> Exactly, that is why I've been looking at introducing a generic
> facility for a memory model for drivers.  This should help simply
> drivers.  Due to performance needs this need to be a very thin API layer
> on top of the page allocator. (That's why I'm working with Mel Gorman
> to get more close integration with the page allocator e.g. a bulking
> facility).
>
>> > Who knows! maybe someday XDP will define one unified RX API for all
>> > drivers and it even will handle normal stack delivery it self :).
>> >
>> That's exactly the point and what we need for TXDP. I'm missing why
>> doing this is such rocket science other than the fact that all these
>> drivers are vastly different and changing the existing API is
>> unpleasant. The only functional complexity I see in creating a generic
>> batching interface is handling return codes asynchronously. This is
>> entirely feasible though...
>
> I'll be happy as long as we get a batching interface, then we can
> incrementally do the optimizations later.
>
> In the future, I do hope (like Saeed) this RX API will evolve into
> delivering (a bulk of) raw-packet-pages into the netstack, this should
> simplify drivers, and we can keep the complexity and SKB allocations
> out of the drivers.
> To start with, we can play with doing this delivering (a bulk of)
> raw-packet-pages into Tom's TXDP engine/system?
>
Hi Jesper,

Maybe we can to start to narrow in on what a batching API might look like.

Looking at mlx5 (as a model of how XDP is implemented) the main RX
loop in ml5e_poll_rx_cq calls the backend handler in one indirect
function call. The XDP path goes through mlx5e_handle_rx_cqe,
skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
building the skbuf. As a prerequisite to RX batching it would be
helpful if this could be flatten so that most of the logic is obvious
in the main RX loop.

The model of RX batching seems straightforward enough-- pull packets
from the ring, save xdp_data information in a vector, periodically
call into the stack to handle a batch where argument is the vector of
packets and another argument is an output vector that gives return
codes (XDP actions), process the each return code for each packet in
the driver accordingly. Presumably, there is a maximum allowed batch
that may or may not be the same as the NAPI budget so the so the
batching call needs to be done when the limit is reach and also before
exiting NAPI. For each packet the stack can return an XDP code,
XDP_PASS in this case could be interpreted as being consumed by the
stack; this would be used in the case the stack creates an skbuff for
the packet. The stack on it's part can process the batch how it sees
fit, it can process each packet individual in the canonical model, or
we can continue processing a batch in a VPP-like fashion.

The batching API could be transparent to the stack or not. In the
transparent case, the driver calls what looks like a receive function
but the stack may defer processing for batching. A callback function
(that can be inlined) is used to process return codes as I mentioned
previously. In the non-transparent model, the driver knowingly creates
the packet vector and then explicitly calls another function to
process the vector. Personally, I lean towards the transparent API,
this may be less complexity in drivers and gives the stack more
control over the parameters of batching (for instance it may choose
some batch size to optimize its processing instead of driver guessing
the best size).

Btw the logic for RX batching is very similar to how we batch packets
for RPS (I think you already mention an skb-less RPS and that should
hopefully be something would falls out from this design).

Tom

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-22 17:22             ` Tom Herbert
@ 2017-02-22 21:43               ` Jesper Dangaard Brouer
  2017-02-22 22:08                 ` Tom Herbert
  0 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2017-02-22 21:43 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Saeed Mahameed, Saeed Mahameed, Alexander Duyck,
	Alexei Starovoitov, John Fastabend, David Miller, netdev,
	Brenden Blanco, brouer

On Wed, 22 Feb 2017 09:22:53 -0800
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert <tom@herbertland.com> wrote:  
> >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:  
> > [...]  
> >> > The only complexity XDP is adding to the drivers is the constrains on
> >> > RX memory management and memory model, calling the XDP program itself
> >> > and handling the  action is really a simple thing once you have the
> >> > correct memory model.  
> >
> > Exactly, that is why I've been looking at introducing a generic
> > facility for a memory model for drivers.  This should help simply
> > drivers.  Due to performance needs this need to be a very thin API layer
> > on top of the page allocator. (That's why I'm working with Mel Gorman
> > to get more close integration with the page allocator e.g. a bulking
> > facility).
> >  
> >> > Who knows! maybe someday XDP will define one unified RX API for all
> >> > drivers and it even will handle normal stack delivery it self :).
> >> >  
> >> That's exactly the point and what we need for TXDP. I'm missing why
> >> doing this is such rocket science other than the fact that all these
> >> drivers are vastly different and changing the existing API is
> >> unpleasant. The only functional complexity I see in creating a generic
> >> batching interface is handling return codes asynchronously. This is
> >> entirely feasible though...  
> >
> > I'll be happy as long as we get a batching interface, then we can
> > incrementally do the optimizations later.
> >
> > In the future, I do hope (like Saeed) this RX API will evolve into
> > delivering (a bulk of) raw-packet-pages into the netstack, this should
> > simplify drivers, and we can keep the complexity and SKB allocations
> > out of the drivers.
> > To start with, we can play with doing this delivering (a bulk of)
> > raw-packet-pages into Tom's TXDP engine/system?
> >  
> Hi Jesper,
> 
> Maybe we can to start to narrow in on what a batching API might look like.
> 
> Looking at mlx5 (as a model of how XDP is implemented) the main RX
> loop in ml5e_poll_rx_cq calls the backend handler in one indirect
> function call. The XDP path goes through mlx5e_handle_rx_cqe,
> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
> building the skbuf. As a prerequisite to RX batching it would be
> helpful if this could be flatten so that most of the logic is obvious
> in the main RX loop.

I fully agree here, it would be helpful to flatten out.  The mlx5
driver is a bit hard to follow in that respect.  Saeed have already
send me some offlist patches, where some of this code gets
restructured. In one of the patches the RX-stages does get flatten out
some more.  We are currently benchmarking this patchset, and depending
on CPU it is either a small win or a small (7ns) regressing (on the newest
CPUs).


> The model of RX batching seems straightforward enough-- pull packets
> from the ring, save xdp_data information in a vector, periodically
> call into the stack to handle a batch where argument is the vector of
> packets and another argument is an output vector that gives return
> codes (XDP actions), process the each return code for each packet in
> the driver accordingly.

Yes, exactly.  I did imagine that (maybe), the input vector of packets
could have a room for the return codes (XDP actions) next to the packet
pointer?


> Presumably, there is a maximum allowed batch
> that may or may not be the same as the NAPI budget so the so the
> batching call needs to be done when the limit is reach and also before
> exiting NAPI. 

In my PoC code that Saeed is working on, we have a smaller batch
size(10), and prefetch to L2 cache (like DPDK does), based on the
theory that we don't want to stress the L2 cache usage, and that these
CPUs usually have a Line Feed Buffer (LFB) that is limited to 10
outstanding cache-lines.

I don't know if this artifically smaller batch size is the right thing,
as DPDK always prefetch to L2 cache all 32 packets on RX.  And snabb
uses batches of 100 packets per "breath".


> For each packet the stack can return an XDP code,
> XDP_PASS in this case could be interpreted as being consumed by the
> stack; this would be used in the case the stack creates an skbuff for
> the packet. The stack on it's part can process the batch how it sees
> fit, it can process each packet individual in the canonical model, or
> we can continue processing a batch in a VPP-like fashion.

Agree.

> The batching API could be transparent to the stack or not. In the
> transparent case, the driver calls what looks like a receive function
> but the stack may defer processing for batching. A callback function
> (that can be inlined) is used to process return codes as I mentioned
> previously. In the non-transparent model, the driver knowingly creates
> the packet vector and then explicitly calls another function to
> process the vector. Personally, I lean towards the transparent API,
> this may be less complexity in drivers and gives the stack more
> control over the parameters of batching (for instance it may choose
> some batch size to optimize its processing instead of driver guessing
> the best size).

I cannot make up my mind on which model... I have to think some more
about this.  Thanks for bringing this up! :-)  This is something we
need to think about.


> Btw the logic for RX batching is very similar to how we batch packets
> for RPS (I think you already mention an skb-less RPS and that should
> hopefully be something would falls out from this design).

Yes, I've mentioned skb-less RPS before, because it seem wasteful for
RPS to allocate the fat skb on one CPU (and memset zero) force all
cache-lines hot, just to transfer it to another CPU.
 The tricky part is how do we transfer, the info from the NIC specific
descriptor on HW-offload fields? (that we usually update the SKB with,
before the netstack gets the SKB).

The question is also if XDP should be part of skb-less RPS steering, or
it should be something more generic in the stack? (after we get a bulk
of raw-packets delivered to the stack).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Focusing the XDP project
  2017-02-22 21:43               ` Jesper Dangaard Brouer
@ 2017-02-22 22:08                 ` Tom Herbert
  0 siblings, 0 replies; 20+ messages in thread
From: Tom Herbert @ 2017-02-22 22:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, Saeed Mahameed, Alexander Duyck,
	Alexei Starovoitov, John Fastabend, David Miller, netdev,
	Brenden Blanco

On Wed, Feb 22, 2017 at 1:43 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 22 Feb 2017 09:22:53 -0800
> Tom Herbert <tom@herbertland.com> wrote:
>
>> On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> >
>> > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert <tom@herbertland.com> wrote:
>> >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> > [...]
>> >> > The only complexity XDP is adding to the drivers is the constrains on
>> >> > RX memory management and memory model, calling the XDP program itself
>> >> > and handling the  action is really a simple thing once you have the
>> >> > correct memory model.
>> >
>> > Exactly, that is why I've been looking at introducing a generic
>> > facility for a memory model for drivers.  This should help simply
>> > drivers.  Due to performance needs this need to be a very thin API layer
>> > on top of the page allocator. (That's why I'm working with Mel Gorman
>> > to get more close integration with the page allocator e.g. a bulking
>> > facility).
>> >
>> >> > Who knows! maybe someday XDP will define one unified RX API for all
>> >> > drivers and it even will handle normal stack delivery it self :).
>> >> >
>> >> That's exactly the point and what we need for TXDP. I'm missing why
>> >> doing this is such rocket science other than the fact that all these
>> >> drivers are vastly different and changing the existing API is
>> >> unpleasant. The only functional complexity I see in creating a generic
>> >> batching interface is handling return codes asynchronously. This is
>> >> entirely feasible though...
>> >
>> > I'll be happy as long as we get a batching interface, then we can
>> > incrementally do the optimizations later.
>> >
>> > In the future, I do hope (like Saeed) this RX API will evolve into
>> > delivering (a bulk of) raw-packet-pages into the netstack, this should
>> > simplify drivers, and we can keep the complexity and SKB allocations
>> > out of the drivers.
>> > To start with, we can play with doing this delivering (a bulk of)
>> > raw-packet-pages into Tom's TXDP engine/system?
>> >
>> Hi Jesper,
>>
>> Maybe we can to start to narrow in on what a batching API might look like.
>>
>> Looking at mlx5 (as a model of how XDP is implemented) the main RX
>> loop in ml5e_poll_rx_cq calls the backend handler in one indirect
>> function call. The XDP path goes through mlx5e_handle_rx_cqe,
>> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
>> building the skbuf. As a prerequisite to RX batching it would be
>> helpful if this could be flatten so that most of the logic is obvious
>> in the main RX loop.
>
> I fully agree here, it would be helpful to flatten out.  The mlx5
> driver is a bit hard to follow in that respect.  Saeed have already
> send me some offlist patches, where some of this code gets
> restructured. In one of the patches the RX-stages does get flatten out
> some more.  We are currently benchmarking this patchset, and depending
> on CPU it is either a small win or a small (7ns) regressing (on the newest
> CPUs).
>
Cool!

>
>> The model of RX batching seems straightforward enough-- pull packets
>> from the ring, save xdp_data information in a vector, periodically
>> call into the stack to handle a batch where argument is the vector of
>> packets and another argument is an output vector that gives return
>> codes (XDP actions), process the each return code for each packet in
>> the driver accordingly.
>
> Yes, exactly.  I did imagine that (maybe), the input vector of packets
> could have a room for the return codes (XDP actions) next to the packet
> pointer?
>
Which ever way is more efficient I suppose. The important point is
that the return code should be only the only thing returned to the
driver.

>
>> Presumably, there is a maximum allowed batch
>> that may or may not be the same as the NAPI budget so the so the
>> batching call needs to be done when the limit is reach and also before
>> exiting NAPI.
>
> In my PoC code that Saeed is working on, we have a smaller batch
> size(10), and prefetch to L2 cache (like DPDK does), based on the
> theory that we don't want to stress the L2 cache usage, and that these
> CPUs usually have a Line Feed Buffer (LFB) that is limited to 10
> outstanding cache-lines.
>
> I don't know if this artifically smaller batch size is the right thing,
> as DPDK always prefetch to L2 cache all 32 packets on RX.  And snabb
> uses batches of 100 packets per "breath".
>
Maybe make it configurable :-)

>
>> For each packet the stack can return an XDP code,
>> XDP_PASS in this case could be interpreted as being consumed by the
>> stack; this would be used in the case the stack creates an skbuff for
>> the packet. The stack on it's part can process the batch how it sees
>> fit, it can process each packet individual in the canonical model, or
>> we can continue processing a batch in a VPP-like fashion.
>
> Agree.
>
>> The batching API could be transparent to the stack or not. In the
>> transparent case, the driver calls what looks like a receive function
>> but the stack may defer processing for batching. A callback function
>> (that can be inlined) is used to process return codes as I mentioned
>> previously. In the non-transparent model, the driver knowingly creates
>> the packet vector and then explicitly calls another function to
>> process the vector. Personally, I lean towards the transparent API,
>> this may be less complexity in drivers and gives the stack more
>> control over the parameters of batching (for instance it may choose
>> some batch size to optimize its processing instead of driver guessing
>> the best size).
>
> I cannot make up my mind on which model... I have to think some more
> about this.  Thanks for bringing this up! :-)  This is something we
> need to think about.
>
>
>> Btw the logic for RX batching is very similar to how we batch packets
>> for RPS (I think you already mention an skb-less RPS and that should
>> hopefully be something would falls out from this design).
>
> Yes, I've mentioned skb-less RPS before, because it seem wasteful for
> RPS to allocate the fat skb on one CPU (and memset zero) force all
> cache-lines hot, just to transfer it to another CPU.
>  The tricky part is how do we transfer, the info from the NIC specific
> descriptor on HW-offload fields? (that we usually update the SKB with,
> before the netstack gets the SKB).
>
> The question is also if XDP should be part of skb-less RPS steering, or
> it should be something more generic in the stack? (after we get a bulk
> of raw-packets delivered to the stack).
>
I'd probably keep them separate for now since the mechanisms have
pretty different goals.

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-02-22 22:08 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-20 10:13 Focusing the XDP project Jesper Dangaard Brouer
2017-02-20 20:09 ` Alexander Duyck
2017-02-20 22:57   ` Saeed Mahameed
2017-02-20 23:40     ` Alexander Duyck
2017-02-21 23:08       ` Saeed Mahameed
2017-02-21 16:35     ` Tom Herbert
2017-02-21 16:46       ` David Miller
2017-02-21 17:40         ` Tom Herbert
2017-02-21 18:11           ` David Miller
2017-02-21 18:23             ` Tom Herbert
2017-02-21 22:40           ` Saeed Mahameed
2017-02-21 23:04             ` Tom Herbert
2017-02-21 22:29       ` Saeed Mahameed
2017-02-21 22:54         ` Tom Herbert
2017-02-22  9:43           ` Jesper Dangaard Brouer
2017-02-22 17:22             ` Tom Herbert
2017-02-22 21:43               ` Jesper Dangaard Brouer
2017-02-22 22:08                 ` Tom Herbert
2017-02-20 23:39   ` Jesper Dangaard Brouer
2017-02-21  0:39     ` Alexander Duyck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.