* Optimizing instruction-cache, more packets at each stage
@ 2016-01-15 13:22 Jesper Dangaard Brouer
  2016-01-15 13:32 ` Hannes Frederic Sowa
                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-15 13:22 UTC (permalink / raw)
  To: netdev
  Cc: brouer, David Miller, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend


Given net-next is closed, we have time to discuss controversial core
changes right? ;-)

I want to do some instruction-cache level optimizations.

What do I mean by that...

The kernel network stack code path (that a packet travels) is obviously
larger than the instruction cache (icache).  Today, every packet
travels individually through the network stack, experiencing the exact
same icache misses as the previous packet.

I imagine that we could process several packets at each stage in the
packet processing code path, thereby making better use of the
icache.

Today, we already allow NAPI net_rx_action() to process many
(e.g. up to 64) packets in the driver RX-poll routine.  But the driver
then calls the "full" stack for every single packet (e.g. via
napi_gro_receive()) in its processing loop, thus thrashing the icache
for every packet.

I have a proof-of-concept patch for ixgbe, which gives me a 10% speedup
on full IP forwarding.  (This patch also delays when I touch the packet
data, so it also reduces data-cache misses.)  The basic idea is that I
delay calling ixgbe_rx_skb()/napi_gro_receive(), and allow the RX loop
(in ixgbe_clean_rx_irq()) to run more iterations before "flushing" the
icache (by calling the stack).


This was only at the driver level.  I would also like some API towards
the stack.  Maybe we could simply pass an skb-list?

Changing / adjusting the stack to support processing in "stages" might
be more difficult/controversial?

Maybe we should view the packets stuck/available in the RX ring as
packets that all arrived at the same "time", and thus process them at
the same time.

By letting the "bulking" depend on the packets available in the RX ring,
we automatically amortize processing cost in a scalable manner.


One challenge with icache optimizations is that they are hard to profile.
But hopefully the new Skylake CPUs can profile this, because as I
always say: if you cannot measure it, you cannot improve it.


p.s. I'm doing a Network Performance BoF[1] at NetDev 1.1, where this
and many more subjects will be brought up face-to-face.

[1] http://netdevconf.org/1.1/bof-network-performance-bof-jesper-dangaard-brouer.html
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 13:22 Optimizing instruction-cache, more packets at each stage Jesper Dangaard Brouer
@ 2016-01-15 13:32 ` Hannes Frederic Sowa
  2016-01-15 14:17   ` Jesper Dangaard Brouer
  2016-01-15 13:36 ` David Laight
  2016-01-15 20:47 ` David Miller
  2 siblings, 1 reply; 59+ messages in thread
From: Hannes Frederic Sowa @ 2016-01-15 13:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev
  Cc: David Miller, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Florian Westphal, Paolo Abeni,
	John Fastabend

On 15.01.2016 14:22, Jesper Dangaard Brouer wrote:
>
> Given net-next is closed, we have time to discuss controversial core
> changes right? ;-)
>
> I want to do some instruction-cache level optimizations.
>
> What do I mean by that...
>
> The kernel network stack code path (a packet travels) is obviously
> larger than the instruction-cache (icache).  Today, every packet
> travel individually through the network stack, experiencing the exact
> same icache misses (as the previous packet).
>
> I imagine that we could process several packets at each stage in the
> packet processing code path.  That way making better use of the
> icache.
>
> Today, we already allow NAPI net_rx_action() to process many
> (e.g. up-to 64) packets in the driver RX-poll routine.  But the driver
> then calls the "full" stack for every single packet (e.g. via
> napi_gro_receive()) in its processing loop.  Thus, trashing the icache
> for every packet.
>
> I have a prove-of-concept patch for ixgbe, which gives me 10% speedup
> on full IP forwarding.  (This patch also optimize delaying when I
> touch the packet data, thus it also optimizes data-cache misses).  The
> basic idea is that I delay calling ixgbe_rx_skb/napi_gro_receive, and
> allow the RX loop (in ixgbe_clean_rx_irq()) to run more iterations
> before "flushing" the icache (by calling the stack).
>
>
> This was only at the driver level.  I also would like some API towards
> the stack.  Maybe we could simple pass a skb-list?
>
> Changing / adjusting the stack to support processing in "stages" might
> be more difficult/controversial?

I once tried this up to the vlan layer, and the error handling got so 
complex and complicated that I stopped there. Maybe it is possible in 
some separate stages.

This needs a redesign of a lot of stuff, and while doing so I would 
switch from the current, more call-stack based approach of building the 
stack to a more iterative one (see e.g. the stack space consumption problems).

Just my 2 cents,
Hannes

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: Optimizing instruction-cache, more packets at each stage
  2016-01-15 13:22 Optimizing instruction-cache, more packets at each stage Jesper Dangaard Brouer
  2016-01-15 13:32 ` Hannes Frederic Sowa
@ 2016-01-15 13:36 ` David Laight
  2016-01-15 14:00   ` Jesper Dangaard Brouer
  2016-01-15 20:47 ` David Miller
  2 siblings, 1 reply; 59+ messages in thread
From: David Laight @ 2016-01-15 13:36 UTC (permalink / raw)
  To: 'Jesper Dangaard Brouer', netdev
  Cc: David Miller, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend

From: Jesper Dangaard Brouer
> Sent: 15 January 2016 13:22
...
> I want to do some instruction-cache level optimizations.
> 
> What do I mean by that...
> 
> The kernel network stack code path (a packet travels) is obviously
> larger than the instruction-cache (icache).  Today, every packet
> travel individually through the network stack, experiencing the exact
> same icache misses (as the previous packet).
...

Is that actually true for modern server processors that have a large i-cache?
While the total size of the networking code may well be larger, the
part used for transmitting data packets will be much smaller and
could easily fit in the icache.

	David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 13:36 ` David Laight
@ 2016-01-15 14:00   ` Jesper Dangaard Brouer
  2016-01-15 14:38     ` Felix Fietkau
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-15 14:00 UTC (permalink / raw)
  To: David Laight
  Cc: netdev, David Miller, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, brouer

On Fri, 15 Jan 2016 13:36:04 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Jesper Dangaard Brouer
> > Sent: 15 January 2016 13:22
> ...
> > I want to do some instruction-cache level optimizations.
> > 
> > What do I mean by that...
> > 
> > The kernel network stack code path (a packet travels) is obviously
> > larger than the instruction-cache (icache).  Today, every packet
> > travel individually through the network stack, experiencing the exact
> > same icache misses (as the previous packet).
> ...
> 
> Is that actually true for modern server processors that have large i-cache.
> While the total size of the networking code may well be larger, that
> part used for transmitting data packets will be much be smaller and
> could easily fit in the icache.

Yes, exactly. That is what I'm betting on.  If I can split it into
stages (e.g. the part used for transmitting) that fit into the icache,
then I should see a win.

The icache is still quite small, 32 KB, on modern server processors.  I
don't know if smaller embedded processors also have an icache and how
large it is.  I speculate this approach would also benefit them
(if they have an icache).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 13:32 ` Hannes Frederic Sowa
@ 2016-01-15 14:17   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-15 14:17 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: netdev, David Miller, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Florian Westphal, Paolo Abeni,
	John Fastabend, brouer

On Fri, 15 Jan 2016 14:32:12 +0100
Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:

> On 15.01.2016 14:22, Jesper Dangaard Brouer wrote:
> >
> > Given net-next is closed, we have time to discuss controversial core
> > changes right? ;-)
> >
> > I want to do some instruction-cache level optimizations.
> >
> > What do I mean by that...
> >
> > The kernel network stack code path (a packet travels) is obviously
> > larger than the instruction-cache (icache).  Today, every packet
> > travel individually through the network stack, experiencing the exact
> > same icache misses (as the previous packet).
> >
> > I imagine that we could process several packets at each stage in the
> > packet processing code path.  That way making better use of the
> > icache.
> >
> > Today, we already allow NAPI net_rx_action() to process many
> > (e.g. up-to 64) packets in the driver RX-poll routine.  But the driver
> > then calls the "full" stack for every single packet (e.g. via
> > napi_gro_receive()) in its processing loop.  Thus, trashing the icache
> > for every packet.
> >
> > I have a prove-of-concept patch for ixgbe, which gives me 10% speedup
> > on full IP forwarding.  (This patch also optimize delaying when I
> > touch the packet data, thus it also optimizes data-cache misses).  The
> > basic idea is that I delay calling ixgbe_rx_skb/napi_gro_receive, and
> > allow the RX loop (in ixgbe_clean_rx_irq()) to run more iterations
> > before "flushing" the icache (by calling the stack).
> >
> >
> > This was only at the driver level.  I also would like some API towards
> > the stack.  Maybe we could simple pass a skb-list?
> >
> > Changing / adjusting the stack to support processing in "stages" might
> > be more difficult/controversial?
> 
> I once tried this up till the vlan layer and error handling got so 
> complex and complicated that I stopped there. Maybe it is possible in 
> some separate stages.

I've already split the driver layer into a stage.  Next I will split
the GRO layer into a stage.  The GRO layer is actually quite expensive
icache-wise, as it has deep call chains and the compiler cannot inline
functions due to the flexible function pointer approach.  Simply
enabling/disabling GRO shows a 10% CPU usage drop (and a perf increase).


> This needs redesign of a lot of stuff and while doing so I would
> switch from a more stack based approach to build the stack to try out
> a more iterative one (see e.g. stack space consumption problems).

The recursive nature of the rx handler (__netif_receive_skb_core/another_round)
is not necessarily a bad approach for icache usage (unless the rx_handler()
call indirectly flushes the icache).  But as you have shown, it _is_ bad for
stack space consumption.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 14:00   ` Jesper Dangaard Brouer
@ 2016-01-15 14:38     ` Felix Fietkau
  2016-01-18 11:54       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Felix Fietkau @ 2016-01-15 14:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David Laight
  Cc: netdev, David Miller, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend

On 2016-01-15 15:00, Jesper Dangaard Brouer wrote:
> On Fri, 15 Jan 2016 13:36:04 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
>> From: Jesper Dangaard Brouer
>> > Sent: 15 January 2016 13:22
>> ...
>> > I want to do some instruction-cache level optimizations.
>> > 
>> > What do I mean by that...
>> > 
>> > The kernel network stack code path (a packet travels) is obviously
>> > larger than the instruction-cache (icache).  Today, every packet
>> > travel individually through the network stack, experiencing the exact
>> > same icache misses (as the previous packet).
>> ...
>> 
>> Is that actually true for modern server processors that have large i-cache.
>> While the total size of the networking code may well be larger, that
>> part used for transmitting data packets will be much be smaller and
>> could easily fit in the icache.
> 
> Yes, exactly. That is what I'm betting on. If I can split it into
> stages (e.g. part used for transmitting) that fits into icache then I
> should see a win.
> 
> The icache is still quite small 32Kb on modern server processors.  I
> don't know if smaller embedded processors also have icache and how
> large they are.  I speculate this approach would also be a benefit for
> them (if they have icache).
All of the router devices that I work with have icache. Typical sizes
are 32 or 64 KiB. FWIW, I'm really looking forward to having such
optimizations in the network stack ;)

- Felix

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 13:22 Optimizing instruction-cache, more packets at each stage Jesper Dangaard Brouer
  2016-01-15 13:32 ` Hannes Frederic Sowa
  2016-01-15 13:36 ` David Laight
@ 2016-01-15 20:47 ` David Miller
  2016-01-18 10:27   ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 59+ messages in thread
From: David Miller @ 2016-01-15 20:47 UTC (permalink / raw)
  To: brouer
  Cc: netdev, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 15 Jan 2016 14:22:23 +0100

> This was only at the driver level.  I also would like some API towards
> the stack.  Maybe we could simple pass a skb-list?

Data structures are everything, so maybe we can create some kind of SKB
bundle abstraction.  Whether it's a lockless array or a linked list
behind it doesn't really matter.

We could have two categories: Related and Unrelated.

If you think about GRO and routing keys you might see what I am getting
at. :-)
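
For illustration, one way such a bundle could look (the struct name and
fields below are made up for this discussion, not an existing kernel API):

 /* Hypothetical SKB bundle: two lists matching the Related/Unrelated
  * categories described above.
  */
 struct skb_bundle {
	struct sk_buff_head related;	/* same flow, e.g. same hash/routing key */
	struct sk_buff_head unrelated;	/* everything else */
 };

 static inline void skb_bundle_init(struct skb_bundle *b)
 {
	__skb_queue_head_init(&b->related);
	__skb_queue_head_init(&b->unrelated);
 }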

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 20:47 ` David Miller
@ 2016-01-18 10:27   ` Jesper Dangaard Brouer
  2016-01-18 16:24     ` David Miller
                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-18 10:27 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend, brouer

On Fri, 15 Jan 2016 15:47:21 -0500 (EST)
David Miller <davem@davemloft.net> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Fri, 15 Jan 2016 14:22:23 +0100
> 
> > This was only at the driver level.  I also would like some API towards
> > the stack.  Maybe we could simple pass a skb-list?
> 
> Datastructures are everything so maybe we can create some kind of SKB
> bundle abstractions.  Whether it's a lockless array or a linked list
> behind it doesn't really matter.
> 
> We could have two categories: Related and Unrelated.
> 
> If you think about GRO and routing keys you might see what I am getting
> at. :-)

Yes, I think I get it.  I like the idea of Related and Unrelated.
We already have GRO packets, which are in the "Related" category/type.

I'm wondering about the API between driver and "GRO-layer" (calling
napi_gro_receive):

Down in the driver layer (RX), I think it is too early to categorize
Related/Unrelated SKBs, because we want to delay touching packet data
as long as possible (waiting for the prefetcher to get data into
cache).

We could keep the napi_gro_receive() call.  But in order to save
icache, the driver could just create its own simple loop around
napi_gro_receive().  This loop's icache footprint and the extra
function call per packet would cost something.

The downside is: the GRO layer will have no idea how many "more"
packets are coming.  Thus, it depends on a "flush" API, which for
"xmit_more" didn't work out that well.

The NAPI drivers actually already have a flush API (calling
napi_complete_done()), BUT it does not always get invoked, e.g. if the
driver has more work to do and wants to keep polling.
 I'm not sure we want to delay "flushing" packets queued in the GRO
layer for that long(?).


The simplest solution to get around this (the flush and driver-loop
complexity) would be to create an SKB list down in the driver, and
call napi_gro_receive() with this list; simply extend napi_gro_receive()
with an SKB-list loop.
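
A minimal sketch of what that could look like (napi_gro_receive_list()
does not exist in the kernel; it only illustrates the proposed list-based
API built on top of the real napi_gro_receive()):

 /* Hypothetical list-aware entry point: drain a driver-built SKB list
  * into the existing GRO path in one call.
  */
 static void napi_gro_receive_list(struct napi_struct *napi,
				   struct sk_buff_head *list)
 {
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(list)) != NULL)
		napi_gro_receive(napi, skb);
 }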

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-15 14:38     ` Felix Fietkau
@ 2016-01-18 11:54       ` Jesper Dangaard Brouer
  2016-01-18 17:01         ` Eric Dumazet
  2016-01-25  0:08         ` Florian Fainelli
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-18 11:54 UTC (permalink / raw)
  To: Felix Fietkau
  Cc: David Laight, netdev, David Miller, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, brouer


On Fri, 15 Jan 2016 15:38:43 +0100 Felix Fietkau <nbd@openwrt.org> wrote:
> On 2016-01-15 15:00, Jesper Dangaard Brouer wrote:
[...]
> > 
> > The icache is still quite small 32Kb on modern server processors.  I
> > don't know if smaller embedded processors also have icache and how
> > large they are.  I speculate this approach would also be a benefit for
> > them (if they have icache).
>
> All of the router devices that I work with have icache. Typical sizes
> are 32 or 64 KiB. FWIW, I'm really looking forward to having such
> optimizations in the network stack ;)

That is very interesting. This kind of icache optimization will then
likely benefit lower-end devices more than high-end Intel CPUs :-)

AFAIK the Intel CPUs are masking this icache problem by having an icache
prefetcher and optimizing how fast the CPU can load/refill from higher
level caches.  Intel CPUs have a lot of HW logic around this, which I
assume the smaller CPUs don't.  E.g. a quote from the Intel Optimization
Reference Manual:

 "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
  instruction bytes each cycle from the instruction cache to the
  instruction length decoder (ILD). The instruction queue (IQ) buffers
  the ILD-processed instructions and can deliver up to four instructions
  in one cycle to the instruction decoder."

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 10:27   ` Jesper Dangaard Brouer
@ 2016-01-18 16:24     ` David Miller
  2016-01-20 22:20       ` Or Gerlitz
  2016-01-18 16:53     ` Eric Dumazet
  2016-01-18 17:36     ` Tom Herbert
  2 siblings, 1 reply; 59+ messages in thread
From: David Miller @ 2016-01-18 16:24 UTC (permalink / raw)
  To: brouer
  Cc: netdev, alexander.duyck, alexei.starovoitov, borkmann, marek,
	hannes, fw, pabeni, john.r.fastabend

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 18 Jan 2016 11:27:03 +0100

> Down in the driver layer (RX), I think it is too early to categorize
> Related/Unrelated SKB's, because we want to delay touching packet-data
> as long as possible (waiting for the prefetcher to get data into
> cache).

You don't need to touch the headers in order to have a good idea
as to whether there is a strong possibility packets are related
or not.

We have the hash available.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 10:27   ` Jesper Dangaard Brouer
  2016-01-18 16:24     ` David Miller
@ 2016-01-18 16:53     ` Eric Dumazet
  2016-01-18 17:36     ` Tom Herbert
  2 siblings, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-18 16:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, netdev, alexander.duyck, alexei.starovoitov,
	borkmann, marek, hannes, fw, pabeni, john.r.fastabend

On Mon, 2016-01-18 at 11:27 +0100, Jesper Dangaard Brouer wrote:

> The NAPI drivers actually already have a flush API (calling
> napi_complete_done()), BUT it does not always get invoked, e.g. if the
> driver have more work to do, and want to keep polling.
>  I'm not sure we want to delay "flushing" packets queued in the GRO
> layer for this long(?).

Since linux-3.7 we have this logic :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=2e71a6f8084e7ac87166dd77d99c44190fb844fc


(Some arches have quite expensive high resolution timestamps)

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 11:54       ` Jesper Dangaard Brouer
@ 2016-01-18 17:01         ` Eric Dumazet
  2016-01-25  0:08         ` Florian Fainelli
  1 sibling, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-18 17:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Felix Fietkau, David Laight, netdev, David Miller,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend

On Mon, 2016-01-18 at 12:54 +0100, Jesper Dangaard Brouer wrote:

> That is very interesting. These kind of icache optimization will then
> likely benefit lower-end devices more than high end Intel CPUs :-)
> 
> AFAIK the Intel CPUs are masking this icache problem, by having a icache
> prefetcher and optimizing how fast the CPU can load/refill from higher
> level caches.  Intel CPUs have a lot of HW-logic around this, which the
> I assume the smaller CPUs don't.  E.g. quote from Intel Optimization
> Reference Manual:
> 
>  "The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned
>   instruction bytes each cycle from the instruction cache to the
>   instruction length decoder (ILD). The instruction queue (IQ) buffers
>   the ILD-processed instructions and can deliver up to four instructions
>   in one cycle to the instruction decoder."
> 

This does not tell how many cores/threads can fetch 16 bytes per cycle.

With more than 36 execution units per socket, the peak performance of a
single unit does not reflect what happens when all units are busy and
contend on shared resources.

If we want to properly exploit the L1 caches of each execution unit, we need
to split the load into a pipeline. But the number of units depends on
hardware capabilities (like L1 cache size), something hard to code in a
generic way (the linux kernel).

For example, having the same core handle RX and TX interrupts is not
the best choice, especially when TX interrupts have to call expensive
callbacks to upper layers (TCP Small Queues).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 10:27   ` Jesper Dangaard Brouer
  2016-01-18 16:24     ` David Miller
  2016-01-18 16:53     ` Eric Dumazet
@ 2016-01-18 17:36     ` Tom Herbert
  2016-01-18 17:49       ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-18 17:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, marek, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend

On Mon, Jan 18, 2016 at 2:27 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Fri, 15 Jan 2016 15:47:21 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>> Date: Fri, 15 Jan 2016 14:22:23 +0100
>>
>> > This was only at the driver level.  I also would like some API towards
>> > the stack.  Maybe we could simple pass a skb-list?
>>
>> Datastructures are everything so maybe we can create some kind of SKB
>> bundle abstractions.  Whether it's a lockless array or a linked list
>> behind it doesn't really matter.
>>
>> We could have two categories: Related and Unrelated.
>>
>> If you think about GRO and routing keys you might see what I am getting
>> at. :-)
>
> Yes, I think I get it.  I like the idea of Related and Unrelated.
> We already have GRO packets which is in the "Related" category/type.
>
> I'm wondering about the API between driver and "GRO-layer" (calling
> napi_gro_receive):
>
> Down in the driver layer (RX), I think it is too early to categorize
> Related/Unrelated SKB's, because we want to delay touching packet-data
> as long as possible (waiting for the prefetcher to get data into
> cache).
>
Does DDIO address this?

> We could keep the napi_gro_receive() call.  But in-order to save
> icache, then the driver could just create it's own simple loop around
> napi_gro_receive().  This loop's icache and extra function call per
> packet would cost something.
>
> The down side is: The GRO layer will have no-idea how many "more"
> packets are coming.  Thus, it depends on a "flush" API, which for
> "xmit_more" didn't work out that well.
>
> The NAPI drivers actually already have a flush API (calling
> napi_complete_done()), BUT it does not always get invoked, e.g. if the
> driver have more work to do, and want to keep polling.
>  I'm not sure we want to delay "flushing" packets queued in the GRO
> layer for this long(?).
>
>
> The simplest solution to get around this (flush and driver loop
> complexity), would be to create a SKB-list down in the driver, and
> call napi_gro_receive() with this list.  Simply extending napi_gro_receive()
> with a SKB list loop.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 17:36     ` Tom Herbert
@ 2016-01-18 17:49       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-18 17:49 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, marek, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, brouer


On Mon, 18 Jan 2016 09:36:32 -0800 Tom Herbert <tom@herbertland.com> wrote:

> On Mon, Jan 18, 2016 at 2:27 AM, Jesper Dangaard Brouer
[...]
> > Down in the driver layer (RX), I think it is too early to categorize
> > Related/Unrelated SKB's, because we want to delay touching packet-data
> > as long as possible (waiting for the prefetcher to get data into
> > cache).
> >
> Does DDIO address this?

Data Direct I/O (DDIO) delivers packet data into the L3 cache, which is
great for avoiding this first cache miss on data.  But not all CPUs have
this feature, and it is difficult to deduce which CPUs support it.

For test purposes, I do have systems both with and without DDIO.

I'm currently setting up a Skylake CPU based system, which I believe
doesn't have DDIO.  The reason for this system is that the Skylake CPU
should have better PMU support for profiling the icache and front-end.
I'll soon verify this...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 16:24     ` David Miller
@ 2016-01-20 22:20       ` Or Gerlitz
  2016-01-20 23:02         ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Or Gerlitz @ 2016-01-20 22:20 UTC (permalink / raw)
  To: David Miller, Eric Dumazet
  Cc: Jesper Dangaard Brouer, Linux Netdev List, Alexander Duyck,
	Alexei Starovoitov, borkmann, marek, hannes, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai

On Mon, Jan 18, 2016 at 6:24 PM, David Miller <davem@davemloft.net> wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Mon, 18 Jan 2016 11:27:03 +0100
>
>> Down in the driver layer (RX), I think it is too early to categorize
>> Related/Unrelated SKB's, because we want to delay touching packet-data
>> as long as possible (waiting for the prefetcher to get data into
>> cache).
>
> You don't need to touch the headers in order to have a good idea
> as to whether there is a strong possibility packets are related
> or not.
>
> We have the hash available.

Dave, I assume you refer to the RSS hash result, which is written by
the NIC HW to the completion descriptor and then fed to the stack by the
driver calling skb_set_hash()? Well, this can be taken even further.

Suppose the NIC can be programmed by the kernel to provide a unique
flow tag on the completion descriptor for a given 5/12 tuple which
represents a TCP (or other logical) stream that a higher level in the
stack has identified to be in progress, and the driver plants that in
skb->mark before calling into the stack.

I guess this could yield a nice speedup for the GRO stack -- matching
based on a single 32-bit value instead of per-protocol (eth, vlan, ip,
tcp) checks [1] -- or hint which packets from the current window of
"ready" completion descriptors could be grouped together for upper
processing?

Or.

[1] some details remain to be completed (...) here; on the last protocol
hop we do need to verify that it would be correct to attach the incoming
packet to the existing pending packet of this stream

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-20 22:20       ` Or Gerlitz
@ 2016-01-20 23:02         ` Eric Dumazet
  2016-01-20 23:27           ` Tom Herbert
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2016-01-20 23:02 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov, borkmann,
	marek, hannes, Florian Westphal, Paolo Abeni, John Fastabend,
	Amir Vadai

On Thu, 2016-01-21 at 00:20 +0200, Or Gerlitz wrote:

> Dave, I assume you refer to the RSS hash result which is written by
> NIC HWs to the completion descriptor and then fed to the stack by the
> driver calling skb_set_hash(.)? Well, this can be taken even further.
> 
> Suppose a the NIC can be programmed by the kernel to provide a unique
> flow tag on the completion descriptor per a given 5/12 tuple which
> represents a TCP (or other logical) stream a higher level in the stack
> is identifying to be in progress, and the driver plants that in
> skb->mark before calling into the stack.
> 
> I guess this could yield nice speed up for the GRO stack -- matching
> based on single 32 bit value instead of per protocol (eth, vlan, ip,
> tcp) checks [1] - or hint which packets from the current window of
> "ready" completion descriptor could be grouped together for upper
> processing?

We already use the RSS hash (skb->hash) in the GRO engine to speed up
the parsing: if skb->hash differs, then there is no point trying to
aggregate two packets.

Note that if we had an L4 hash for all provided packets, GRO could use a
hash table instead of one single list of skbs.
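
A rough sketch of that idea (the bucket count and helper name are
assumptions; skb_get_hash_raw() is the real accessor for skb->hash):

 /* Sketch: pick a GRO bucket from the (possibly HW-provided) skb->hash,
  * so lookup walks a short per-bucket list instead of one long gro_list.
  */
 #define GRO_HASH_BUCKETS 8

 static struct sk_buff_head *gro_hash_bucket(struct sk_buff_head *buckets,
					     const struct sk_buff *skb)
 {
	return &buckets[skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1)];
 }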

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-20 23:02         ` Eric Dumazet
@ 2016-01-20 23:27           ` Tom Herbert
  2016-01-21 11:27             ` Jesper Dangaard Brouer
                               ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Tom Herbert @ 2016-01-20 23:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Or Gerlitz, David Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Wed, Jan 20, 2016 at 3:02 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-01-21 at 00:20 +0200, Or Gerlitz wrote:
>
>> Dave, I assume you refer to the RSS hash result which is written by
>> NIC HWs to the completion descriptor and then fed to the stack by the
>> driver calling skb_set_hash(.)? Well, this can be taken even further.
>>
>> Suppose a the NIC can be programmed by the kernel to provide a unique
>> flow tag on the completion descriptor per a given 5/12 tuple which
>> represents a TCP (or other logical) stream a higher level in the stack
>> is identifying to be in progress, and the driver plants that in
>> skb->mark before calling into the stack.
>>
>> I guess this could yield nice speed up for the GRO stack -- matching
>> based on single 32 bit value instead of per protocol (eth, vlan, ip,
>> tcp) checks [1] - or hint which packets from the current window of
>> "ready" completion descriptor could be grouped together for upper
>> processing?
>
> We already use the RSS hash (skb->hash) in GRO engine to speedup the
> parsing : If skb->hash differs, then there is no point trying to
> aggregate two packets.
>
> Note that if we had a l4 hash for all provided packets, GRO could use a
> hash table instead of one single list of skbs.
>
Besides that, GRO requires parsing the packet anyway, so I don't see
much value in trying to optimize GRO by using the hash.

Unfortunately, the hardware hash from devices hasn't really lived up
to its potential. The original intent of getting the hash from the device
was to be able to do packet steering (RPS and RFS) without touching
the header. But this was never implemented. eth_type_trans() touches
headers, and GRO is best when done before steering. Given the
weaknesses of Toeplitz we talked about recently and the fact that
Jenkins is really fast to compute, I am starting to think maybe we
should always do a software hash and not rely on HW for it...
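
As a sketch of what "always do a software hash" could mean at the
driver/stack boundary (an assumption about placement, using the existing
skb_clear_hash()/skb_get_hash() helpers to force the flow-dissector path):

 /* Sketch: ignore any HW-reported hash and recompute it in software.
  * skb_get_hash() falls back to the flow dissector (Jenkins hash) when
  * no valid hash is set, so clearing first makes that unconditional.
  */
 static inline __u32 sketch_sw_flow_hash(struct sk_buff *skb)
 {
	skb_clear_hash(skb);
	return skb_get_hash(skb);
 }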


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-20 23:27           ` Tom Herbert
@ 2016-01-21 11:27             ` Jesper Dangaard Brouer
  2016-01-21 12:49               ` Or Gerlitz
                                 ` (2 more replies)
  2016-01-21 12:23             ` Jesper Dangaard Brouer
  2016-02-02 16:13             ` Or Gerlitz
  2 siblings, 3 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-21 11:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	brouer


On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote:

> eth_type_trans touches headers

True, the eth_type_trans() call in the driver is a major bottleneck,
because it touches the packet header and happens very early in the driver.

In my experiments, I extract several packets before calling
napi_gro_receive(), and I also delay calling eth_type_trans().  Most of
my speedup comes from this trick, as the prefetch() now has enough
time.

 while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) {
	skb->protocol = eth_type_trans(skb, rq->netdev);
	napi_gro_receive(cq->napi, skb);
 }

What if the HW could provide the info we need in the descriptor?!?


eth_type_trans() does two things:

1) determine skb->protocol
2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}

Could the HW descriptor deliver the "proto", or perhaps just some bits
for the most common protos?

The skb->pkt_type doesn't need many bits.  And I bet the HW already has
the information.  The BROADCAST and MULTICAST indications are easy.  The
PACKET_OTHERHOST case can be turned around by instead setting a PACKET_HOST
indication if the eth->h_dest matches the device's dev->dev_addr (else a
SW compare is required).

Is that doable in hardware?
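
A sketch of how a driver could consume such descriptor bits (the
my_rx_desc struct and desc_is_* helpers are hypothetical placeholders
for whatever the HW would report; ether_addr_equal() is the real
software fallback compare):

 /* Sketch: fill skb->pkt_type from hypothetical RX descriptor bits,
  * only touching the Ethernet header when the HW gave no indication.
  */
 static void sketch_set_pkt_type(struct sk_buff *skb,
				 const struct my_rx_desc *desc,
				 const struct net_device *dev)
 {
	/* skb->data still points at the Ethernet header at this stage */
	const struct ethhdr *eth = (const struct ethhdr *)skb->data;

	if (desc_is_broadcast(desc))
		skb->pkt_type = PACKET_BROADCAST;
	else if (desc_is_multicast(desc))
		skb->pkt_type = PACKET_MULTICAST;
	else if (desc_is_our_unicast(desc))	/* HW matched dev->dev_addr */
		skb->pkt_type = PACKET_HOST;
	else if (ether_addr_equal(eth->h_dest, dev->dev_addr))
		skb->pkt_type = PACKET_HOST;	/* SW fallback compare */
	else
		skb->pkt_type = PACKET_OTHERHOST;
 }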

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-20 23:27           ` Tom Herbert
  2016-01-21 11:27             ` Jesper Dangaard Brouer
@ 2016-01-21 12:23             ` Jesper Dangaard Brouer
  2016-01-21 16:38               ` Tom Herbert
  2016-02-02 16:13             ` Or Gerlitz
  2 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-21 12:23 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	brouer

On Wed, 20 Jan 2016 15:27:38 -0800
Tom Herbert <tom@herbertland.com> wrote:

> weaknesses of Toeplitz we talked about recently and that fact that
> Jenkins is really fast to compute, I am starting to think maybe we
> should always do a software hash and not rely on HW for it...

Please don't enforce a software hash.  You are proposing a hash
computation per packet, which costs in the area of 50-100 nanosec (?),
and on data which is cache cold (even with DDIO, you take the L3 cache
cost/hit).

Consider the increase in network hardware speeds.

Worst-case (pkt size 64 bytes) time between packets:
 *  10 Gbit/s -> 67.2 nanosec
 *  40 Gbit/s -> 16.8 nanosec
 * 100 Gbit/s ->  6.7 nanosec

Adding such a per packet cost is not going to fly.
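
(For reference: a minimum-sized frame occupies 64 bytes + 8 bytes
preamble + 12 bytes inter-frame gap = 84 bytes = 672 bits on the wire,
and 672 bits / 10 Gbit/s = 67.2 nanosec; the 40G and 100G numbers scale
accordingly.)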

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 11:27             ` Jesper Dangaard Brouer
@ 2016-01-21 12:49               ` Or Gerlitz
  2016-01-21 13:57                 ` Jesper Dangaard Brouer
  2016-01-21 18:56                 ` David Miller
  2016-01-21 16:38               ` Eric Dumazet
  2016-01-21 18:54               ` David Miller
  2 siblings, 2 replies; 59+ messages in thread
From: Or Gerlitz @ 2016-01-21 12:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Eric Dumazet, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Matan Barak

On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote:
>
>> eth_type_trans touches headers
>
> True, the eth_type_trans() call in the driver is a major bottleneck,
> because it touch the packet header and happens very early in the driver.
>
> In my experiments, where I extract several packet before calling
> napi_gro_receive(), and I also delay calling eth_type_trans().  Most of
> my speedup comes from this trick, as the prefetch() now that enough
> time.
>
>  while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) {
>         skb->protocol = eth_type_trans(skb, rq->netdev);
>         napi_gro_receive(cq->napi, skb);
>  }
>
> What is the HW could provide the info we need in the descriptor?!?
>
>
> eth_type_trans() does two things:
>
> 1) determine skb->protocol
> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>
> Could the HW descriptor deliver the "proto", or perhaps just some bits
> on the most common proto's?
>
> The skb->pkt_type don't need many bits.  And I bet the HW already have
> the information.  The BROADCAST and MULTICAST indication are easy.  The
> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
> indication, if the eth->h_dest match the devices dev->dev_addr (else a
> SW compare is required).
>
> Is that doable in hardware?

As I wrote earlier, for determining the eth-type, HW can do what you ask
here and more.

Whether the protocol is IP or not (and only then you look in the data)
you could get, I guess, from many NICs: e.g. if the NIC sets
PKT_HASH_TYPE_L4 or PKT_HASH_TYPE_L3 then we know it's an IP packet,
and only if we don't see this indication do we look into the data.

As for pkt_type, we can use the NIC steering HW to provide us a tag
saying whether it was our broadcast, other multicast, or "our" unicast.

Or.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 12:49               ` Or Gerlitz
@ 2016-01-21 13:57                 ` Jesper Dangaard Brouer
  2016-01-21 18:56                 ` David Miller
  1 sibling, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-21 13:57 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Tom Herbert, Eric Dumazet, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Matan Barak, brouer


On Thu, 21 Jan 2016 14:49:25 +0200 Or Gerlitz <gerlitz.or@gmail.com> wrote:

> On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote:
> >  
> >> eth_type_trans touches headers  
> >
> > True, the eth_type_trans() call in the driver is a major bottleneck,
> > because it touch the packet header and happens very early in the driver.
> >
> > In my experiments, where I extract several packet before calling
> > napi_gro_receive(), and I also delay calling eth_type_trans().  Most of
> > my speedup comes from this trick, as the prefetch() now that enough
> > time.
> >
> >  while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) {
> >         skb->protocol = eth_type_trans(skb, rq->netdev);
> >         napi_gro_receive(cq->napi, skb);
> >  }
> >
> > What is the HW could provide the info we need in the descriptor?!?
> >
> >
> > eth_type_trans() does two things:
> >
> > 1) determine skb->protocol
> > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
> >
> > Could the HW descriptor deliver the "proto", or perhaps just some bits
> > on the most common proto's?
> >
> > The skb->pkt_type don't need many bits.  And I bet the HW already have
> > the information.  The BROADCAST and MULTICAST indication are easy.  The
> > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
> > indication, if the eth->h_dest match the devices dev->dev_addr (else a
> > SW compare is required).
> >
> > Is that doable in hardware?  
> 
> As I wrote earlier, for determination of the eth-type HWs can do what
> you ask here and more.

That is great! Is this already being delivered in the descriptor?

> Protocol being IP or not (and only then you look in the data) you
> could get I guess from many NICs, e.g if the NIC sets PKT_HASH_TYPE_L4
> or PKT_HASH_TYPE_L3 then we know it's an IP packets and only if
> we don't see this indication we look into the data.

It is a good trick.  But at this very early stage we only need the
eth proto/type.  Once we get to processing the IP layer, the packet
data will have been pulled/prefetched into the L1 cache, thus the cost
of determining that should be almost free.


> As for pkt_type we can use NIC steering HW to provide us a tag saying
> if it was our broadcast, other multicast or "our" unicast.

That would be good. Does that conflict with other programming of the
NIC HW, or can we always have it turned on?

If we can pull this off, then we can do some very interesting cache
latency hiding! :-)  (In my perf top eth_type_trans() is one of the top
contenders, especially for your mlx5 driver).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 11:27             ` Jesper Dangaard Brouer
  2016-01-21 12:49               ` Or Gerlitz
@ 2016-01-21 16:38               ` Eric Dumazet
  2016-01-21 18:54               ` David Miller
  2 siblings, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-21 16:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Thu, 2016-01-21 at 12:27 +0100, Jesper Dangaard Brouer wrote:

> In my experiments, where I extract several packet before calling
> napi_gro_receive(), and I also delay calling eth_type_trans().  Most of
> my speedup comes from this trick, as the prefetch() now that enough
> time.

It really depends on the cpu.

Many cpus have very poor prefetch performance; prefetch instructions
are lazily defined by Intel/AMD.

The Ivy Bridge prefetcher, for example, is known to be not that good.

http://www.agner.org/optimize/blog/read.php?i=415

http://www.agner.org/optimize/blog/read.php?i=285

https://groups.google.com/forum/#!topic/comp.arch/71wnqr_F9sw

Really, refrain from adding stuff that might look good on one cpu.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 12:23             ` Jesper Dangaard Brouer
@ 2016-01-21 16:38               ` Tom Herbert
  2016-01-21 17:48                 ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-21 16:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Thu, Jan 21, 2016 at 4:23 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 20 Jan 2016 15:27:38 -0800
> Tom Herbert <tom@herbertland.com> wrote:
>
>> weaknesses of Toeplitz we talked about recently and that fact that
>> Jenkins is really fast to compute, I am starting to think maybe we
>> should always do a software hash and not rely on HW for it...
>
> Please don't enforce a software hash.  You are proposing a hash
> computation per packet which cost in the area 50-100 nanosec (?). And
> on data which is cache cold (even with DDIO, you take the L3 cache
> cost/hit).
>
I clock the Jenkins hash computation itself at ~6 nsecs (not counting
the cache miss), but your point is taken.

> Consider the increase in network hardware speeds.
>
> Worst-case (pkt size 64 bytes) time between packets:
>  *  10 Gbit/s -> 67.2 nanosec
>  *  40 Gbit/s -> 16.8 nanosec
>  * 100 Gbit/s ->  6.7 nanosec
>
> Adding such a per packet cost is not going to fly.
>
Sure, but the receive path is parallelized. Improving parallelism has
consistently been shown to have much more impact than attempting to
optimize for cache misses. The primary goal is not to drive 100Gbps
with 64-byte packets from a single CPU. It is one benchmark of many we
should look at to measure efficiency of the data path, but I've yet to
see any real workload that requires that...

Regardless of anything, we need to load packet headers into CPU cache
to do protocol processing. I'm not sure I see how trying to defer that
as long as possible helps except in cases where the packet is crossing
CPU cache boundaries and can eliminate cache misses completely (not
just move them around from one function to another).

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 16:38               ` Tom Herbert
@ 2016-01-21 17:48                 ` Eric Dumazet
  2016-01-22 12:33                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2016-01-21 17:48 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:

> Sure, but the receive path is parallelized.

This is true for multiqueue processing, assuming you can dedicate many
cores to process RX.

>  Improving parallelism has
> continuously shown to have much more impact than attempting to
> optimize for cache misses. The primary goal is not to drive 100Gbps
> with 64 packets from a single CPU. It is one benchmark of many we
> should look at to measure efficiency of the data path, but I've yet to
> see any real workload that requires that...
> 
> Regardless of anything, we need to load packet headers into CPU cache
> to do protocol processing. I'm not sure I see how trying to defer that
> as long as possible helps except in cases where the packet is crossing
> CPU cache boundaries and can eliminate cache misses completely (not
> just move them around from one function to another).

Note that some user space programs use multiple cores (or hyper threads)
to implement a pipeline, using a single RX queue.

One thread can handle one stage (device RX drain) and prefetch data into
the shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads).

The second thread processes packets with headers already in L1/L2.

This way, the ~100 ns (or even more if you also consider skb
allocations) penalty to bring in packet headers does not hurt PPS.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 11:27             ` Jesper Dangaard Brouer
  2016-01-21 12:49               ` Or Gerlitz
  2016-01-21 16:38               ` Eric Dumazet
@ 2016-01-21 18:54               ` David Miller
  2016-01-24 14:28                 ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 59+ messages in thread
From: David Miller @ 2016-01-21 18:54 UTC (permalink / raw)
  To: brouer
  Cc: tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck,
	alexei.starovoitov, borkmann, marek, hannes, fw, pabeni,
	john.r.fastabend, amirva

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 21 Jan 2016 12:27:30 +0100

> eth_type_trans() does two things:
> 
> 1) determine skb->protocol
> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
> 
> Could the HW descriptor deliver the "proto", or perhaps just some bits
> on the most common proto's?
> 
> The skb->pkt_type don't need many bits.  And I bet the HW already have
> the information.  The BROADCAST and MULTICAST indication are easy.  The
> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
> indication, if the eth->h_dest match the devices dev->dev_addr (else a
> SW compare is required).
> 
> Is that doable in hardware?

I feel like we've had this discussion before several years ago.

I think having just the protocol value would be enough.

skb->pkt_type we could deal with by always using an accessor and
evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
similar.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 12:49               ` Or Gerlitz
  2016-01-21 13:57                 ` Jesper Dangaard Brouer
@ 2016-01-21 18:56                 ` David Miller
  2016-01-21 22:45                   ` Or Gerlitz
  1 sibling, 1 reply; 59+ messages in thread
From: David Miller @ 2016-01-21 18:56 UTC (permalink / raw)
  To: gerlitz.or
  Cc: brouer, tom, eric.dumazet, edumazet, netdev, alexander.duyck,
	alexei.starovoitov, borkmann, marek, hannes, fw, pabeni,
	john.r.fastabend, amirva, matanb

From: Or Gerlitz <gerlitz.or@gmail.com>
Date: Thu, 21 Jan 2016 14:49:25 +0200

> On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote:
>>
>>> eth_type_trans touches headers
>>
>> True, the eth_type_trans() call in the driver is a major bottleneck,
>> because it touch the packet header and happens very early in the driver.
>>
>> In my experiments, where I extract several packet before calling
>> napi_gro_receive(), and I also delay calling eth_type_trans().  Most of
>> my speedup comes from this trick, as the prefetch() now that enough
>> time.
>>
>>  while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) {
>>         skb->protocol = eth_type_trans(skb, rq->netdev);
>>         napi_gro_receive(cq->napi, skb);
>>  }
>>
>> What is the HW could provide the info we need in the descriptor?!?
>>
>>
>> eth_type_trans() does two things:
>>
>> 1) determine skb->protocol
>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>>
>> Could the HW descriptor deliver the "proto", or perhaps just some bits
>> on the most common proto's?
>>
>> The skb->pkt_type don't need many bits.  And I bet the HW already have
>> the information.  The BROADCAST and MULTICAST indication are easy.  The
>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
>> indication, if the eth->h_dest match the devices dev->dev_addr (else a
>> SW compare is required).
>>
>> Is that doable in hardware?
> 
> As I wrote earlier, for determination of the eth-type HWs can do what you ask
> here and more.
> 
> Protocol being IP or not (and only then you look in the data) you could
> get I guess from many NICs, e.g if the NIC sets PKT_HASH_TYPE_L4
> or PKT_HASH_TYPE_L3 then we know it's an IP packets and only if
> we don't see this indication we look into the data.

This doesn't differentiate ipv4 vs. ipv6, which is critical here, so this
mechanism is not sufficient.

We must know the exact ETH_P_* value.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 18:56                 ` David Miller
@ 2016-01-21 22:45                   ` Or Gerlitz
  2016-01-21 22:59                     ` David Miller
  0 siblings, 1 reply; 59+ messages in thread
From: Or Gerlitz @ 2016-01-21 22:45 UTC (permalink / raw)
  To: David Miller
  Cc: Jesper Dangaard Brouer, Tom Herbert, Eric Dumazet, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Matan Barak

On Thu, Jan 21, 2016 at 8:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Or Gerlitz <gerlitz.or@gmail.com>
> Date: Thu, 21 Jan 2016 14:49:25 +0200
>
>> On Thu, Jan 21, 2016 at 1:27 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> On Wed, 20 Jan 2016 15:27:38 -0800 Tom Herbert <tom@herbertland.com> wrote:
>>>
>>>> eth_type_trans touches headers
>>>
>>> True, the eth_type_trans() call in the driver is a major bottleneck,
>>> because it touch the packet header and happens very early in the driver.
>>>
>>> In my experiments, where I extract several packet before calling
>>> napi_gro_receive(), and I also delay calling eth_type_trans().  Most of
>>> my speedup comes from this trick, as the prefetch() now that enough
>>> time.
>>>
>>>  while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) {
>>>         skb->protocol = eth_type_trans(skb, rq->netdev);
>>>         napi_gro_receive(cq->napi, skb);
>>>  }
>>>
>>> What is the HW could provide the info we need in the descriptor?!?
>>>
>>>
>>> eth_type_trans() does two things:
>>>
>>> 1) determine skb->protocol
>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>>>
>>> Could the HW descriptor deliver the "proto", or perhaps just some bits
>>> on the most common proto's?
>>>
>>> The skb->pkt_type don't need many bits.  And I bet the HW already have
>>> the information.  The BROADCAST and MULTICAST indication are easy.  The
>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a
>>> SW compare is required).
>>>
>>> Is that doable in hardware?
>>
>> As I wrote earlier, for determination of the eth-type HWs can do what you ask
>> here and more.
>>
>> Protocol being IP or not (and only then you look in the data) you could
>> get I guess from many NICs, e.g if the NIC sets PKT_HASH_TYPE_L4
>> or PKT_HASH_TYPE_L3 then we know it's an IP packets and only if
>> we don't see this indication we look into the data.
>
> This doesn't differentiate ipv4 vs. ipv6 which is critical here, so this
> mechanism is not sufficient.

Dave, at least on the ConnectX4 (mlx5e driver), as I commented earlier
in this thread, we can use programmed tags reported by the HW in the
packet completion to tell whether the ethtype is IPv4, IPv6 or
something else, and let the kernel branch into reading the packet
memory only in the last case.

> We must know the exact ETH_P_* value.
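
As a rough sketch of what I mean -- the cqe accessors and the rx_cqe
type below are made-up placeholders, not existing mlx5e API, and a
dest-MAC hint would still be needed to get skb->pkt_type fully right:

  #include <linux/etherdevice.h>
  #include <linux/if_ether.h>
  #include <linux/if_packet.h>
  #include <linux/skbuff.h>

  struct rx_cqe;  /* placeholder for the HW completion descriptor */

  /* called from the driver RX completion path */
  static void rx_set_protocol(struct sk_buff *skb,
                              struct net_device *netdev,
                              const struct rx_cqe *cqe)
  {
          if (cqe_has_l3_hint(cqe)) {
                  /* common case: never touch the packet header */
                  skb->protocol = cqe_l3_is_ipv6(cqe) ? htons(ETH_P_IPV6)
                                                      : htons(ETH_P_IP);
                  skb->dev = netdev;
                  skb_reset_mac_header(skb);
                  __skb_pull(skb, ETH_HLEN);   /* as eth_type_trans() does */
                  skb->pkt_type = PACKET_HOST; /* needs a dest-MAC hint too */
          } else {
                  /* rare case: fall back to reading the header */
                  skb->protocol = eth_type_trans(skb, netdev);
          }
  }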

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 22:45                   ` Or Gerlitz
@ 2016-01-21 22:59                     ` David Miller
  0 siblings, 0 replies; 59+ messages in thread
From: David Miller @ 2016-01-21 22:59 UTC (permalink / raw)
  To: gerlitz.or
  Cc: brouer, tom, eric.dumazet, edumazet, netdev, alexander.duyck,
	alexei.starovoitov, borkmann, marek, hannes, fw, pabeni,
	john.r.fastabend, amirva, matanb

From: Or Gerlitz <gerlitz.or@gmail.com>
Date: Fri, 22 Jan 2016 00:45:13 +0200

> Dave, at least in the ConnectX4 (mlx5e driver), as I commented earlier
> on this thread, we can use programmed tags reported by the HW on the
> completion of packets  whether the ethtype is ipv4 or ipv6 or
> something else, and let the kernel
> branch look into the packet memory on in the last case.

Fair enough.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 17:48                 ` Eric Dumazet
@ 2016-01-22 12:33                   ` Jesper Dangaard Brouer
  2016-01-22 14:33                     ` Eric Dumazet
  2016-01-22 17:07                     ` Tom Herbert
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-22 12:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	brouer

On Thu, 21 Jan 2016 09:48:36 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
> 
> > Sure, but the receive path is parallelized.  
> 
> This is true for multiqueue processing, assuming you can dedicate many
> cores to process RX.
> 
> >  Improving parallelism has
> > continuously shown to have much more impact than attempting to
> > optimize for cache misses. The primary goal is not to drive 100Gbps
> > with 64 packets from a single CPU. It is one benchmark of many we
> > should look at to measure efficiency of the data path, but I've yet to
> > see any real workload that requires that...
> > 
> > Regardless of anything, we need to load packet headers into CPU cache
> > to do protocol processing. I'm not sure I see how trying to defer that
> > as long as possible helps except in cases where the packet is crossing
> > CPU cache boundaries and can eliminate cache misses completely (not
> > just move them around from one function to another).  
> 
> Note that some user space use multiple core (or hyper threads) to
> implement a pipeline, using a single RX queue.
> 
> One thread can handle one stage (device RX drain) and prefetch data into
> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
> 
> The second thread process packets with headers already in L1/L2

I agree. I've heard of setups where DPDK users dedicate 2 cores to RX
and 1 core to TX, and achieve 10G wirespeed (14Mpps) real IPv4
forwarding with full Internet routing table lookups.

One of the ideas behind my alf_queue is that it can be used for
efficiently distributing objects (pointers) between threads:
1. because it only transfers the pointers (never touching the objects), and
2. because it enqueues/dequeues multiple objects with a single locked cmpxchg,
thus lowering the message-passing cost between threads.
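
To illustrate point 2, here is a much-simplified userspace sketch of
the multi-producer enqueue side (not the real alf_queue code; the
consumer side and memory-ordering details are omitted):

  #include <stdatomic.h>
  #include <stdbool.h>

  #define RING_SIZE 1024                  /* must be a power of two */

  struct bulk_ring {
          _Atomic unsigned int prod_head; /* slots reserved by producers */
          _Atomic unsigned int prod_tail; /* slots visible to consumers */
          _Atomic unsigned int cons_head; /* consumer progress */
          void *objs[RING_SIZE];
  };

  static bool bulk_enqueue(struct bulk_ring *r, void **objs, unsigned int n)
  {
          unsigned int head, next;

          do {    /* one cmpxchg reserves room for all n pointers */
                  head = atomic_load(&r->prod_head);
                  if (RING_SIZE - (head - atomic_load(&r->cons_head)) < n)
                          return false;   /* not enough free slots */
                  next = head + n;
          } while (!atomic_compare_exchange_weak(&r->prod_head, &head, next));

          for (unsigned int i = 0; i < n; i++)
                  r->objs[(head + i) & (RING_SIZE - 1)] = objs[i];

          /* publish in order: wait for earlier producers to finish */
          while (atomic_load(&r->prod_tail) != head)
                  ;
          atomic_store(&r->prod_tail, next);
          return true;
  }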


> This way, the ~100 ns (or even more if you also consider skb
> allocations) penalty to bring packet headers do not hurt PPS.

I've studied the allocation cost in great detail, so let me share my
numbers; 100 ns is too high:

Total cost of alloc+free for 256-byte objects (on an i7-4790K CPU @ 4.00GHz).
The cycle counts should be comparable with other CPUs, but the nanosecond
measurements are skewed by the very high clock frequency of this CPU.

Kmem_cache fastpath "recycle" case:
 SLUB => 44 cycles(tsc) 11.205 ns
 SLAB => 96 cycles(tsc) 24.119 ns.

The problem is that real use-cases in the network stack almost always
hit the slowpath in the kmem_cache allocators.

Kmem_cache "slowpath" case:
 SLUB => 117 cycles(tsc) 29.276 ns
 SLAB => 101 cycles(tsc) 25.342 ns

I've addressed this "slowpath" problem in the SLUB and SLAB allocators
by introducing a bulk API, which amortizes the needed synchronization.

Kmem_cache using bulk API:
 SLUB => 37 cycles(tsc) 9.280 ns
 SLAB => 20 cycles(tsc) 5.035 ns
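
For reference, a minimal sketch of how the bulk API is meant to be
used (assumes a kernel that has kmem_cache_alloc_bulk()/
kmem_cache_free_bulk(); batch size and the demo function are just
examples):

  #include <linux/slab.h>
  #include <linux/errno.h>

  #define RX_BULK 16

  static int rx_refill_bulk(struct kmem_cache *cache)
  {
          void *objs[RX_BULK];
          int n;

          /* One call fills the whole array; the allocator's internal
           * synchronization cost is paid once per batch, not per object. */
          n = kmem_cache_alloc_bulk(cache, GFP_ATOMIC, RX_BULK, objs);
          if (!n)
                  return -ENOMEM;

          /* ... hand the n objects to the RX path ... */

          kmem_cache_free_bulk(cache, n, objs);  /* bulk free, same idea */
          return 0;
  }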


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-22 12:33                   ` Jesper Dangaard Brouer
@ 2016-01-22 14:33                     ` Eric Dumazet
  2016-01-22 17:07                     ` Tom Herbert
  1 sibling, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-22 14:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Fri, 2016-01-22 at 13:33 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 21 Jan 2016 09:48:36 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
> > 
> > > Sure, but the receive path is parallelized.  
> > 
> > This is true for multiqueue processing, assuming you can dedicate many
> > cores to process RX.
> > 
> > >  Improving parallelism has
> > > continuously shown to have much more impact than attempting to
> > > optimize for cache misses. The primary goal is not to drive 100Gbps
> > > with 64 packets from a single CPU. It is one benchmark of many we
> > > should look at to measure efficiency of the data path, but I've yet to
> > > see any real workload that requires that...
> > > 
> > > Regardless of anything, we need to load packet headers into CPU cache
> > > to do protocol processing. I'm not sure I see how trying to defer that
> > > as long as possible helps except in cases where the packet is crossing
> > > CPU cache boundaries and can eliminate cache misses completely (not
> > > just move them around from one function to another).  
> > 
> > Note that some user space use multiple core (or hyper threads) to
> > implement a pipeline, using a single RX queue.
> > 
> > One thread can handle one stage (device RX drain) and prefetch data into
> > shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
> > 
> > The second thread process packets with headers already in L1/L2
> 
> I agree. I've heard experiences where DPDK users use 2 core for RX, and
> 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
> with full Internet routing table look up.
> 
> One of the ideas behind my alf_queue, is that it can be used for
> efficiently distributing object (pointers) between threads.
> 1. because it only transfers the pointers (not touching object), and
> 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
> Thus, lower in the message passing cost between threads.
> 
> 
> > This way, the ~100 ns (or even more if you also consider skb
> > allocations) penalty to bring packet headers do not hurt PPS.
> 
> I've studied the allocation cost in great detail, thus let me share my
> numbers, 100 ns is too high:
> 
> Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
> The cycles count should be comparable with other CPUs, but that nanosec
> measurement is affected by the very high clock freq of this CPU.
> 
> Kmem_cache fastpath "recycle" case:
>  SLUB => 44 cycles(tsc) 11.205 ns
>  SLAB => 96 cycles(tsc) 24.119 ns.
> 
> The problem is that real use-cases in the network stack, almost always
> hit the slowpath in kmem_cache allocators.
> 
> Kmem_cache "slowpath" case:
>  SLUB => 117 cycles(tsc) 29.276 ns
>  SLAB => 101 cycles(tsc) 25.342 ns
> 
> I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
> by introducing a bulk API, which amortize the needed sync-mechanisms.
> 
> Kmem_cache using bulk API:
>  SLUB => 37 cycles(tsc) 9.280 ns
>  SLAB => 20 cycles(tsc) 5.035 ns


Your numbers are nice, but the reality for most applications is that
they run on hosts with ~72 hyperthreads, soon to be ~128.

(Two physical sockets, with their corresponding memory.)

The perf numbers show about a 100 ns penalty per cache-line miss when
all these threads perform real work and the applications are properly
tuned, because it is very rare for the whole working set to be in the
caches.

In the following real case, we can see these numbers.

$ perf guncore -M miss_lat_rem,miss_lat_loc
#------------------------------------------------------------------------------------
#                Socket0                  |                Socket1                  |
#------------------------------------------------------------------------------------
# Load Miss Latency  | Load Miss Latency  | Load Miss Latency  | Load Miss Latency  |
#     Remote RAM     |     Local RAM      |     Remote RAM     |     Local RAM      |
#                  ns|                  ns|                  ns|                  ns|
#------------------------------------------------------------------------------------
               162.25               130.61               173.74               116.80
               162.40               130.41               173.33               116.59
               163.11               132.28               175.90               117.09
               163.36               132.86               176.69               117.45
               161.92               130.32               173.20               117.35
               163.46               130.99               174.80               117.42
               163.54               130.55               174.09               117.26
               163.29               129.75               173.84               117.36
               162.38               130.31               173.44               117.18
               163.00               130.81               174.47               117.24

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-22 12:33                   ` Jesper Dangaard Brouer
  2016-01-22 14:33                     ` Eric Dumazet
@ 2016-01-22 17:07                     ` Tom Herbert
  2016-01-22 17:17                       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-22 17:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Fri, Jan 22, 2016 at 4:33 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 21 Jan 2016 09:48:36 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
>>
>> > Sure, but the receive path is parallelized.
>>
>> This is true for multiqueue processing, assuming you can dedicate many
>> cores to process RX.
>>
>> >  Improving parallelism has
>> > continuously shown to have much more impact than attempting to
>> > optimize for cache misses. The primary goal is not to drive 100Gbps
>> > with 64 packets from a single CPU. It is one benchmark of many we
>> > should look at to measure efficiency of the data path, but I've yet to
>> > see any real workload that requires that...
>> >
>> > Regardless of anything, we need to load packet headers into CPU cache
>> > to do protocol processing. I'm not sure I see how trying to defer that
>> > as long as possible helps except in cases where the packet is crossing
>> > CPU cache boundaries and can eliminate cache misses completely (not
>> > just move them around from one function to another).
>>
>> Note that some user space use multiple core (or hyper threads) to
>> implement a pipeline, using a single RX queue.
>>
>> One thread can handle one stage (device RX drain) and prefetch data into
>> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
>>
>> The second thread process packets with headers already in L1/L2
>
> I agree. I've heard experiences where DPDK users use 2 core for RX, and
> 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
> with full Internet routing table look up.
>
> One of the ideas behind my alf_queue, is that it can be used for
> efficiently distributing object (pointers) between threads.
> 1. because it only transfers the pointers (not touching object), and
> 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
> Thus, lower in the message passing cost between threads.
>
>
>> This way, the ~100 ns (or even more if you also consider skb
>> allocations) penalty to bring packet headers do not hurt PPS.
>
> I've studied the allocation cost in great detail, thus let me share my
> numbers, 100 ns is too high:
>
> Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
> The cycles count should be comparable with other CPUs, but that nanosec
> measurement is affected by the very high clock freq of this CPU.
>
> Kmem_cache fastpath "recycle" case:
>  SLUB => 44 cycles(tsc) 11.205 ns
>  SLAB => 96 cycles(tsc) 24.119 ns.
>
> The problem is that real use-cases in the network stack, almost always
> hit the slowpath in kmem_cache allocators.
>
> Kmem_cache "slowpath" case:
>  SLUB => 117 cycles(tsc) 29.276 ns
>  SLAB => 101 cycles(tsc) 25.342 ns
>
> I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
> by introducing a bulk API, which amortize the needed sync-mechanisms.
>
> Kmem_cache using bulk API:
>  SLUB => 37 cycles(tsc) 9.280 ns
>  SLAB => 20 cycles(tsc) 5.035 ns
>
Hi Jesper,

I am a little confused. I believe the 100ns hit refers specifically to
the cache miss on packet headers. Memory object allocation seems like a
different problem; its latency might depend on cache misses, but not on
packet data (which we seem to assume is always a cache miss).
For the cache-miss problem on the packet headers, I think we really
need to evaluate whether DDIO adequately solves it (need more
numbers :) ). As I read it, DDIO is enabled by default since Sandy
Bridge-EP and is transparent to both HW and SW. It seems like we
should have seen some sort of measurable benefit by now...

Tom

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-22 17:07                     ` Tom Herbert
@ 2016-01-22 17:17                       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-22 17:17 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, Or Gerlitz, David Miller, Eric Dumazet,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	brouer

On Fri, 22 Jan 2016 09:07:43 -0800
Tom Herbert <tom@herbertland.com> wrote:

> On Fri, Jan 22, 2016 at 4:33 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Thu, 21 Jan 2016 09:48:36 -0800
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >  
> >> On Thu, 2016-01-21 at 08:38 -0800, Tom Herbert wrote:
> >>  
> >> > Sure, but the receive path is parallelized.  
> >>
> >> This is true for multiqueue processing, assuming you can dedicate many
> >> cores to process RX.
> >>  
> >> >  Improving parallelism has
> >> > continuously shown to have much more impact than attempting to
> >> > optimize for cache misses. The primary goal is not to drive 100Gbps
> >> > with 64 packets from a single CPU. It is one benchmark of many we
> >> > should look at to measure efficiency of the data path, but I've yet to
> >> > see any real workload that requires that...
> >> >
> >> > Regardless of anything, we need to load packet headers into CPU cache
> >> > to do protocol processing. I'm not sure I see how trying to defer that
> >> > as long as possible helps except in cases where the packet is crossing
> >> > CPU cache boundaries and can eliminate cache misses completely (not
> >> > just move them around from one function to another).  
> >>
> >> Note that some user space use multiple core (or hyper threads) to
> >> implement a pipeline, using a single RX queue.
> >>
> >> One thread can handle one stage (device RX drain) and prefetch data into
> >> shared L1/L2 (and/or shared L3 for pipelines with more than 2 threads)
> >>
> >> The second thread process packets with headers already in L1/L2  
> >
> > I agree. I've heard experiences where DPDK users use 2 core for RX, and
> > 1 core for TX, and achieve 10G wirespeed (14Mpps) real IPv4 forwarding
> > with full Internet routing table look up.
> >
> > One of the ideas behind my alf_queue, is that it can be used for
> > efficiently distributing object (pointers) between threads.
> > 1. because it only transfers the pointers (not touching object), and
> > 2. because it enqueue/dequeue multiple objects with a single locked cmpxchg.
> > Thus, lower in the message passing cost between threads.
> >
> >  
> >> This way, the ~100 ns (or even more if you also consider skb
> >> allocations) penalty to bring packet headers do not hurt PPS.  
> >
> > I've studied the allocation cost in great detail, thus let me share my
> > numbers, 100 ns is too high:
> >
> > Total cost of alloc+free for 256 byte objects (on CPU i7-4790K @ 4.00GHz).
> > The cycles count should be comparable with other CPUs, but that nanosec
> > measurement is affected by the very high clock freq of this CPU.
> >
> > Kmem_cache fastpath "recycle" case:
> >  SLUB => 44 cycles(tsc) 11.205 ns
> >  SLAB => 96 cycles(tsc) 24.119 ns.
> >
> > The problem is that real use-cases in the network stack, almost always
> > hit the slowpath in kmem_cache allocators.
> >
> > Kmem_cache "slowpath" case:
> >  SLUB => 117 cycles(tsc) 29.276 ns
> >  SLAB => 101 cycles(tsc) 25.342 ns
> >
> > I've addressed this "slowpath" problem in the SLUB and SLAB allocators,
> > by introducing a bulk API, which amortize the needed sync-mechanisms.
> >
> > Kmem_cache using bulk API:
> >  SLUB => 37 cycles(tsc) 9.280 ns
> >  SLAB => 20 cycles(tsc) 5.035 ns
> >  
> Hi Jesper,
> 
> I am a little confused. I believe the 100ns hit refers specifically
> cache miss on packet headers. 

Sorry, I misread Eric's statement.  You are right.

> Memory object allocation seems like different problem;

Yes, it is; I just misread it and thought we were talking about memory
object alloc overhead. Sorry for the confusion.


> the latency might depend on cache misses, but it's
> not on packet data (which we seem to assume is always a cache miss).
> For the cache miss problem on the packet headers I think we really
> need to evaluate whether DDIO adequately solves the it (need more
> numbers :) ). As I read it, DDIO is enabled by default since Sandy
> Bridge-EP and is transparent to both HW and SW. It seems like we
> should have seen some sort of measurable benefit by now...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-21 18:54               ` David Miller
@ 2016-01-24 14:28                 ` Jesper Dangaard Brouer
  2016-01-24 14:44                   ` Michael S. Tsirkin
  2016-01-24 20:09                   ` Optimizing instruction-cache, more packets at each stage Tom Herbert
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-24 14:28 UTC (permalink / raw)
  To: David Miller
  Cc: tom, eric.dumazet, gerlitz.or, edumazet, netdev, alexander.duyck,
	alexei.starovoitov, borkmann, marek, hannes, fw, pabeni,
	john.r.fastabend, amirva, brouer, Michael S. Tsirkin

On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Thu, 21 Jan 2016 12:27:30 +0100
> 
> > eth_type_trans() does two things:
> > 
> > 1) determine skb->protocol
> > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
> > 
> > Could the HW descriptor deliver the "proto", or perhaps just some bits
> > on the most common proto's?
> > 
> > The skb->pkt_type don't need many bits.  And I bet the HW already have
> > the information.  The BROADCAST and MULTICAST indication are easy.  The
> > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
> > indication, if the eth->h_dest match the devices dev->dev_addr (else a
> > SW compare is required).
> > 
> > Is that doable in hardware?  
> 
> I feel like we've had this discussion before several years ago.
> 
> I think having just the protocol value would be enough.
> 
> skb->pkt_type we could deal with by using always an accessor and
> evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
> similar.

At first I liked the idea of delaying the eval of skb->pkt_type.

BUT then I realized: what if we take this even further?  What if we
actually use this information for something useful, at this very
early RX stage?

The information I'm interested in, from the HW descriptor, is whether
this packet is NOT for local delivery.  If so, we can send the packet
down a "fast-forward" code path.

Think about bridging packets to a guest OS.  Because we know very
early at RX (from the packet's HW descriptor), we might even avoid
allocating an SKB.  We could just "forward" the packet-page to the
guest OS.

Taking Eric's idea of remote CPUs, we could even send these
packet-pages to a remote CPU (e.g. where the guest OS is running)
without having touched a single cache-line of the packet data.  I
would still bundle them up first, to amortize the (100-133ns) cost of
transferring something to another CPU.

The data-cache trick would be to instruct the prefetcher to only
prefetch into L3 or L2 when these packets are destined for a remote
CPU.  At least Intel CPUs have prefetch operations that target only
the L2/L3 cache.
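
Something along these lines (the locality argument of the GCC builtin
maps on x86 to the prefetcht0/t1/t2 hints; the helper names are mine,
not an existing kernel API):

  /* Packet will be consumed on a remote CPU: pull the header toward
   * the shared L3/L2, but keep it out of this CPU's L1.  On x86,
   * locality 1 roughly corresponds to prefetcht2. */
  static inline void prefetch_for_remote_cpu(const void *hdr)
  {
          __builtin_prefetch(hdr, 0, 1);
  }

  /* Packet is for local delivery: prefetch into all cache levels
   * (locality 3 ~ prefetcht0). */
  static inline void prefetch_for_local_cpu(const void *hdr)
  {
          __builtin_prefetch(hdr, 0, 3);
  }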


Maybe we need a combined solution: lazily evaluate skb->pkt_type for
local delivery, but set the information if available from the HW
descriptor.  And the fast page-forward path doesn't even need an SKB.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-24 14:28                 ` Jesper Dangaard Brouer
@ 2016-01-24 14:44                   ` Michael S. Tsirkin
  2016-01-24 17:28                     ` John Fastabend
  2016-01-24 20:09                   ` Optimizing instruction-cache, more packets at each stage Tom Herbert
  1 sibling, 1 reply; 59+ messages in thread
From: Michael S. Tsirkin @ 2016-01-24 14:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, tom, eric.dumazet, gerlitz.or, edumazet, netdev,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, amirva

On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote:
> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date: Thu, 21 Jan 2016 12:27:30 +0100
> > 
> > > eth_type_trans() does two things:
> > > 
> > > 1) determine skb->protocol
> > > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
> > > 
> > > Could the HW descriptor deliver the "proto", or perhaps just some bits
> > > on the most common proto's?
> > > 
> > > The skb->pkt_type don't need many bits.  And I bet the HW already have
> > > the information.  The BROADCAST and MULTICAST indication are easy.  The
> > > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
> > > indication, if the eth->h_dest match the devices dev->dev_addr (else a
> > > SW compare is required).
> > > 
> > > Is that doable in hardware?  
> > 
> > I feel like we've had this discussion before several years ago.
> > 
> > I think having just the protocol value would be enough.
> > 
> > skb->pkt_type we could deal with by using always an accessor and
> > evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
> > similar.
> 
> First I thought, I liked the idea delaying the eval of skb->pkt_type.
> 
> BUT then I realized, what if we take this even further.  What if we
> actually use this information, for something useful, at this very
> early RX stage.
> 
> The information I'm interested in, from the HW descriptor, is if this
> packet is NOT for local delivery.  If so, we can send the packet on a
> "fast-forward" code path.
> 
> Think about bridging packets to a guest OS.  Because we know very
> early at RX (from packet HW descriptor) we might even avoid allocating
> a SKB.  We could just "forward" the packet-page to the guest OS.

OK, so you would build a new kind of rx handler, and then
e.g. macvtap could maybe get packets this way?
Sure - e.g. vhost expects an skb at the moment
but it won't be too hard to teach it that there's
some other option.

Or maybe some kind of stub skb that just has
the correct length but no data is easier,
I'm not sure.

> Taking Eric's idea, of remote CPUs, we could even send these
> packet-pages to a remote CPU (e.g. where the guest OS is running),
> without having touched a single cache-line in the packet-data.  I
> would still bundle them up first, to amortize the (100-133ns) cost of
> transferring something to another CPU.

This bundling would have to happen in a guest
specific way then, so in vhost.
I'd be curious to see what you come up with.

> The data-cache trick, would be to instruct prefetcher only to start
> prefetching to L3 or L2, when these packet are destined for a remote
> CPU.  At-least Intel CPUs have prefetch operations that specify only
> L2/L3 cache.
> 
> 
> Maybe, we need a combined solution.  Lazy eval skb->pkt_type, for
> local delivery, but set the information if avail from HW desc.  And
> fast page-forward don't even need a SKB.
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-24 14:44                   ` Michael S. Tsirkin
@ 2016-01-24 17:28                     ` John Fastabend
  2016-01-25 13:15                       ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: John Fastabend @ 2016-01-24 17:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jesper Dangaard Brouer
  Cc: David Miller, tom, eric.dumazet, gerlitz.or, edumazet, netdev,
	alexander.duyck, alexei.starovoitov, borkmann, marek, hannes, fw,
	pabeni, john.r.fastabend, amirva, Daniel Borkmann, vyasevich

On 16-01-24 06:44 AM, Michael S. Tsirkin wrote:
> On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote:
>> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
>> David Miller <davem@davemloft.net> wrote:
>>
>>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>>> Date: Thu, 21 Jan 2016 12:27:30 +0100
>>>
>>>> eth_type_trans() does two things:
>>>>
>>>> 1) determine skb->protocol
>>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>>>>
>>>> Could the HW descriptor deliver the "proto", or perhaps just some bits
>>>> on the most common proto's?
>>>>
>>>> The skb->pkt_type don't need many bits.  And I bet the HW already have
>>>> the information.  The BROADCAST and MULTICAST indication are easy.  The
>>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
>>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a
>>>> SW compare is required).
>>>>
>>>> Is that doable in hardware?  
>>>
>>> I feel like we've had this discussion before several years ago.
>>>
>>> I think having just the protocol value would be enough.
>>>
>>> skb->pkt_type we could deal with by using always an accessor and
>>> evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
>>> similar.
>>
>> First I thought, I liked the idea delaying the eval of skb->pkt_type.
>>
>> BUT then I realized, what if we take this even further.  What if we
>> actually use this information, for something useful, at this very
>> early RX stage.
>>
>> The information I'm interested in, from the HW descriptor, is if this
>> packet is NOT for local delivery.  If so, we can send the packet on a
>> "fast-forward" code path.
>>
>> Think about bridging packets to a guest OS.  Because we know very
>> early at RX (from packet HW descriptor) we might even avoid allocating
>> a SKB.  We could just "forward" the packet-page to the guest OS.
> 
> OK, so you would build a new kind of rx handler, and then
> e.g. macvtap could maybe get packets this way?
> Sure - e.g. vhost expects an skb at the moment
> but it won't be too hard to teach it that there's
> some other option.

+ Daniel, Vlad

If you use the macvtap device with the offload features, you can "know"
via MAC address that all packets on a specific hardware queue set belong
to a specific guest (the queues are bound to a new netdev). This works
well with the passthru mode of macvlan, so you can do hardware bridging
this way. Supporting similar L3 modes (probably not via macvlan) has
been on my todo list for a while, but I haven't gotten there yet. The
ixgbe and fm10k Intel drivers support this now, maybe others too, but
those are the two I've worked with recently.

The idea here is that you remove the overhead of running the bridge
code, etc., while still allowing users to stick netfilter, qos, etc.
hooks in the datapath.

Also, Daniel and I started working on a zero-copy RX mode which would
further help this by letting vhost-net pass down a set of DMA buffers;
we should probably get this working and submit it. IIRC Vlad also
had the same sort of idea. The initial data for this looked good, but
not as good as the solution below. However, it had a similar issue as
the one below, in that you just jump over netfilter, qos, etc. Our
initial implementation used af_packet.

> 
> Or maybe some kind of stub skb that just has
> the correct length but no data is easier,
> I'm not sure.
> 

Another option is to use perfect filters to push traffic to a VF and
then map the VF into user space and use the vhost dpdk bits. This
works fairly well and gets packets into the guest with little
hypervisor overhead and no(?) kernel network stack overhead. But the
trade-off is that you cut out netfilter, qos, etc. This is really
slick if you "trust" your guest or have enough ACLs/etc. in your
hardware to "trust" the guest.

A compromise is to use a VF and not unbind it from the OS; then
you can use macvtap again and map the netdev 1:1 to a guest. With
this mode you can still use your netfilter, qos, etc. but do l2,l3,l4
hardware forwarding with perfect filters.

As an aside, if you don't like ethtool perfect filters, I have a set
of patches to control this via 'tc' that I'll submit when net-next
opens up again, which would let you filter on more field options
using offset:mask:value notation.

>> Taking Eric's idea, of remote CPUs, we could even send these
>> packet-pages to a remote CPU (e.g. where the guest OS is running),
>> without having touched a single cache-line in the packet-data.  I
>> would still bundle them up first, to amortize the (100-133ns) cost of
>> transferring something to another CPU.
> 
> This bundling would have to happen in a guest
> specific way then, so in vhost.
> I'd be curious to see what you come up with.
> 
>> The data-cache trick, would be to instruct prefetcher only to start
>> prefetching to L3 or L2, when these packet are destined for a remote
>> CPU.  At-least Intel CPUs have prefetch operations that specify only
>> L2/L3 cache.
>>
>>
>> Maybe, we need a combined solution.  Lazy eval skb->pkt_type, for
>> local delivery, but set the information if avail from HW desc.  And
>> fast page-forward don't even need a SKB.
>>
>> -- 
>> Best regards,
>>   Jesper Dangaard Brouer
>>   MSc.CS, Principal Kernel Engineer at Red Hat
>>   Author of http://www.iptv-analyzer.org
>>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-24 14:28                 ` Jesper Dangaard Brouer
  2016-01-24 14:44                   ` Michael S. Tsirkin
@ 2016-01-24 20:09                   ` Tom Herbert
  2016-01-24 21:41                     ` John Fastabend
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-24 20:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Michael S. Tsirkin

On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>> Date: Thu, 21 Jan 2016 12:27:30 +0100
>>
>> > eth_type_trans() does two things:
>> >
>> > 1) determine skb->protocol
>> > 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>> >
>> > Could the HW descriptor deliver the "proto", or perhaps just some bits
>> > on the most common proto's?
>> >
>> > The skb->pkt_type don't need many bits.  And I bet the HW already have
>> > the information.  The BROADCAST and MULTICAST indication are easy.  The
>> > PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
>> > indication, if the eth->h_dest match the devices dev->dev_addr (else a
>> > SW compare is required).
>> >
>> > Is that doable in hardware?
>>
>> I feel like we've had this discussion before several years ago.
>>
>> I think having just the protocol value would be enough.
>>
>> skb->pkt_type we could deal with by using always an accessor and
>> evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
>> similar.
>
> First I thought, I liked the idea delaying the eval of skb->pkt_type.
>
> BUT then I realized, what if we take this even further.  What if we
> actually use this information, for something useful, at this very
> early RX stage.
>
> The information I'm interested in, from the HW descriptor, is if this
> packet is NOT for local delivery.  If so, we can send the packet on a
> "fast-forward" code path.
>
> Think about bridging packets to a guest OS.  Because we know very
> early at RX (from packet HW descriptor) we might even avoid allocating
> a SKB.  We could just "forward" the packet-page to the guest OS.
>
> Taking Eric's idea, of remote CPUs, we could even send these
> packet-pages to a remote CPU (e.g. where the guest OS is running),
> without having touched a single cache-line in the packet-data.  I
> would still bundle them up first, to amortize the (100-133ns) cost of
> transferring something to another CPU.
>
You mean like RPS/RFS/aRFS/flow_director already does (except for the
zero-touch part)?

> The data-cache trick, would be to instruct prefetcher only to start
> prefetching to L3 or L2, when these packet are destined for a remote
> CPU.  At-least Intel CPUs have prefetch operations that specify only
> L2/L3 cache.
>
>
> Maybe, we need a combined solution.  Lazy eval skb->pkt_type, for
> local delivery, but set the information if avail from HW desc.  And
> fast page-forward don't even need a SKB.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-24 20:09                   ` Optimizing instruction-cache, more packets at each stage Tom Herbert
@ 2016-01-24 21:41                     ` John Fastabend
  2016-01-24 23:50                       ` Tom Herbert
  0 siblings, 1 reply; 59+ messages in thread
From: John Fastabend @ 2016-01-24 21:41 UTC (permalink / raw)
  To: Tom Herbert, Jesper Dangaard Brouer
  Cc: David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Michael S. Tsirkin

On 16-01-24 12:09 PM, Tom Herbert wrote:
> On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
>> David Miller <davem@davemloft.net> wrote:
>>
>>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>>> Date: Thu, 21 Jan 2016 12:27:30 +0100
>>>
>>>> eth_type_trans() does two things:
>>>>
>>>> 1) determine skb->protocol
>>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>>>>
>>>> Could the HW descriptor deliver the "proto", or perhaps just some bits
>>>> on the most common proto's?
>>>>
>>>> The skb->pkt_type don't need many bits.  And I bet the HW already have
>>>> the information.  The BROADCAST and MULTICAST indication are easy.  The
>>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
>>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a
>>>> SW compare is required).
>>>>
>>>> Is that doable in hardware?
>>>
>>> I feel like we've had this discussion before several years ago.
>>>
>>> I think having just the protocol value would be enough.
>>>
>>> skb->pkt_type we could deal with by using always an accessor and
>>> evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
>>> similar.
>>
>> First I thought, I liked the idea delaying the eval of skb->pkt_type.
>>
>> BUT then I realized, what if we take this even further.  What if we
>> actually use this information, for something useful, at this very
>> early RX stage.
>>
>> The information I'm interested in, from the HW descriptor, is if this
>> packet is NOT for local delivery.  If so, we can send the packet on a
>> "fast-forward" code path.
>>
>> Think about bridging packets to a guest OS.  Because we know very
>> early at RX (from packet HW descriptor) we might even avoid allocating
>> a SKB.  We could just "forward" the packet-page to the guest OS.
>>
>> Taking Eric's idea, of remote CPUs, we could even send these
>> packet-pages to a remote CPU (e.g. where the guest OS is running),
>> without having touched a single cache-line in the packet-data.  I
>> would still bundle them up first, to amortize the (100-133ns) cost of
>> transferring something to another CPU.
>>
> You mean like RPS/RFS/aRFS/flow_director already does (except for the
> zero-touch part)?
> 

You could also look at ATR in the ixgbe/i40e drivers, which on xmit
uses a tuple to try to force the hardware to receive on the same queue
pair as the sending side. The idea is that you can bind tx/rx queue
pairs to a core and send/receive on the same core, which tends to be
an OK strategy, although not always; it is sometimes better to tx and
rx on separate cores.

>> The data-cache trick, would be to instruct prefetcher only to start
>> prefetching to L3 or L2, when these packet are destined for a remote
>> CPU.  At-least Intel CPUs have prefetch operations that specify only
>> L2/L3 cache.
>>
>>
>> Maybe, we need a combined solution.  Lazy eval skb->pkt_type, for
>> local delivery, but set the information if avail from HW desc.  And
>> fast page-forward don't even need a SKB.
>>
>> --
>> Best regards,
>>   Jesper Dangaard Brouer
>>   MSc.CS, Principal Kernel Engineer at Red Hat
>>   Author of http://www.iptv-analyzer.org
>>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-24 21:41                     ` John Fastabend
@ 2016-01-24 23:50                       ` Tom Herbert
  0 siblings, 0 replies; 59+ messages in thread
From: Tom Herbert @ 2016-01-24 23:50 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, David Miller, Eric Dumazet, Or Gerlitz,
	Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Michael S. Tsirkin

On Sun, Jan 24, 2016 at 1:41 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 16-01-24 12:09 PM, Tom Herbert wrote:
>> On Sun, Jan 24, 2016 at 6:28 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
>>> David Miller <davem@davemloft.net> wrote:
>>>
>>>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>>>> Date: Thu, 21 Jan 2016 12:27:30 +0100
>>>>
>>>>> eth_type_trans() does two things:
>>>>>
>>>>> 1) determine skb->protocol
>>>>> 2) setup skb->pkt_type = PACKET_{BROADCAST,MULTICAST,OTHERHOST}
>>>>>
>>>>> Could the HW descriptor deliver the "proto", or perhaps just some bits
>>>>> on the most common proto's?
>>>>>
>>>>> The skb->pkt_type don't need many bits.  And I bet the HW already have
>>>>> the information.  The BROADCAST and MULTICAST indication are easy.  The
>>>>> PACKET_OTHERHOST, can be turned around, by instead set a PACKET_HOST
>>>>> indication, if the eth->h_dest match the devices dev->dev_addr (else a
>>>>> SW compare is required).
>>>>>
>>>>> Is that doable in hardware?
>>>>
>>>> I feel like we've had this discussion before several years ago.
>>>>
>>>> I think having just the protocol value would be enough.
>>>>
>>>> skb->pkt_type we could deal with by using always an accessor and
>>>> evaluating it lazily.  Nothing needs it until we hit ip_rcv() or
>>>> similar.
>>>
>>> First I thought, I liked the idea delaying the eval of skb->pkt_type.
>>>
>>> BUT then I realized, what if we take this even further.  What if we
>>> actually use this information, for something useful, at this very
>>> early RX stage.
>>>
>>> The information I'm interested in, from the HW descriptor, is if this
>>> packet is NOT for local delivery.  If so, we can send the packet on a
>>> "fast-forward" code path.
>>>
>>> Think about bridging packets to a guest OS.  Because we know very
>>> early at RX (from packet HW descriptor) we might even avoid allocating
>>> a SKB.  We could just "forward" the packet-page to the guest OS.
>>>
>>> Taking Eric's idea, of remote CPUs, we could even send these
>>> packet-pages to a remote CPU (e.g. where the guest OS is running),
>>> without having touched a single cache-line in the packet-data.  I
>>> would still bundle them up first, to amortize the (100-133ns) cost of
>>> transferring something to another CPU.
>>>
>> You mean like RPS/RFS/aRFS/flow_director already does (except for the
>> zero-touch part)?
>>
>
> You could also look at ATR in the ixgbe/i40e drivers which on xmit
> uses a tuple to try and force the hardware to recv on the same queue
> pair as the sending side. The idea being you can bind tx/rx queue
> pairs to a core and send/recv on the same core which tends to be an
> OK strategy although not always. It is sometimes better to tx and rx
> on separate cores.
>
Right, we have seen cases where HW attempting to autonomously bind
tx/rx to the same CPU does nothing more than create a whole bunch of
OOO packets and otherwise a big mess.  The better approach is to allow
the stack to indicate to the HW where *it* wants received packets of
each flow to go. If it wants to bind tx/rx it can do that; if it wants
to split, that's fine too. This is possible with aRFS, and in fact I
don't see any reason why virtual drivers shouldn't also support aRFS
to give guests control over steering within their CPUs.

>>> The data-cache trick, would be to instruct prefetcher only to start
>>> prefetching to L3 or L2, when these packet are destined for a remote
>>> CPU.  At-least Intel CPUs have prefetch operations that specify only
>>> L2/L3 cache.
>>>
>>>
>>> Maybe, we need a combined solution.  Lazy eval skb->pkt_type, for
>>> local delivery, but set the information if avail from HW desc.  And
>>> fast page-forward don't even need a SKB.
>>>
>>> --
>>> Best regards,
>>>   Jesper Dangaard Brouer
>>>   MSc.CS, Principal Kernel Engineer at Red Hat
>>>   Author of http://www.iptv-analyzer.org
>>>   LinkedIn: http://www.linkedin.com/in/brouer
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-18 11:54       ` Jesper Dangaard Brouer
  2016-01-18 17:01         ` Eric Dumazet
@ 2016-01-25  0:08         ` Florian Fainelli
  1 sibling, 0 replies; 59+ messages in thread
From: Florian Fainelli @ 2016-01-25  0:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Felix Fietkau
  Cc: David Laight, netdev, David Miller, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend

Hi Jesper

On 18/01/2016 03:54, Jesper Dangaard Brouer wrote:
> 
> On Fri, 15 Jan 2016 15:38:43 +0100 Felix Fietkau <nbd@openwrt.org> wrote:
>> On 2016-01-15 15:00, Jesper Dangaard Brouer wrote:
> [...]
>>>
>>> The icache is still quite small 32Kb on modern server processors.  I
>>> don't know if smaller embedded processors also have icache and how
>>> large they are.  I speculate this approach would also be a benefit for
>>> them (if they have icache).
>>
>> All of the router devices that I work with have icache. Typical sizes
>> are 32 or 64 KiB. FWIW, I'm really looking forward to having such
>> optimizations in the network stack ;)
> 
> That is very interesting. These kind of icache optimization will then
> likely benefit lower-end devices more than high end Intel CPUs :-)

Typical embedded routers have small I and D caches, but they also have
fairly small cache-line sizes (16, 32 or 64 bytes) and not necessarily
an L2 cache to help them; the memory bandwidth is also very limited
(DDR/DDR2 speeds are not uncommon), so the fewer I/D cache lines you
thrash, the better, obviously.

One thing that some HW vendors did, before they started introducing
HW capable of offloading routing/NAT workloads to specialized
hardware, is to hack the heck out of the Linux network stack to allow
a lightweight SKB structure to be used for forwarding, and to allocate
these "meta" bookkeeping SKBs from a dedicated kmem_cache pool to get
relatively predictable latencies.

There is also the notion of a dirty pointer within the skbuff itself,
such that instead of e.g. having your Ethernet NIC driver do a DMA-API
call which can potentially invalidate the D-cache for an entire
1500-ish byte Ethernet frame, the packet contents are "valid" only up
to the dirty pointer. That is a nice trick if you are just forwarding,
but it requires both the SKB accessors/manipulation functions to check
it and your Ethernet driver to be cooperative, so it may not scale well.
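
With the standard DMA API you can get part of the way there by only
syncing the bytes the CPU will actually look at; a small sketch
(HDR_SYNC_LEN is just an assumed driver constant):

  #include <linux/dma-mapping.h>

  #define HDR_SYNC_LEN 128        /* enough for the L2/L3/L4 headers */

  /* Only sync/invalidate the header portion of the RX buffer, instead
   * of the full 1500-ish byte frame, before the CPU parses it. */
  static void rx_sync_headers(struct device *dev, dma_addr_t buf_dma)
  {
          dma_sync_single_for_cpu(dev, buf_dma, HDR_SYNC_LEN,
                                  DMA_FROM_DEVICE);
  }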

Broadcom's implementation of such a thing can be found among the files
below; the code is not kernel-style compliant, but there might be some
reusable ideas for you:

NBUFF/FKBUFF/SKBUFF are the actual packet bookkeeping data structures
that replace and/or extend the use of SKBs:

https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/include/linux/nbuff.h
https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/net/core/nbuff.c

# Check for CONFIG_MIPS_BRCM changes here:
https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/net/core/skbuff.c
https://code.google.com/p/gfiber-gflt100/source/browse/kernel/linux/include/linux/skbuff.h

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-24 17:28                     ` John Fastabend
@ 2016-01-25 13:15                       ` Jesper Dangaard Brouer
  2016-01-25 17:09                         ` Tom Herbert
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-25 13:15 UTC (permalink / raw)
  To: John Fastabend
  Cc: Michael S. Tsirkin, David Miller, tom, eric.dumazet, gerlitz.or,
	edumazet, netdev, alexander.duyck, alexei.starovoitov, borkmann,
	marek, hannes, fw, pabeni, john.r.fastabend, amirva,
	Daniel Borkmann, vyasevich, brouer


After reading John's reply about perfect filters, I want to re-state
my idea for this very early RX stage, and describe a packet-page
level bypass use-case that John indirectly mentions.


There are two ideas getting mixed up here: (1) bundling from the
RX-ring, and (2) allowing the "packet-page" to be picked up directly.

Bundling (1) is something that seems natural, and which helps us
amortize the cost between layers (and utilizes the icache better).
Let's keep that in another thread.

The (2) direct forwarding of "packet-pages" is a fairly extreme idea,
BUT it has the potential of being a new integration point for
"selective" bypass solutions, and of bringing RAW/af_packet (RX) up to
speed with bypass solutions.


Today, the bypass solutions grab and control the entire NIC HW.  In
many cases this is not very practical if you also want to use the NIC
for something else.

Solutions for bypassing only part of the traffic are starting to show
up, both a netmap[1] and a DPDK[2] based approach.

[1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
[2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/

Both approaches install a HW filter in the NIC and redirect packets
to a separate RX HW queue (via ethtool ntuple + flow-type).  DPDK
needs a PCI SR-IOV setup and then runs its own poll-mode driver on top.
Netmap patches the original ixgbe driver and, since
CloudFlare/Gilberto's changes[3], supports a single-RX-queue mode.

[3] https://github.com/luigirizzo/netmap/pull/87


I'm thinking: why run all this extra driver software on top?  Why
don't we just pick up the (packet-)page from the RX ring and hand it
over to a registered bypass handler?  (As mentioned before, the HW
descriptor needs to somehow "mark" these packets for us.)

I imagine some kind of page ring structure, and I also imagine
RAW/af_packet being a "bypass" consumer.  I guess the af_packet part
was also something John and Daniel have been looking at.
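
To make it concrete, here is a purely hypothetical sketch of the hook
I have in mind -- nothing like this exists in the kernel today, and
all the names are made up:

  #include <linux/netdevice.h>
  #include <linux/mm_types.h>

  struct pkt_page_handler {
          /* return true if the page was consumed (ownership transferred) */
          bool (*rx_page)(struct net_device *dev, struct page *page,
                          unsigned int offset, unsigned int len,
                          void *priv);
          void *priv;
  };

  /* hypothetical registration, e.g. one handler per RX HW queue */
  int netdev_rx_register_page_handler(struct net_device *dev, u16 queue,
                                      struct pkt_page_handler *h);

  /* In the driver RX loop, before any SKB work:
   *
   *      if (desc_marked_for_bypass(rx_desc) &&
   *          h->rx_page(dev, page, offset, len, h->priv))
   *              continue;       // no SKB allocated, page handed over
   *
   *      skb = build_skb(...);   // normal path otherwise
   */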


(top post, but I left John's reply below, because it got me thinking)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer




On Sun, 24 Jan 2016 09:28:36 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-01-24 06:44 AM, Michael S. Tsirkin wrote:
> > On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote:  
> >> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
> >> David Miller <davem@davemloft.net> wrote:
> >>  
> >>> From: Jesper Dangaard Brouer <brouer@redhat.com>
> >>> Date: Thu, 21 Jan 2016 12:27:30 +0100
> >>>  
[...]

> >>
> >> BUT then I realized, what if we take this even further.  What if we
> >> actually use this information, for something useful, at this very
> >> early RX stage.
> >>
> >> The information I'm interested in, from the HW descriptor, is if this
> >> packet is NOT for local delivery.  If so, we can send the packet on a
> >> "fast-forward" code path.
> >>
> >> Think about bridging packets to a guest OS.  Because we know very
> >> early at RX (from packet HW descriptor) we might even avoid allocating
> >> a SKB.  We could just "forward" the packet-page to the guest OS.  
> > 
> > OK, so you would build a new kind of rx handler, and then
> > e.g. macvtap could maybe get packets this way?
> > Sure - e.g. vhost expects an skb at the moment
> > but it won't be too hard to teach it that there's
> > some other option.  
> 
> + Daniel, Vlad
> 
> If you use the macvtap device with the offload features you can "know"
> via mac address that all packets on a specific hardware queue set belong
> to a specific guest. (the queues are bound to a new netdev) This works
> well with the passthru mode of macvlan. So you can do hardware bridging
> this way. Supporting similar L3 modes probably not via macvlan has been
> on my todo list for awhile but I haven't got there yet. ixgbe and fm10k
> intel drivers support this now maybe others but those are the two I've
> worked with recently.
> 
> The idea here is you remove any overhead from running bridge code, etc.
> but still allowing users to stick netfilter, qos, etc hooks in the
> datapath.
> 
> Also Daniel and I started working on a zero-copy RX mode which would
> further help this by letting vhost-net pass down a set of dma buffers
> we should probably get this working and submit it. iirc Vlad also
> had the same sort of idea. The initial data for this looked good but
> not as good as the solution below. However it had a similar issue as
> below in that you just jumped over netfilter, qos, etc. Our initial
> implementation used af_packet.
> 
> > 
> > Or maybe some kind of stub skb that just has
> > the correct length but no data is easier,
> > I'm not sure.
> >   
> 
> Another option is to use perfect filters to push traffic to a VF and
> then map the VF into user space and use the vhost dpdk bits. This
> works fairly well and gets pkts into the guest with little hypervisor
> overhead and no(?) kernel network stack overhead. But the trade-off is
> you cut out netfilter, qos, etc. This is really slick if you "trust"
> your guest or have enough ACLs/etc in your hardware to "trust' the
> guest.
> 
> A compromise is to use a VF and do not unbind it from the OS then
> you can use macvtap again and map the netdev 1:1 to a guest. With
> this mode you can still use your netfilter, qos, etc. but do l2,l3,l4
> hardware forwarding with perfect filters.
> 
> As an aside if you don't like ethtool perfect filters I have a set of
> patches to control this via 'tc' that I'll submit when net-next opens
> up again which would let you support filtering on more field options
> using offset:mask:value notation.
> 
> >> Taking Eric's idea, of remote CPUs, we could even send these
> >> packet-pages to a remote CPU (e.g. where the guest OS is running),
> >> without having touched a single cache-line in the packet-data.  I
> >> would still bundle them up first, to amortize the (100-133ns) cost of
> >> transferring something to another CPU.  
> > 
> > This bundling would have to happen in a guest
> > specific way then, so in vhost.
> > I'd be curious to see what you come up with.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-25 13:15                       ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer
@ 2016-01-25 17:09                         ` Tom Herbert
  2016-01-25 17:50                           ` John Fastabend
  0 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-25 17:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich

On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> After reading John's reply about perfect filters, I want to re-state
> my idea, for this very early RX stage.  And describe a packet-page
> level bypass use-case, that John indirectly mentions.
>
>
> There are two ideas, getting mixed up here.  (1) bundling from the
> RX-ring, (2) allowing to pick up the "packet-page" directly.
>
> Bundling (1) is something that seems natural, and which help us
> amortize the cost between layers (and utilizes icache better). Lets
> keep that in another thread.
>
> This (2) direct forward of "packet-pages" is a fairly extreme idea,
> BUT it have the potential of being an new integration point for
> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
> speed with bypass-solutions.
>
>
> Today, the bypass-solutions grab and control the entire NIC HW.  In
> many cases this is not very practical, if you also want to use the NIC
> for something else.
>
> Solutions for bypassing only part of the traffic is starting to show
> up.  Both a netmap[1] and a DPDK[2] based approach.
>
> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/
>
> Both approaches install a HW filter in the NIC, and redirect packets
> to a separate RX HW queue (via ethtool ntuple + flow-type).  DPDK
> needs pci SRIOV setup and then run it own poll-mode driver on top.
> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's
> changes[3] support a single RX queue mode.
>
Jesper, thanks for providing more specifics.

One comment: If you intend to change core code paths or APIs for this,
then I think that we should require up front that the associated HW
support is protocol agnostic (i.e. HW filters must be programmable and
generic). We don't want a promising feature like this to be
undermined by protocol ossification.

Thanks,
Tom

> [3] https://github.com/luigirizzo/netmap/pull/87
>
>
> I'm thinking, why run all this extra driver software on top.  Why
> don't we just pickup the (packet)-page from the RX ring, and
> hand-it-over to a registered bypass handler?  (as mentioned before,
> the HW descriptor need to somehow "mark" these packets for us).
>
> I imagine some kind of page ring structure, and I also imagine
> RAW/af_packet being a "bypass" consumer.  I guess the af_packet part
> was also something John and Daniel have been looking at.
>
>
> (top post, but left John's replay below, because it got me thinking)
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
>
>
>
>
> On Sun, 24 Jan 2016 09:28:36 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> On 16-01-24 06:44 AM, Michael S. Tsirkin wrote:
>> > On Sun, Jan 24, 2016 at 03:28:14PM +0100, Jesper Dangaard Brouer wrote:
>> >> On Thu, 21 Jan 2016 10:54:01 -0800 (PST)
>> >> David Miller <davem@davemloft.net> wrote:
>> >>
>> >>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>> >>> Date: Thu, 21 Jan 2016 12:27:30 +0100
>> >>>
> [...]
>
>> >>
>> >> BUT then I realized, what if we take this even further.  What if we
>> >> actually use this information, for something useful, at this very
>> >> early RX stage.
>> >>
>> >> The information I'm interested in, from the HW descriptor, is if this
>> >> packet is NOT for local delivery.  If so, we can send the packet on a
>> >> "fast-forward" code path.
>> >>
>> >> Think about bridging packets to a guest OS.  Because we know very
>> >> early at RX (from packet HW descriptor) we might even avoid allocating
>> >> a SKB.  We could just "forward" the packet-page to the guest OS.
>> >
>> > OK, so you would build a new kind of rx handler, and then
>> > e.g. macvtap could maybe get packets this way?
>> > Sure - e.g. vhost expects an skb at the moment
>> > but it won't be too hard to teach it that there's
>> > some other option.
>>
>> + Daniel, Vlad
>>
>> If you use the macvtap device with the offload features you can "know"
>> via mac address that all packets on a specific hardware queue set belong
>> to a specific guest. (the queues are bound to a new netdev) This works
>> well with the passthru mode of macvlan. So you can do hardware bridging
>> this way. Supporting similar L3 modes probably not via macvlan has been
>> on my todo list for awhile but I haven't got there yet. ixgbe and fm10k
>> intel drivers support this now maybe others but those are the two I've
>> worked with recently.
>>
>> The idea here is you remove any overhead from running bridge code, etc.
>> but still allowing users to stick netfilter, qos, etc hooks in the
>> datapath.
>>
>> Also Daniel and I started working on a zero-copy RX mode which would
>> further help this by letting vhost-net pass down a set of dma buffers
>> we should probably get this working and submit it. iirc Vlad also
>> had the same sort of idea. The initial data for this looked good but
>> not as good as the solution below. However it had a similar issue as
>> below in that you just jumped over netfilter, qos, etc. Our initial
>> implementation used af_packet.
>>
>> >
>> > Or maybe some kind of stub skb that just has
>> > the correct length but no data is easier,
>> > I'm not sure.
>> >
>>
>> Another option is to use perfect filters to push traffic to a VF and
>> then map the VF into user space and use the vhost dpdk bits. This
>> works fairly well and gets pkts into the guest with little hypervisor
>> overhead and no(?) kernel network stack overhead. But the trade-off is
>> you cut out netfilter, qos, etc. This is really slick if you "trust"
>> your guest or have enough ACLs/etc in your hardware to "trust' the
>> guest.
>>
>> A compromise is to use a VF and do not unbind it from the OS then
>> you can use macvtap again and map the netdev 1:1 to a guest. With
>> this mode you can still use your netfilter, qos, etc. but do l2,l3,l4
>> hardware forwarding with perfect filters.
>>
>> As an aside if you don't like ethtool perfect filters I have a set of
>> patches to control this via 'tc' that I'll submit when net-next opens
>> up again which would let you support filtering on more field options
>> using offset:mask:value notation.
>>
>> >> Taking Eric's idea, of remote CPUs, we could even send these
>> >> packet-pages to a remote CPU (e.g. where the guest OS is running),
>> >> without having touched a single cache-line in the packet-data.  I
>> >> would still bundle them up first, to amortize the (100-133ns) cost of
>> >> transferring something to another CPU.
>> >
>> > This bundling would have to happen in a guest
>> > specific way then, so in vhost.
>> > I'd be curious to see what you come up with.
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-25 17:09                         ` Tom Herbert
@ 2016-01-25 17:50                           ` John Fastabend
  2016-01-25 21:32                             ` Tom Herbert
  2016-01-25 22:10                             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 59+ messages in thread
From: John Fastabend @ 2016-01-25 17:50 UTC (permalink / raw)
  To: Tom Herbert, Jesper Dangaard Brouer
  Cc: Michael S. Tsirkin, David Miller, Eric Dumazet, Or Gerlitz,
	Eric Dumazet, Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich

On 16-01-25 09:09 AM, Tom Herbert wrote:
> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> After reading John's reply about perfect filters, I want to re-state
>> my idea, for this very early RX stage.  And describe a packet-page
>> level bypass use-case, that John indirectly mentions.
>>
>>
>> There are two ideas, getting mixed up here.  (1) bundling from the
>> RX-ring, (2) allowing to pick up the "packet-page" directly.
>>
>> Bundling (1) is something that seems natural, and which help us
>> amortize the cost between layers (and utilizes icache better). Lets
>> keep that in another thread.
>>
>> This (2) direct forward of "packet-pages" is a fairly extreme idea,
>> BUT it have the potential of being an new integration point for
>> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
>> speed with bypass-solutions.
>>
>>
>> Today, the bypass-solutions grab and control the entire NIC HW.  In
>> many cases this is not very practical, if you also want to use the NIC
>> for something else.
>>
>> Solutions for bypassing only part of the traffic is starting to show
>> up.  Both a netmap[1] and a DPDK[2] based approach.
>>
>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/
>>
>> Both approaches install a HW filter in the NIC, and redirect packets
>> to a separate RX HW queue (via ethtool ntuple + flow-type).  DPDK
>> needs pci SRIOV setup and then run it own poll-mode driver on top.
>> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's
>> changes[3] support a single RX queue mode.
>>

FWIW I wrote a version of the patch talked about in the queue splitting
article that didn't require SR-IOV, and we also talked about it at the
last netconf in Ottawa. The problem is that without SR-IOV, if you map a
queue directly into userspace so you can run the poll-mode drivers,
there is nothing protecting the DMA engine, so userspace can put
arbitrary addresses in there. There is something called Process Address
Space ID (PASID), also part of the PCI-SIG spec, that could help you
here, but I don't know of any hardware that supports it. The other
option is to use system calls and validate the descriptors in the
kernel, but this incurs some overhead; we had it at 15% or so when I
did the numbers last year. However, I'm told there is some interesting
work going on around syscall overhead that may help.

One thing to note is that SR-IOV does somewhat limit the number of
these types of interfaces you can support to the max number of VFs,
whereas the queue mechanism, although slower due to the function call,
would be limited only by the max number of queues. Also, busy polling
will help here if you are worried about pps.

Jesper, at least for your case (2), what are we missing with the
bifurcated/queue splitting work? Are you really after systems
without SR-IOV support, or are you trying to get this on the order
of queues instead of VFs?

> Jepser, thanks for providing more specifics.
> 
> One comment: If you intend to change core code paths or APIs for this,
> then I think that we should require up front that the associated HW
> support is protocol agnostic (i.e. HW filters must be programmable and
> generic ). We don't want a promising feature like this to be
> undermined by protocol ossification.

At the moment we use ethtool ntuple filters, which basically means
adding a new set of enums and structures every time we need a new
protocol, so it's painful: you need your vendor to support you, and
you need a new kernel.

The flow API was shot down (it would have got you to the point where
the user could specify the protocols for the driver to implement, e.g.
put_parse_graph), and the only new proposals I've seen are bpf
translations in drivers and 'tc'. I plan to take another shot at this
in net-next.

> 
> Thanks,
> Tom
> 
>> [3] https://github.com/luigirizzo/netmap/pull/87
>>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-25 17:50                           ` John Fastabend
@ 2016-01-25 21:32                             ` Tom Herbert
  2016-01-25 21:58                               ` John Fastabend
  2016-01-25 22:10                             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-25 21:32 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, Michael S. Tsirkin, David Miller,
	Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich

On Mon, Jan 25, 2016 at 9:50 AM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 16-01-25 09:09 AM, Tom Herbert wrote:
>> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>>
>>> After reading John's reply about perfect filters, I want to re-state
>>> my idea, for this very early RX stage.  And describe a packet-page
>>> level bypass use-case, that John indirectly mentions.
>>>
>>>
>>> There are two ideas, getting mixed up here.  (1) bundling from the
>>> RX-ring, (2) allowing to pick up the "packet-page" directly.
>>>
>>> Bundling (1) is something that seems natural, and which help us
>>> amortize the cost between layers (and utilizes icache better). Lets
>>> keep that in another thread.
>>>
>>> This (2) direct forward of "packet-pages" is a fairly extreme idea,
>>> BUT it have the potential of being an new integration point for
>>> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
>>> speed with bypass-solutions.
>>>
>>>
>>> Today, the bypass-solutions grab and control the entire NIC HW.  In
>>> many cases this is not very practical, if you also want to use the NIC
>>> for something else.
>>>
>>> Solutions for bypassing only part of the traffic is starting to show
>>> up.  Both a netmap[1] and a DPDK[2] based approach.
>>>
>>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
>>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/
>>>
>>> Both approaches install a HW filter in the NIC, and redirect packets
>>> to a separate RX HW queue (via ethtool ntuple + flow-type).  DPDK
>>> needs pci SRIOV setup and then run it own poll-mode driver on top.
>>> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's
>>> changes[3] support a single RX queue mode.
>>>
>
> FWIW I wrote a version of the patch talked about in the queue splitting
> article that didn't require SR-IOV and we also talked about it at last
> netconf in ottowa. The problem is without SR-IOV if you map a queue
> directly into userspace so you can run the poll mode drivers there is
> nothing protecting the DMA engine. So userspace can put arbitrary
> addresses in there. There is something called Process Address Space ID
> (PASID) also part of the PCI-SIG spec that could help you here but I
> don't know of any hardware that supports it. The other option is to
> use system calls and validate the descriptors in the kernel but this
> incurs some overhead we had it at 15% or so when I did the numbers
> last year. However I'm told there is some interesting work going on
> around syscall overhead that may help.
>
> One thing to note is SRIOV does somewhat limit the number of these
> types of interfaces you can support to the max VFs where as the
> queue mechanism although slower with a function call would be limited
> to max number of queues. Also busy polling will help here if you
> are worried about pps.
>
I think you're understating that a bit :-) We know that busy polling
helps with both pps and latency. IIRC, busy polling in the kernel
reduced latency by 2/3. Any latency or pps comparison between an
interrupt-driven kernel stack and a userspace stack doing polling
would be invalid. If this work is all about latency (i.e. burning
cores is not an issue), maybe busy polling should be assumed for
all test cases?

> Jesper, at least for you (2) case what are we missing with the
> bifurcated/queue splitting work? Are you really after systems
> without SR-IOV support or are you trying to get this on the order
> of queues instead of VFs.
>
>> Jepser, thanks for providing more specifics.
>>
>> One comment: If you intend to change core code paths or APIs for this,
>> then I think that we should require up front that the associated HW
>> support is protocol agnostic (i.e. HW filters must be programmable and
>> generic ). We don't want a promising feature like this to be
>> undermined by protocol ossification.
>
> At the moment we use ethtool ntuple filters which is basically adding
> a new set of enums and structures every time we need a new protocol
> so its painful and you need your vendor to support you and you need a
> new kernel.
>
> The flow api was shot down (which would get you to the point where
> the user could specify the protocols for the driver to implement e.g.
> put_parse_graph) and the only new proposals I've seen are bpf
> translations in drivers and 'tc'. I plan to take another shot at this in
> net-next.
>
>>
>> Thanks,
>> Tom
>>
>>> [3] https://github.com/luigirizzo/netmap/pull/87
>>>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-25 21:32                             ` Tom Herbert
@ 2016-01-25 21:58                               ` John Fastabend
  0 siblings, 0 replies; 59+ messages in thread
From: John Fastabend @ 2016-01-25 21:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Michael S. Tsirkin, David Miller,
	Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich

On 16-01-25 01:32 PM, Tom Herbert wrote:
> On Mon, Jan 25, 2016 at 9:50 AM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 16-01-25 09:09 AM, Tom Herbert wrote:
>>> On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
>>> <brouer@redhat.com> wrote:
>>>>
>>>> After reading John's reply about perfect filters, I want to re-state
>>>> my idea, for this very early RX stage.  And describe a packet-page
>>>> level bypass use-case, that John indirectly mentions.
>>>>
>>>>
>>>> There are two ideas, getting mixed up here.  (1) bundling from the
>>>> RX-ring, (2) allowing to pick up the "packet-page" directly.
>>>>
>>>> Bundling (1) is something that seems natural, and which help us
>>>> amortize the cost between layers (and utilizes icache better). Lets
>>>> keep that in another thread.
>>>>
>>>> This (2) direct forward of "packet-pages" is a fairly extreme idea,
>>>> BUT it have the potential of being an new integration point for
>>>> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
>>>> speed with bypass-solutions.
>>>>
>>>>
>>>> Today, the bypass-solutions grab and control the entire NIC HW.  In
>>>> many cases this is not very practical, if you also want to use the NIC
>>>> for something else.
>>>>
>>>> Solutions for bypassing only part of the traffic is starting to show
>>>> up.  Both a netmap[1] and a DPDK[2] based approach.
>>>>
>>>> [1] https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/
>>>> [2] http://rhelblog.redhat.com/2015/10/02/getting-the-best-of-both-worlds-with-queue-splitting-bifurcated-driver/
>>>>
>>>> Both approaches install a HW filter in the NIC, and redirect packets
>>>> to a separate RX HW queue (via ethtool ntuple + flow-type).  DPDK
>>>> needs pci SRIOV setup and then run it own poll-mode driver on top.
>>>> Netmap patch the orig ixgbe driver, and since CloudFlare/Gilberto's
>>>> changes[3] support a single RX queue mode.
>>>>
>>
>> FWIW I wrote a version of the patch talked about in the queue splitting
>> article that didn't require SR-IOV and we also talked about it at last
>> netconf in ottowa. The problem is without SR-IOV if you map a queue
>> directly into userspace so you can run the poll mode drivers there is
>> nothing protecting the DMA engine. So userspace can put arbitrary
>> addresses in there. There is something called Process Address Space ID
>> (PASID) also part of the PCI-SIG spec that could help you here but I
>> don't know of any hardware that supports it. The other option is to
>> use system calls and validate the descriptors in the kernel but this
>> incurs some overhead we had it at 15% or so when I did the numbers
>> last year. However I'm told there is some interesting work going on
>> around syscall overhead that may help.
>>
>> One thing to note is SRIOV does somewhat limit the number of these
>> types of interfaces you can support to the max VFs where as the
>> queue mechanism although slower with a function call would be limited
>> to max number of queues. Also busy polling will help here if you
>> are worried about pps.
>>
> I think you're understating that a bit :-) We know that busy polling
> helps with both pps and latency. IIRC, busy polling in the kernel
> reduced latency by 2/3. Any latency or pps comparison between an
> interrupt driven kernel stack and a userspace stack doing polling
> would be invalid. If this work is all about latency (like burning
> cores is not an issue), maybe busy polling should be be assumed for
> all test cases?

Probably, if you're going to try and report pps numbers and chart
them, we might as well play the game and use the best configuration we
can.

Although I did want to make busy polling per queue, or maybe create
L3/L4 netdevs like macvlan and put those in busy polling. It's a bit
overkill to put the entire device in busy-polling mode when we have
only a couple of sockets doing it. net-next is opening soon, right ;)

> 
>> Jesper, at least for you (2) case what are we missing with the
>> bifurcated/queue splitting work? Are you really after systems
>> without SR-IOV support or are you trying to get this on the order
>> of queues instead of VFs.
>>
>>> Jepser, thanks for providing more specifics.
>>>
>>> One comment: If you intend to change core code paths or APIs for this,
>>> then I think that we should require up front that the associated HW
>>> support is protocol agnostic (i.e. HW filters must be programmable and
>>> generic ). We don't want a promising feature like this to be
>>> undermined by protocol ossification.
>>
>> At the moment we use ethtool ntuple filters which is basically adding
>> a new set of enums and structures every time we need a new protocol
>> so its painful and you need your vendor to support you and you need a
>> new kernel.
>>
>> The flow api was shot down (which would get you to the point where
>> the user could specify the protocols for the driver to implement e.g.
>> put_parse_graph) and the only new proposals I've seen are bpf
>> translations in drivers and 'tc'. I plan to take another shot at this in
>> net-next.
>>
>>>
>>> Thanks,
>>> Tom
>>>
>>>> [3] https://github.com/luigirizzo/netmap/pull/87
>>>>
>>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-25 17:50                           ` John Fastabend
  2016-01-25 21:32                             ` Tom Herbert
@ 2016-01-25 22:10                             ` Jesper Dangaard Brouer
  2016-01-27 20:47                               ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-25 22:10 UTC (permalink / raw)
  To: John Fastabend
  Cc: Tom Herbert, Michael S. Tsirkin, David Miller, Eric Dumazet,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich, brouer


On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-01-25 09:09 AM, Tom Herbert wrote:
> > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> >>
[...]
> >>
> >> There are two ideas, getting mixed up here.  (1) bundling from the
> >> RX-ring, (2) allowing to pick up the "packet-page" directly.
> >>
> >> Bundling (1) is something that seems natural, and which help us
> >> amortize the cost between layers (and utilizes icache better). Lets
> >> keep that in another thread.
> >>
> >> This (2) direct forward of "packet-pages" is a fairly extreme idea,
> >> BUT it have the potential of being an new integration point for
> >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
> >> speed with bypass-solutions.
>
[...]
> 
> Jesper, at least for you (2) case what are we missing with the
> bifurcated/queue splitting work? Are you really after systems
> without SR-IOV support or are you trying to get this on the order
> of queues instead of VFs.

I'm not saying something is missing from the bifurcated/queue splitting work.
I'm not trying to work around SR-IOV.

This is an extreme idea, which I got while looking at the lowest RX layer.


Before working any further on this idea/path, I need/want to evaluate
if it makes sense from a performance point of view.  I need to evaluate
if "pulling" out these "packet-pages" is fast enough to compete with
DPDK/netmap.  Else it makes no sense to work on this path.

As a first step to evaluate this lowest RX layer, I'm simply hacking
the drivers (ixgbe and mlx5) to drop/discard packets within the driver.
For now, I simply replace napi_gro_receive() with dev_kfree_skb(), and
measure the "RX-drop" performance.
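
To make this concrete, the hack has roughly the following shape inside
the driver's RX poll loop (a minimal sketch only, not the actual
ixgbe/mlx5 code; hw_ring_next_skb() is a made-up placeholder for
walking the HW descriptor ring):

#include <linux/netdevice.h>

/* hypothetical helper: walk the HW descriptor ring, return next skb */
struct sk_buff *hw_ring_next_skb(struct napi_struct *napi);

/* Sketch only: the "drop in driver" test */
static int rx_poll_drop_test(struct napi_struct *napi, int budget)
{
        struct sk_buff *skb;
        int work_done = 0;

        while (work_done < budget && (skb = hw_ring_next_skb(napi))) {
                /* normal path would be: napi_gro_receive(napi, skb); */
                dev_kfree_skb(skb); /* drop here: only RX + alloc/free cost remains */
                work_done++;
        }
        return work_done;
}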

The next step was to avoid the skb alloc+free calls, but doing so is more
complicated than I first anticipated, as the SKB is tied in fairly
heavily.  Thus, right now I'm instead hooking in my bulk alloc+free
API, as that will remove/mitigate most of the overhead of the
kmem_cache/slab allocators.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-25 22:10                             ` Jesper Dangaard Brouer
@ 2016-01-27 20:47                               ` Jesper Dangaard Brouer
  2016-01-27 21:56                                 ` Alexei Starovoitov
  2016-01-28  2:50                                 ` Tom Herbert
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-27 20:47 UTC (permalink / raw)
  To: John Fastabend
  Cc: Tom Herbert, Michael S. Tsirkin, David Miller, Eric Dumazet,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich, brouer

On Mon, 25 Jan 2016 23:10:16 +0100
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > On 16-01-25 09:09 AM, Tom Herbert wrote:  
> > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
> > > <brouer@redhat.com> wrote:    
> > >>  
> [...]
> > >>
> > >> There are two ideas, getting mixed up here.  (1) bundling from the
> > >> RX-ring, (2) allowing to pick up the "packet-page" directly.
> > >>
> > >> Bundling (1) is something that seems natural, and which help us
> > >> amortize the cost between layers (and utilizes icache better). Lets
> > >> keep that in another thread.
> > >>
> > >> This (2) direct forward of "packet-pages" is a fairly extreme idea,
> > >> BUT it have the potential of being an new integration point for
> > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
> > >> speed with bypass-solutions.  
> >  
> [...]
> > 
> > Jesper, at least for you (2) case what are we missing with the
> > bifurcated/queue splitting work? Are you really after systems
> > without SR-IOV support or are you trying to get this on the order
> > of queues instead of VFs.  
> 
> I'm not saying something is missing for bifurcated/queue splitting work.
> I'm not trying to work-around SR-IOV.
> 
> This an extreme idea, which I got while looking at the lowest RX layer.
> 
> 
> Before working any further on this idea/path, I need/want to evaluate
> if it makes sense from a performance point of view.  I need to evaluate
> if "pulling" out these "packet-pages" is fast enough to compete with
> DPDK/netmap.  Else it makes no sense to work on this path.
> 
> As a first step to evaluate this lowest RX layer, I'm simply hacking
> the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver.
> For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and
> measuring the "RX-drop" performance.
> 
> Next step was to avoid the skb alloc+free calls, but doing so is more
> complicated that I first anticipated, as the SKB is tied in fairly
> heavily.  Thus, right now I'm instead hooking in my bulk alloc+free
> API, as that will remove/mitigate most of the overhead of the
> kmem_cache/slab-allocators.

I've tried to deduce what kind of speeds we can achieve at this lowest
RX layer, by dropping packets directly in the mlx5/100G driver.
Just replacing napi_gro_receive() with dev_kfree_skb() was fairly
depressing, showing only 6.2 Mpps (6,253,970 pps => 159.9 ns) (single core).

Looking at the perf report showed a major cache-miss in eth_type_trans() (29% / 47 ns).

And the driver is hitting the SLUB slowpath quite badly (because it
preallocs SKBs and binds them to the RX ring; usually this test case
would hit the SLUB "recycle" fastpath):

Group-report: kmem_cache/SLUB allocator functions ::
  5.00 % ~=  8.0 ns <= __slab_free
  4.91 % ~=  7.9 ns <= cmpxchg_double_slab.isra.65
  4.22 % ~=  6.7 ns <= kmem_cache_alloc
  1.68 % ~=  2.7 ns <= kmem_cache_free
  1.10 % ~=  1.8 ns <= ___slab_alloc
  0.93 % ~=  1.5 ns <= __cmpxchg_double_slab.isra.54
  0.65 % ~=  1.0 ns <= __slab_alloc.isra.74
  0.26 % ~=  0.4 ns <= put_cpu_partial
 Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns

To get around the cache-miss in eth_type_trans(), I created an
"icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out"
before calling eth_type_trans(), reducing its cost to 2.45%.
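
The general shape of that loop is roughly as below (a sketch of the
idea only, not the actual mlx5 patch; RX_BULK and hw_ring_next_skb()
are made-up names). In the drop test, stage 2 ends in dev_kfree_skb()
instead of the stack call:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

#define RX_BULK 64 /* illustrative bundle size */

/* hypothetical helper: walk the HW descriptor ring, return next skb */
struct sk_buff *hw_ring_next_skb(struct napi_struct *napi);

/* Sketch: stage 1 stays in a tight icache-friendly loop over the HW
 * descriptors and never touches packet data; stage 2 then takes the
 * data-cache miss in eth_type_trans() for all packets back-to-back.
 */
static void rx_poll_two_stage(struct napi_struct *napi, struct net_device *dev)
{
        struct sk_buff *bulk[RX_BULK];
        int i, n = 0;

        /* Stage 1: pull packets "out" of the RX ring */
        while (n < RX_BULK && (bulk[n] = hw_ring_next_skb(napi)))
                n++;

        /* Stage 2: touch packet data, then hand the skbs onwards */
        for (i = 0; i < n; i++) {
                bulk[i]->protocol = eth_type_trans(bulk[i], dev);
                napi_gro_receive(napi, bulk[i]); /* dev_kfree_skb() in drop test */
        }
}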

To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API, and
also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger
slab-pages, and thus bigger bulk opportunities.
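
(For reference, the underlying slab bulk calls are used roughly like
this. A sketch only: in reality this sits behind SKB-level helpers,
and the exact return convention of the bulk API has varied between
kernel versions.)

#include <linux/errno.h>
#include <linux/slab.h>

#define SKB_BULK 32 /* illustrative refill batch size */

/* Sketch: refill a local array of SKB objects with one bulk slab call
 * instead of one kmem_cache_alloc() per packet, and free them the same
 * way, batched once per NAPI cycle.
 */
static int skb_cache_refill(struct kmem_cache *cache, void **objs)
{
        int n = kmem_cache_alloc_bulk(cache, GFP_ATOMIC, SKB_BULK, objs);

        return n ? n : -ENOMEM; /* 0 means the bulk alloc failed */
}

static void skb_cache_flush(struct kmem_cache *cache, void **objs, int n)
{
        kmem_cache_free_bulk(cache, n, objs);
}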

This helped a lot; I can now drop 12 Mpps (12,088,767 pps => 82.7 ns).

Group-report: kmem_cache/SLUB allocator functions ::
  4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
  2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
  0.24 % ~=  0.2 ns <= ___slab_alloc
  0.23 % ~=  0.2 ns <= __slab_free
  0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
  0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
  0.07 % ~=  0.1 ns <= put_cpu_partial
  0.04 % ~=  0.0 ns <= unfreeze_partials.isra.71
  0.03 % ~=  0.0 ns <= get_partial_node.isra.72
 Sum:  8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns

The full perf report output, below my signature, is from the optimized case.

SKB-related cost is 22.9 ns.  However, 51.7% (11.84 ns) of that cost
originates from the memset of the SKB.

Group-report: related to pattern "skb" ::
 17.92 % ~= 14.8 ns <= __napi_alloc_skb   <== 80% memset(0) / rep stos
  3.29 % ~=  2.7 ns <= skb_release_data
  2.20 % ~=  1.8 ns <= napi_consume_skb
  1.86 % ~=  1.5 ns <= skb_release_head_state
  1.20 % ~=  1.0 ns <= skb_put
  1.14 % ~=  0.9 ns <= skb_release_all
  0.02 % ~=  0.0 ns <= __kfree_skb_flush
 Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns

Doing a crude extrapolation: 82.7 ns minus the SLUB-related (7.3 ns)
and SKB-related (22.9 ns) costs => 52.5 ns -> extrapolating, ~19 Mpps
would be the maximum speed at which we can pull packet-pages off the
RX ring.

I don't know if 19 Mpps (52.5 ns "overhead") is fast enough to compete
with just mapping an RX HW queue/ring to netmap, or via SR-IOV to DPDK(?)

But it was interesting to see how the lowest RX layer performs...
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Perf-report script:
 * https://github.com/netoptimizer/network-testing/blob/master/bin/perf_report_pps_stats.pl

Report: ALL functions ::
 19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq
 17.92 % ~= 14.8 ns <= __napi_alloc_skb
  9.54 % ~=  7.9 ns <= __free_page_frag
  7.16 % ~=  5.9 ns <= mlx5e_get_cqe
  6.37 % ~=  5.3 ns <= mlx5e_post_rx_wqes
  4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
  3.70 % ~=  3.1 ns <= __alloc_page_frag
  3.29 % ~=  2.7 ns <= skb_release_data
  2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
  2.45 % ~=  2.0 ns <= eth_type_trans
  2.43 % ~=  2.0 ns <= get_page_from_freelist
  2.36 % ~=  2.0 ns <= swiotlb_map_page
  2.20 % ~=  1.8 ns <= napi_consume_skb
  1.86 % ~=  1.5 ns <= skb_release_head_state
  1.25 % ~=  1.0 ns <= free_pages_prepare
  1.20 % ~=  1.0 ns <= skb_put
  1.14 % ~=  0.9 ns <= skb_release_all
  0.77 % ~=  0.6 ns <= __free_pages_ok
  0.59 % ~=  0.5 ns <= get_pfnblock_flags_mask
  0.59 % ~=  0.5 ns <= swiotlb_dma_mapping_error
  0.59 % ~=  0.5 ns <= unmap_single
  0.58 % ~=  0.5 ns <= _raw_spin_lock_irqsave
  0.57 % ~=  0.5 ns <= free_one_page
  0.56 % ~=  0.5 ns <= swiotlb_unmap_page
  0.52 % ~=  0.4 ns <= _raw_spin_lock
  0.46 % ~=  0.4 ns <= __mod_zone_page_state
  0.36 % ~=  0.3 ns <= __rmqueue
  0.36 % ~=  0.3 ns <= net_rx_action
  0.34 % ~=  0.3 ns <= __alloc_pages_nodemask
  0.31 % ~=  0.3 ns <= __zone_watermark_ok
  0.27 % ~=  0.2 ns <= mlx5e_napi_poll
  0.24 % ~=  0.2 ns <= ___slab_alloc
  0.23 % ~=  0.2 ns <= __slab_free
  0.22 % ~=  0.2 ns <= __list_del_entry
  0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
  0.21 % ~=  0.2 ns <= next_zones_zonelist
  0.20 % ~=  0.2 ns <= __list_add
  0.17 % ~=  0.1 ns <= __do_softirq
  0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
  0.16 % ~=  0.1 ns <= __inc_zone_state
  0.12 % ~=  0.1 ns <= _raw_spin_unlock
  0.12 % ~=  0.1 ns <= zone_statistics
 (Percent limit(0.1%) stop at "mlx5e_poll_tx_cq")
 Sum: 99.45 % => calc: 82.3 ns (sum: 82.3 ns) => Total: 82.7 ns

Group-report: related to pattern "eth_type_trans|mlx5|ixgbe|__iowrite64_copy" ::
 (Driver related)
  19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq
  7.16 % ~=  5.9 ns <= mlx5e_get_cqe
  6.37 % ~=  5.3 ns <= mlx5e_post_rx_wqes
  2.45 % ~=  2.0 ns <= eth_type_trans
  0.27 % ~=  0.2 ns <= mlx5e_napi_poll
  0.09 % ~=  0.1 ns <= mlx5e_poll_tx_cq
 Sum: 36.05 % => calc: 29.8 ns (sum: 29.8 ns) => Total: 82.7 ns

Group-report: DMA functions ::
  2.36 % ~=  2.0 ns <= swiotlb_map_page
  0.59 % ~=  0.5 ns <= unmap_single
  0.59 % ~=  0.5 ns <= swiotlb_dma_mapping_error
  0.56 % ~=  0.5 ns <= swiotlb_unmap_page
 Sum:  4.10 % => calc: 3.4 ns (sum: 3.4 ns) => Total: 82.7 ns

Group-report: page_frag_cache functions ::
  9.54 % ~=  7.9 ns <= __free_page_frag
  3.70 % ~=  3.1 ns <= __alloc_page_frag
  2.43 % ~=  2.0 ns <= get_page_from_freelist
  1.25 % ~=  1.0 ns <= free_pages_prepare
  0.77 % ~=  0.6 ns <= __free_pages_ok
  0.59 % ~=  0.5 ns <= get_pfnblock_flags_mask
  0.57 % ~=  0.5 ns <= free_one_page
  0.46 % ~=  0.4 ns <= __mod_zone_page_state
  0.36 % ~=  0.3 ns <= __rmqueue
  0.34 % ~=  0.3 ns <= __alloc_pages_nodemask
  0.31 % ~=  0.3 ns <= __zone_watermark_ok
  0.21 % ~=  0.2 ns <= next_zones_zonelist
  0.16 % ~=  0.1 ns <= __inc_zone_state
  0.12 % ~=  0.1 ns <= zone_statistics
  0.02 % ~=  0.0 ns <= mod_zone_page_state
 Sum: 20.83 % => calc: 17.2 ns (sum: 17.2 ns) => Total: 82.7 ns

Group-report: kmem_cache/SLUB allocator functions ::
  4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
  2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
  0.24 % ~=  0.2 ns <= ___slab_alloc
  0.23 % ~=  0.2 ns <= __slab_free
  0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
  0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
  0.07 % ~=  0.1 ns <= put_cpu_partial
  0.04 % ~=  0.0 ns <= unfreeze_partials.isra.71
  0.03 % ~=  0.0 ns <= get_partial_node.isra.72
 Sum:  8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns

 Group-report: related to pattern "skb" ::
 17.92 % ~= 14.8 ns <= __napi_alloc_skb   <== 80% memset(0) / rep stos
  3.29 % ~=  2.7 ns <= skb_release_data
  2.20 % ~=  1.8 ns <= napi_consume_skb
  1.86 % ~=  1.5 ns <= skb_release_head_state
  1.20 % ~=  1.0 ns <= skb_put
  1.14 % ~=  0.9 ns <= skb_release_all
  0.02 % ~=  0.0 ns <= __kfree_skb_flush
 Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns

Group-report: Core network-stack functions ::
  0.36 % ~=  0.3 ns <= net_rx_action
  0.17 % ~=  0.1 ns <= __do_softirq
  0.02 % ~=  0.0 ns <= __raise_softirq_irqoff
  0.01 % ~=  0.0 ns <= run_ksoftirqd
  0.00 % ~=  0.0 ns <= run_timer_softirq
  0.00 % ~=  0.0 ns <= ksoftirqd_should_run
  0.00 % ~=  0.0 ns <= raise_softirq
 Sum:  0.56 % => calc: 0.5 ns (sum: 0.5 ns) => Total: 82.7 ns

Group-report: GRO network-stack functions ::
 Sum:  0.00 % => calc: 0.0 ns (sum: 0.0 ns) => Total: 82.7 ns

Group-report: related to pattern "spin_.*lock|mutex" ::
  0.58 % ~=  0.5 ns <= _raw_spin_lock_irqsave
  0.52 % ~=  0.4 ns <= _raw_spin_lock
  0.12 % ~=  0.1 ns <= _raw_spin_unlock
  0.01 % ~=  0.0 ns <= _raw_spin_unlock_irqrestore
  0.00 % ~=  0.0 ns <= __mutex_lock_slowpath
  0.00 % ~=  0.0 ns <= _raw_spin_lock_irq
 Sum:  1.23 % => calc: 1.0 ns (sum: 1.0 ns) => Total: 82.7 ns

 Negative Report: functions NOT included in group reports::
  0.22 % ~=  0.2 ns <= __list_del_entry
  0.20 % ~=  0.2 ns <= __list_add
  0.07 % ~=  0.1 ns <= list_del
  0.05 % ~=  0.0 ns <= native_sched_clock
  0.04 % ~=  0.0 ns <= irqtime_account_irq
  0.02 % ~=  0.0 ns <= rcu_bh_qs
  0.01 % ~=  0.0 ns <= task_tick_fair
  0.01 % ~=  0.0 ns <= net_rps_action_and_irq_enable.isra.112
  0.01 % ~=  0.0 ns <= perf_event_task_tick
  0.01 % ~=  0.0 ns <= apic_timer_interrupt
  0.01 % ~=  0.0 ns <= lapic_next_deadline
  0.01 % ~=  0.0 ns <= rcu_check_callbacks
  0.01 % ~=  0.0 ns <= smpboot_thread_fn
  0.01 % ~=  0.0 ns <= irqtime_account_process_tick.isra.3
  0.00 % ~=  0.0 ns <= intel_bts_enable_local
  0.00 % ~=  0.0 ns <= kthread_should_park
  0.00 % ~=  0.0 ns <= native_apic_mem_write
  0.00 % ~=  0.0 ns <= hrtimer_forward
  0.00 % ~=  0.0 ns <= get_work_pool
  0.00 % ~=  0.0 ns <= cpu_startup_entry
  0.00 % ~=  0.0 ns <= acct_account_cputime
  0.00 % ~=  0.0 ns <= set_next_entity
  0.00 % ~=  0.0 ns <= worker_thread
  0.00 % ~=  0.0 ns <= dbs_timer_handler
  0.00 % ~=  0.0 ns <= delay_tsc
  0.00 % ~=  0.0 ns <= idle_cpu
  0.00 % ~=  0.0 ns <= timerqueue_add
  0.00 % ~=  0.0 ns <= hrtimer_interrupt
  0.00 % ~=  0.0 ns <= dbs_work_handler
  0.00 % ~=  0.0 ns <= dequeue_entity
  0.00 % ~=  0.0 ns <= update_cfs_shares
  0.00 % ~=  0.0 ns <= update_fast_timekeeper
  0.00 % ~=  0.0 ns <= smp_trace_apic_timer_interrupt
  0.00 % ~=  0.0 ns <= __update_cpu_load
  0.00 % ~=  0.0 ns <= cpu_needs_another_gp
  0.00 % ~=  0.0 ns <= ret_from_intr
  0.00 % ~=  0.0 ns <= __intel_pmu_enable_all
  0.00 % ~=  0.0 ns <= trigger_load_balance
  0.00 % ~=  0.0 ns <= __schedule
  0.00 % ~=  0.0 ns <= nsecs_to_jiffies64
  0.00 % ~=  0.0 ns <= account_entity_dequeue
  0.00 % ~=  0.0 ns <= worker_enter_idle
  0.00 % ~=  0.0 ns <= __hrtimer_get_next_event
  0.00 % ~=  0.0 ns <= rcu_irq_exit
  0.00 % ~=  0.0 ns <= rb_erase
  0.00 % ~=  0.0 ns <= __intel_pmu_disable_all
  0.00 % ~=  0.0 ns <= tick_sched_do_timer
  0.00 % ~=  0.0 ns <= cpuacct_account_field
  0.00 % ~=  0.0 ns <= update_wall_time
  0.00 % ~=  0.0 ns <= notifier_call_chain
  0.00 % ~=  0.0 ns <= timekeeping_update
  0.00 % ~=  0.0 ns <= ktime_get_update_offsets_now
  0.00 % ~=  0.0 ns <= rb_next
  0.00 % ~=  0.0 ns <= rcu_all_qs
  0.00 % ~=  0.0 ns <= x86_pmu_disable
  0.00 % ~=  0.0 ns <= _cond_resched
  0.00 % ~=  0.0 ns <= __rcu_read_lock
  0.00 % ~=  0.0 ns <= __local_bh_enable
  0.00 % ~=  0.0 ns <= update_cpu_load_active
  0.00 % ~=  0.0 ns <= x86_pmu_enable
  0.00 % ~=  0.0 ns <= insert_work
  0.00 % ~=  0.0 ns <= ktime_get
  0.00 % ~=  0.0 ns <= __usecs_to_jiffies
  0.00 % ~=  0.0 ns <= __acct_update_integrals
  0.00 % ~=  0.0 ns <= scheduler_tick
  0.00 % ~=  0.0 ns <= update_vsyscall
  0.00 % ~=  0.0 ns <= memcpy_erms
  0.00 % ~=  0.0 ns <= get_cpu_idle_time_us
  0.00 % ~=  0.0 ns <= sched_clock_cpu
  0.00 % ~=  0.0 ns <= tick_do_update_jiffies64
  0.00 % ~=  0.0 ns <= hrtimer_active
  0.00 % ~=  0.0 ns <= profile_tick
  0.00 % ~=  0.0 ns <= __hrtimer_run_queues
  0.00 % ~=  0.0 ns <= kthread_should_stop
  0.00 % ~=  0.0 ns <= run_posix_cpu_timers
  0.00 % ~=  0.0 ns <= read_tsc
  0.00 % ~=  0.0 ns <= __remove_hrtimer
  0.00 % ~=  0.0 ns <= calc_global_load_tick
  0.00 % ~=  0.0 ns <= hrtimer_run_queues
  0.00 % ~=  0.0 ns <= irq_work_tick
  0.00 % ~=  0.0 ns <= cpuacct_charge
  0.00 % ~=  0.0 ns <= clockevents_program_event
  0.00 % ~=  0.0 ns <= update_blocked_averages
 Sum:  0.68 % => calc: 0.6 ns (sum: 0.6 ns) => Total: 82.7 ns

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-27 20:47                               ` Jesper Dangaard Brouer
@ 2016-01-27 21:56                                 ` Alexei Starovoitov
  2016-01-28  9:52                                   ` Jesper Dangaard Brouer
  2016-01-28  2:50                                 ` Tom Herbert
  1 sibling, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-01-27 21:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Tom Herbert, Michael S. Tsirkin, David Miller,
	Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Daniel Borkmann, Vladislav Yasevich

On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote:
>  Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns
> 
> To get around the cache-miss in eth_type_trans(), I created a
> "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out",
> before calling eth_type_trans(), reducing cost to 2.45%.
> 
> To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API .  And
> also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger
> slab-pages, thus bigger bulk opportunities.
> 
> This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns).

Great stuff. I think such a batching loop will reduce the cost of
eth_type_trans() for all use cases.
It's only unfortunate that it would need to be implemented in every
driver, but there are only a handful that people care about in
high-performance setups, so I think it's worth getting this patch in
for mlx5, and the other drivers will catch up.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-27 20:47                               ` Jesper Dangaard Brouer
  2016-01-27 21:56                                 ` Alexei Starovoitov
@ 2016-01-28  2:50                                 ` Tom Herbert
  2016-01-28  9:25                                   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-01-28  2:50 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich

On Wed, Jan 27, 2016 at 12:47 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 25 Jan 2016 23:10:16 +0100
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>> On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
>>
>> > On 16-01-25 09:09 AM, Tom Herbert wrote:
>> > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
>> > > <brouer@redhat.com> wrote:
>> > >>
>> [...]
>> > >>
>> > >> There are two ideas, getting mixed up here.  (1) bundling from the
>> > >> RX-ring, (2) allowing to pick up the "packet-page" directly.
>> > >>
>> > >> Bundling (1) is something that seems natural, and which help us
>> > >> amortize the cost between layers (and utilizes icache better). Lets
>> > >> keep that in another thread.
>> > >>
>> > >> This (2) direct forward of "packet-pages" is a fairly extreme idea,
>> > >> BUT it have the potential of being an new integration point for
>> > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
>> > >> speed with bypass-solutions.
>> >
>> [...]
>> >
>> > Jesper, at least for you (2) case what are we missing with the
>> > bifurcated/queue splitting work? Are you really after systems
>> > without SR-IOV support or are you trying to get this on the order
>> > of queues instead of VFs.
>>
>> I'm not saying something is missing for bifurcated/queue splitting work.
>> I'm not trying to work-around SR-IOV.
>>
>> This an extreme idea, which I got while looking at the lowest RX layer.
>>
>>
>> Before working any further on this idea/path, I need/want to evaluate
>> if it makes sense from a performance point of view.  I need to evaluate
>> if "pulling" out these "packet-pages" is fast enough to compete with
>> DPDK/netmap.  Else it makes no sense to work on this path.
>>
>> As a first step to evaluate this lowest RX layer, I'm simply hacking
>> the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver.
>> For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and
>> measuring the "RX-drop" performance.
>>
>> Next step was to avoid the skb alloc+free calls, but doing so is more
>> complicated that I first anticipated, as the SKB is tied in fairly
>> heavily.  Thus, right now I'm instead hooking in my bulk alloc+free
>> API, as that will remove/mitigate most of the overhead of the
>> kmem_cache/slab-allocators.
>
> I've tried to deduct that kind of speeds we can achieve, at this lowest
> RX layer. By in the mlx5/100G driver drop packets directly in the driver.
> Just replacing replacing napi_gro_receive() with dev_kfree_skb(), was
> fairly depressing, showing only 6.2Mpps (6253970 pps => 159.9 ns) (single core).
>
> Looking at the perf report showed major cache-miss in eth_type_trans(29%/47ns).
>
> And driver is hitting the SLUB slowpath quite badly (because it
> prealloc SKBs and binds to RX ring, usually this test case would hits
> SLUB "recycle" fastpath):
>
> Group-report: kmem_cache/SLUB allocator functions ::
>   5.00 % ~=  8.0 ns <= __slab_free
>   4.91 % ~=  7.9 ns <= cmpxchg_double_slab.isra.65
>   4.22 % ~=  6.7 ns <= kmem_cache_alloc
>   1.68 % ~=  2.7 ns <= kmem_cache_free
>   1.10 % ~=  1.8 ns <= ___slab_alloc
>   0.93 % ~=  1.5 ns <= __cmpxchg_double_slab.isra.54
>   0.65 % ~=  1.0 ns <= __slab_alloc.isra.74
>   0.26 % ~=  0.4 ns <= put_cpu_partial
>  Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns
>
> To get around the cache-miss in eth_type_trans(), I created a
> "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out",
> before calling eth_type_trans(), reducing cost to 2.45%.
>
> To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API .  And
> also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger
> slab-pages, thus bigger bulk opportunities.
>
> This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns).
>
> Group-report: kmem_cache/SLUB allocator functions ::
>   4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
>   2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
>   0.24 % ~=  0.2 ns <= ___slab_alloc
>   0.23 % ~=  0.2 ns <= __slab_free
>   0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
>   0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
>   0.07 % ~=  0.1 ns <= put_cpu_partial
>   0.04 % ~=  0.0 ns <= unfreeze_partials.isra.71
>   0.03 % ~=  0.0 ns <= get_partial_node.isra.72
>  Sum:  8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns
>
> Full perf report output below signature, is from optimized case.
>
> SKB related cost is 22.9 ns.  However 51.7% (11.84ns) cost originates
> from memset of the SKB.
>
> Group-report: related to pattern "skb" ::
>  17.92 % ~= 14.8 ns <= __napi_alloc_skb   <== 80% memset(0) / rep stos
>   3.29 % ~=  2.7 ns <= skb_release_data
>   2.20 % ~=  1.8 ns <= napi_consume_skb
>   1.86 % ~=  1.5 ns <= skb_release_head_state
>   1.20 % ~=  1.0 ns <= skb_put
>   1.14 % ~=  0.9 ns <= skb_release_all
>   0.02 % ~=  0.0 ns <= __kfree_skb_flush
>  Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns
>
> Doing a crude extrapolation, 82.7 ns subtract, SLUB (7.3 ns) and SKB
> (22.9 ns) related => 52.5 ns -> extrapolate 19 Mpps would be the
> maximum speed we can pull off packet-pages from the RX ring.
>
> I don't know if 19Mpps (52.5 ns "overhead") is fast enough, to compete
> with just mapping a RX HW queue/ring to netmap or via SR-IOV to DPDK(?)
>
> But it was interesting to see how the lowest RX layer performs...

Cool stuff!

Looking at the typical driver receive path, I wonder if we should
break netif_receive_skb (napi_gro_receive) into two parts: one utility
function to create a list of received skbs and prefetch the data,
called as the ring is processed, and another to give the list to the
stack (e.g. netif_receive_skbs) and defer eth_type_trans as long as
possible. Is something like this what you are contemplating?
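
Roughly something like the sketch below (purely illustrative:
netif_receive_skbs() does not exist today, and hw_ring_next_skb() is a
stand-in for the driver's descriptor walk):

#include <linux/netdevice.h>
#include <linux/prefetch.h>

/* hypothetical: does not exist today; eth_type_trans() would be done
 * inside, just before protocol demux */
void netif_receive_skbs(struct sk_buff_head *list);

/* hypothetical helper: walk the HW descriptor ring, return next skb */
struct sk_buff *hw_ring_next_skb(struct napi_struct *napi);

static void driver_rx_bundle(struct napi_struct *napi, int budget)
{
        struct sk_buff_head list;
        struct sk_buff *skb;
        int n = 0;

        __skb_queue_head_init(&list);

        /* Part 1: collect skbs while walking the ring, prefetch headers */
        while (n < budget && (skb = hw_ring_next_skb(napi))) {
                prefetch(skb->data);
                __skb_queue_tail(&list, skb);
                n++;
        }

        /* Part 2: hand the whole bundle to the stack in one call */
        netif_receive_skbs(&list);
}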

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
>
>
> Perf-report script:
>  * https://github.com/netoptimizer/network-testing/blob/master/bin/perf_report_pps_stats.pl
>
> Report: ALL functions ::
>  19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq
>  17.92 % ~= 14.8 ns <= __napi_alloc_skb
>   9.54 % ~=  7.9 ns <= __free_page_frag
>   7.16 % ~=  5.9 ns <= mlx5e_get_cqe
>   6.37 % ~=  5.3 ns <= mlx5e_post_rx_wqes
>   4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
>   3.70 % ~=  3.1 ns <= __alloc_page_frag
>   3.29 % ~=  2.7 ns <= skb_release_data
>   2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
>   2.45 % ~=  2.0 ns <= eth_type_trans
>   2.43 % ~=  2.0 ns <= get_page_from_freelist
>   2.36 % ~=  2.0 ns <= swiotlb_map_page
>   2.20 % ~=  1.8 ns <= napi_consume_skb
>   1.86 % ~=  1.5 ns <= skb_release_head_state
>   1.25 % ~=  1.0 ns <= free_pages_prepare
>   1.20 % ~=  1.0 ns <= skb_put
>   1.14 % ~=  0.9 ns <= skb_release_all
>   0.77 % ~=  0.6 ns <= __free_pages_ok
>   0.59 % ~=  0.5 ns <= get_pfnblock_flags_mask
>   0.59 % ~=  0.5 ns <= swiotlb_dma_mapping_error
>   0.59 % ~=  0.5 ns <= unmap_single
>   0.58 % ~=  0.5 ns <= _raw_spin_lock_irqsave
>   0.57 % ~=  0.5 ns <= free_one_page
>   0.56 % ~=  0.5 ns <= swiotlb_unmap_page
>   0.52 % ~=  0.4 ns <= _raw_spin_lock
>   0.46 % ~=  0.4 ns <= __mod_zone_page_state
>   0.36 % ~=  0.3 ns <= __rmqueue
>   0.36 % ~=  0.3 ns <= net_rx_action
>   0.34 % ~=  0.3 ns <= __alloc_pages_nodemask
>   0.31 % ~=  0.3 ns <= __zone_watermark_ok
>   0.27 % ~=  0.2 ns <= mlx5e_napi_poll
>   0.24 % ~=  0.2 ns <= ___slab_alloc
>   0.23 % ~=  0.2 ns <= __slab_free
>   0.22 % ~=  0.2 ns <= __list_del_entry
>   0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
>   0.21 % ~=  0.2 ns <= next_zones_zonelist
>   0.20 % ~=  0.2 ns <= __list_add
>   0.17 % ~=  0.1 ns <= __do_softirq
>   0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
>   0.16 % ~=  0.1 ns <= __inc_zone_state
>   0.12 % ~=  0.1 ns <= _raw_spin_unlock
>   0.12 % ~=  0.1 ns <= zone_statistics
>  (Percent limit(0.1%) stop at "mlx5e_poll_tx_cq")
>  Sum: 99.45 % => calc: 82.3 ns (sum: 82.3 ns) => Total: 82.7 ns
>
> Group-report: related to pattern "eth_type_trans|mlx5|ixgbe|__iowrite64_copy" ::
>  (Driver related)
>   19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq
>   7.16 % ~=  5.9 ns <= mlx5e_get_cqe
>   6.37 % ~=  5.3 ns <= mlx5e_post_rx_wqes
>   2.45 % ~=  2.0 ns <= eth_type_trans
>   0.27 % ~=  0.2 ns <= mlx5e_napi_poll
>   0.09 % ~=  0.1 ns <= mlx5e_poll_tx_cq
>  Sum: 36.05 % => calc: 29.8 ns (sum: 29.8 ns) => Total: 82.7 ns
>
> Group-report: DMA functions ::
>   2.36 % ~=  2.0 ns <= swiotlb_map_page
>   0.59 % ~=  0.5 ns <= unmap_single
>   0.59 % ~=  0.5 ns <= swiotlb_dma_mapping_error
>   0.56 % ~=  0.5 ns <= swiotlb_unmap_page
>  Sum:  4.10 % => calc: 3.4 ns (sum: 3.4 ns) => Total: 82.7 ns
>
> Group-report: page_frag_cache functions ::
>   9.54 % ~=  7.9 ns <= __free_page_frag
>   3.70 % ~=  3.1 ns <= __alloc_page_frag
>   2.43 % ~=  2.0 ns <= get_page_from_freelist
>   1.25 % ~=  1.0 ns <= free_pages_prepare
>   0.77 % ~=  0.6 ns <= __free_pages_ok
>   0.59 % ~=  0.5 ns <= get_pfnblock_flags_mask
>   0.57 % ~=  0.5 ns <= free_one_page
>   0.46 % ~=  0.4 ns <= __mod_zone_page_state
>   0.36 % ~=  0.3 ns <= __rmqueue
>   0.34 % ~=  0.3 ns <= __alloc_pages_nodemask
>   0.31 % ~=  0.3 ns <= __zone_watermark_ok
>   0.21 % ~=  0.2 ns <= next_zones_zonelist
>   0.16 % ~=  0.1 ns <= __inc_zone_state
>   0.12 % ~=  0.1 ns <= zone_statistics
>   0.02 % ~=  0.0 ns <= mod_zone_page_state
>  Sum: 20.83 % => calc: 17.2 ns (sum: 17.2 ns) => Total: 82.7 ns
>
> Group-report: kmem_cache/SLUB allocator functions ::
>   4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
>   2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
>   0.24 % ~=  0.2 ns <= ___slab_alloc
>   0.23 % ~=  0.2 ns <= __slab_free
>   0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
>   0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
>   0.07 % ~=  0.1 ns <= put_cpu_partial
>   0.04 % ~=  0.0 ns <= unfreeze_partials.isra.71
>   0.03 % ~=  0.0 ns <= get_partial_node.isra.72
>  Sum:  8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns
>
>  Group-report: related to pattern "skb" ::
>  17.92 % ~= 14.8 ns <= __napi_alloc_skb   <== 80% memset(0) / rep stos
>   3.29 % ~=  2.7 ns <= skb_release_data
>   2.20 % ~=  1.8 ns <= napi_consume_skb
>   1.86 % ~=  1.5 ns <= skb_release_head_state
>   1.20 % ~=  1.0 ns <= skb_put
>   1.14 % ~=  0.9 ns <= skb_release_all
>   0.02 % ~=  0.0 ns <= __kfree_skb_flush
>  Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns
>
> Group-report: Core network-stack functions ::
>   0.36 % ~=  0.3 ns <= net_rx_action
>   0.17 % ~=  0.1 ns <= __do_softirq
>   0.02 % ~=  0.0 ns <= __raise_softirq_irqoff
>   0.01 % ~=  0.0 ns <= run_ksoftirqd
>   0.00 % ~=  0.0 ns <= run_timer_softirq
>   0.00 % ~=  0.0 ns <= ksoftirqd_should_run
>   0.00 % ~=  0.0 ns <= raise_softirq
>  Sum:  0.56 % => calc: 0.5 ns (sum: 0.5 ns) => Total: 82.7 ns
>
> Group-report: GRO network-stack functions ::
>  Sum:  0.00 % => calc: 0.0 ns (sum: 0.0 ns) => Total: 82.7 ns
>
> Group-report: related to pattern "spin_.*lock|mutex" ::
>   0.58 % ~=  0.5 ns <= _raw_spin_lock_irqsave
>   0.52 % ~=  0.4 ns <= _raw_spin_lock
>   0.12 % ~=  0.1 ns <= _raw_spin_unlock
>   0.01 % ~=  0.0 ns <= _raw_spin_unlock_irqrestore
>   0.00 % ~=  0.0 ns <= __mutex_lock_slowpath
>   0.00 % ~=  0.0 ns <= _raw_spin_lock_irq
>  Sum:  1.23 % => calc: 1.0 ns (sum: 1.0 ns) => Total: 82.7 ns
>
>  Negative Report: functions NOT included in group reports::
>   0.22 % ~=  0.2 ns <= __list_del_entry
>   0.20 % ~=  0.2 ns <= __list_add
>   0.07 % ~=  0.1 ns <= list_del
>   0.05 % ~=  0.0 ns <= native_sched_clock
>   0.04 % ~=  0.0 ns <= irqtime_account_irq
>   0.02 % ~=  0.0 ns <= rcu_bh_qs
>   0.01 % ~=  0.0 ns <= task_tick_fair
>   0.01 % ~=  0.0 ns <= net_rps_action_and_irq_enable.isra.112
>   0.01 % ~=  0.0 ns <= perf_event_task_tick
>   0.01 % ~=  0.0 ns <= apic_timer_interrupt
>   0.01 % ~=  0.0 ns <= lapic_next_deadline
>   0.01 % ~=  0.0 ns <= rcu_check_callbacks
>   0.01 % ~=  0.0 ns <= smpboot_thread_fn
>   0.01 % ~=  0.0 ns <= irqtime_account_process_tick.isra.3
>   0.00 % ~=  0.0 ns <= intel_bts_enable_local
>   0.00 % ~=  0.0 ns <= kthread_should_park
>   0.00 % ~=  0.0 ns <= native_apic_mem_write
>   0.00 % ~=  0.0 ns <= hrtimer_forward
>   0.00 % ~=  0.0 ns <= get_work_pool
>   0.00 % ~=  0.0 ns <= cpu_startup_entry
>   0.00 % ~=  0.0 ns <= acct_account_cputime
>   0.00 % ~=  0.0 ns <= set_next_entity
>   0.00 % ~=  0.0 ns <= worker_thread
>   0.00 % ~=  0.0 ns <= dbs_timer_handler
>   0.00 % ~=  0.0 ns <= delay_tsc
>   0.00 % ~=  0.0 ns <= idle_cpu
>   0.00 % ~=  0.0 ns <= timerqueue_add
>   0.00 % ~=  0.0 ns <= hrtimer_interrupt
>   0.00 % ~=  0.0 ns <= dbs_work_handler
>   0.00 % ~=  0.0 ns <= dequeue_entity
>   0.00 % ~=  0.0 ns <= update_cfs_shares
>   0.00 % ~=  0.0 ns <= update_fast_timekeeper
>   0.00 % ~=  0.0 ns <= smp_trace_apic_timer_interrupt
>   0.00 % ~=  0.0 ns <= __update_cpu_load
>   0.00 % ~=  0.0 ns <= cpu_needs_another_gp
>   0.00 % ~=  0.0 ns <= ret_from_intr
>   0.00 % ~=  0.0 ns <= __intel_pmu_enable_all
>   0.00 % ~=  0.0 ns <= trigger_load_balance
>   0.00 % ~=  0.0 ns <= __schedule
>   0.00 % ~=  0.0 ns <= nsecs_to_jiffies64
>   0.00 % ~=  0.0 ns <= account_entity_dequeue
>   0.00 % ~=  0.0 ns <= worker_enter_idle
>   0.00 % ~=  0.0 ns <= __hrtimer_get_next_event
>   0.00 % ~=  0.0 ns <= rcu_irq_exit
>   0.00 % ~=  0.0 ns <= rb_erase
>   0.00 % ~=  0.0 ns <= __intel_pmu_disable_all
>   0.00 % ~=  0.0 ns <= tick_sched_do_timer
>   0.00 % ~=  0.0 ns <= cpuacct_account_field
>   0.00 % ~=  0.0 ns <= update_wall_time
>   0.00 % ~=  0.0 ns <= notifier_call_chain
>   0.00 % ~=  0.0 ns <= timekeeping_update
>   0.00 % ~=  0.0 ns <= ktime_get_update_offsets_now
>   0.00 % ~=  0.0 ns <= rb_next
>   0.00 % ~=  0.0 ns <= rcu_all_qs
>   0.00 % ~=  0.0 ns <= x86_pmu_disable
>   0.00 % ~=  0.0 ns <= _cond_resched
>   0.00 % ~=  0.0 ns <= __rcu_read_lock
>   0.00 % ~=  0.0 ns <= __local_bh_enable
>   0.00 % ~=  0.0 ns <= update_cpu_load_active
>   0.00 % ~=  0.0 ns <= x86_pmu_enable
>   0.00 % ~=  0.0 ns <= insert_work
>   0.00 % ~=  0.0 ns <= ktime_get
>   0.00 % ~=  0.0 ns <= __usecs_to_jiffies
>   0.00 % ~=  0.0 ns <= __acct_update_integrals
>   0.00 % ~=  0.0 ns <= scheduler_tick
>   0.00 % ~=  0.0 ns <= update_vsyscall
>   0.00 % ~=  0.0 ns <= memcpy_erms
>   0.00 % ~=  0.0 ns <= get_cpu_idle_time_us
>   0.00 % ~=  0.0 ns <= sched_clock_cpu
>   0.00 % ~=  0.0 ns <= tick_do_update_jiffies64
>   0.00 % ~=  0.0 ns <= hrtimer_active
>   0.00 % ~=  0.0 ns <= profile_tick
>   0.00 % ~=  0.0 ns <= __hrtimer_run_queues
>   0.00 % ~=  0.0 ns <= kthread_should_stop
>   0.00 % ~=  0.0 ns <= run_posix_cpu_timers
>   0.00 % ~=  0.0 ns <= read_tsc
>   0.00 % ~=  0.0 ns <= __remove_hrtimer
>   0.00 % ~=  0.0 ns <= calc_global_load_tick
>   0.00 % ~=  0.0 ns <= hrtimer_run_queues
>   0.00 % ~=  0.0 ns <= irq_work_tick
>   0.00 % ~=  0.0 ns <= cpuacct_charge
>   0.00 % ~=  0.0 ns <= clockevents_program_event
>   0.00 % ~=  0.0 ns <= update_blocked_averages
>  Sum:  0.68 % => calc: 0.6 ns (sum: 0.6 ns) => Total: 82.7 ns
>
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28  2:50                                 ` Tom Herbert
@ 2016-01-28  9:25                                   ` Jesper Dangaard Brouer
  2016-01-28 12:45                                     ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-28  9:25 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, Michael S. Tsirkin, David Miller, Eric Dumazet,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich, brouer

On Wed, 27 Jan 2016 18:50:27 -0800
Tom Herbert <tom@herbertland.com> wrote:

> On Wed, Jan 27, 2016 at 12:47 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Mon, 25 Jan 2016 23:10:16 +0100
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >  
> >> On Mon, 25 Jan 2016 09:50:16 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
> >>  
> >> > On 16-01-25 09:09 AM, Tom Herbert wrote:  
> >> > > On Mon, Jan 25, 2016 at 5:15 AM, Jesper Dangaard Brouer
> >> > > <brouer@redhat.com> wrote:  
> >> > >>  
> >> [...]  
> >> > >>
> >> > >> There are two ideas, getting mixed up here.  (1) bundling from the
> >> > >> RX-ring, (2) allowing to pick up the "packet-page" directly.
> >> > >>
> >> > >> Bundling (1) is something that seems natural, and which help us
> >> > >> amortize the cost between layers (and utilizes icache better). Lets
> >> > >> keep that in another thread.
> >> > >>
> >> > >> This (2) direct forward of "packet-pages" is a fairly extreme idea,
> >> > >> BUT it have the potential of being an new integration point for
> >> > >> "selective" bypass-solutions and bringing RAW/af_packet (RX) up-to
> >> > >> speed with bypass-solutions.  
> >> >  
> >> [...]  
> >> >
> >> > Jesper, at least for you (2) case what are we missing with the
> >> > bifurcated/queue splitting work? Are you really after systems
> >> > without SR-IOV support or are you trying to get this on the order
> >> > of queues instead of VFs.  
> >>
> >> I'm not saying something is missing for bifurcated/queue splitting work.
> >> I'm not trying to work-around SR-IOV.
> >>
> >> This an extreme idea, which I got while looking at the lowest RX layer.
> >>
> >>
> >> Before working any further on this idea/path, I need/want to evaluate
> >> if it makes sense from a performance point of view.  I need to evaluate
> >> if "pulling" out these "packet-pages" is fast enough to compete with
> >> DPDK/netmap.  Else it makes no sense to work on this path.
> >>
> >> As a first step to evaluate this lowest RX layer, I'm simply hacking
> >> the drivers (ixgbe and mlx5) to drop/discard packets within-the-driver.
> >> For now, simply replacing napi_gro_receive() with dev_kfree_skb(), and
> >> measuring the "RX-drop" performance.
> >>
> >> Next step was to avoid the skb alloc+free calls, but doing so is more
> >> complicated that I first anticipated, as the SKB is tied in fairly
> >> heavily.  Thus, right now I'm instead hooking in my bulk alloc+free
> >> API, as that will remove/mitigate most of the overhead of the
> >> kmem_cache/slab-allocators.  
> >
> > I've tried to deduce what kind of speeds we can achieve at this lowest
> > RX layer, by dropping packets directly in the mlx5/100G driver.
> > Just replacing napi_gro_receive() with dev_kfree_skb() was
> > fairly depressing, showing only 6.2Mpps (6253970 pps => 159.9 ns) (single core).
> >
> > Looking at the perf report showed major cache-miss in eth_type_trans(29%/47ns).
> >
> > And the driver is hitting the SLUB slowpath quite badly (because it
> > preallocs SKBs and binds them to the RX ring; usually this test case
> > would hit the SLUB "recycle" fastpath):
> >
> > Group-report: kmem_cache/SLUB allocator functions ::
> >   5.00 % ~=  8.0 ns <= __slab_free
> >   4.91 % ~=  7.9 ns <= cmpxchg_double_slab.isra.65
> >   4.22 % ~=  6.7 ns <= kmem_cache_alloc
> >   1.68 % ~=  2.7 ns <= kmem_cache_free
> >   1.10 % ~=  1.8 ns <= ___slab_alloc
> >   0.93 % ~=  1.5 ns <= __cmpxchg_double_slab.isra.54
> >   0.65 % ~=  1.0 ns <= __slab_alloc.isra.74
> >   0.26 % ~=  0.4 ns <= put_cpu_partial
> >  Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns
> >
> > To get around the cache-miss in eth_type_trans(), I created a
> > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out",
> > before calling eth_type_trans(), reducing cost to 2.45%.
> >
> > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API .  And
> > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger
> > slab-pages, thus bigger bulk opportunities.
> >
> > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns).
> >
> > Group-report: kmem_cache/SLUB allocator functions ::
> >   4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
> >   2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
> >   0.24 % ~=  0.2 ns <= ___slab_alloc
> >   0.23 % ~=  0.2 ns <= __slab_free
> >   0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
> >   0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
> >   0.07 % ~=  0.1 ns <= put_cpu_partial
> >   0.04 % ~=  0.0 ns <= unfreeze_partials.isra.71
> >   0.03 % ~=  0.0 ns <= get_partial_node.isra.72
> >  Sum:  8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns
> >
> > Full perf report output below signature, is from optimized case.
> >
> > SKB related cost is 22.9 ns.  However, 51.7% (11.84 ns) of that cost
> > originates from the memset of the SKB.
> >
> > Group-report: related to pattern "skb" ::
> >  17.92 % ~= 14.8 ns <= __napi_alloc_skb   <== 80% memset(0) / rep stos
> >   3.29 % ~=  2.7 ns <= skb_release_data
> >   2.20 % ~=  1.8 ns <= napi_consume_skb
> >   1.86 % ~=  1.5 ns <= skb_release_head_state
> >   1.20 % ~=  1.0 ns <= skb_put
> >   1.14 % ~=  0.9 ns <= skb_release_all
> >   0.02 % ~=  0.0 ns <= __kfree_skb_flush
> >  Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns
> >
> > Doing a crude extrapolation: 82.7 ns minus the SLUB (7.3 ns) and SKB
> > (22.9 ns) related costs => 52.5 ns -> extrapolating, ~19 Mpps would be
> > the maximum speed we can pull packet-pages off the RX ring.
> >
> > I don't know if 19Mpps (52.5 ns "overhead") is fast enough to compete
> > with just mapping an RX HW queue/ring to netmap or via SR-IOV to DPDK(?)
> >
> > But it was interesting to see how the lowest RX layer performs...  
> 
> Cool stuff!

Thanks :-)
 
> Looking at the typical driver receive path, I wonder if we should
> break netif_receive_skb (napi_gro_receive) into two parts. One utility
> function to create a list of received skb's and prefetch the data,
> called as the ring is processed, the other one to give the list to the
> stack (e.g. netif_receive_skbs) and defer eth_type_trans as long as
> possible. Is something like this what you are contemplating?

Yes, that is exactly what I'm contemplating :-)  That is idea "(1)".

A natural extension to this work, which I expect Tom will love, is to
also use the idea for RPS.  Once we have a SKB list in stack/GRO-layer,
then we could build a local sk_buff_head list for each remote CPU, by
calling get_rps_cpu().   And then enqueue_list_to_backlog, by a
skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call.

This would amortize the cost of transferring packets to a remote CPU,
which, AFAIK, Eric has pointed out costs approx ~133 ns.
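
Rough sketch of what I have in mind (not compiled; enqueue_list_to_backlog()
is a name I made up, and get_rps_cpu() is currently static in net/core/dev.c,
so its signature is simplified here):

	static void netif_receive_skb_list_rps(struct sk_buff_head *bundle,
					       struct net_device *dev)
	{
		/* NR_CPUS-sized on-stack array is for illustration only; a
		 * real version needs a small map of the (few) CPUs hit.
		 */
		struct sk_buff_head remote[NR_CPUS];
		struct sk_buff *skb;
		int cpu;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			__skb_queue_head_init(&remote[cpu]);

		/* Stage 1: bucket the bundle per target CPU, one hash lookup per skb */
		while ((skb = __skb_dequeue(bundle)) != NULL) {
			cpu = get_rps_cpu(dev, skb);	/* simplified signature */
			if (cpu < 0)
				cpu = smp_processor_id();
			__skb_queue_tail(&remote[cpu], skb);
		}

		/* Stage 2: one backlog enqueue (one lock grab) per remote CPU */
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (!skb_queue_empty(&remote[cpu]))
				enqueue_list_to_backlog(&remote[cpu], cpu);
	}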

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


> > Perf-report script:
> >  * https://github.com/netoptimizer/network-testing/blob/master/bin/perf_report_pps_stats.pl
> >
> > Report: ALL functions ::
> >  19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq
> >  17.92 % ~= 14.8 ns <= __napi_alloc_skb
> >   9.54 % ~=  7.9 ns <= __free_page_frag
> >   7.16 % ~=  5.9 ns <= mlx5e_get_cqe
> >   6.37 % ~=  5.3 ns <= mlx5e_post_rx_wqes
> >   4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
> >   3.70 % ~=  3.1 ns <= __alloc_page_frag
> >   3.29 % ~=  2.7 ns <= skb_release_data
> >   2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
> >   2.45 % ~=  2.0 ns <= eth_type_trans
> >   2.43 % ~=  2.0 ns <= get_page_from_freelist
> >   2.36 % ~=  2.0 ns <= swiotlb_map_page
> >   2.20 % ~=  1.8 ns <= napi_consume_skb
> >   1.86 % ~=  1.5 ns <= skb_release_head_state
> >   1.25 % ~=  1.0 ns <= free_pages_prepare
> >   1.20 % ~=  1.0 ns <= skb_put
> >   1.14 % ~=  0.9 ns <= skb_release_all
> >   0.77 % ~=  0.6 ns <= __free_pages_ok
> >   0.59 % ~=  0.5 ns <= get_pfnblock_flags_mask
> >   0.59 % ~=  0.5 ns <= swiotlb_dma_mapping_error
> >   0.59 % ~=  0.5 ns <= unmap_single
> >   0.58 % ~=  0.5 ns <= _raw_spin_lock_irqsave
> >   0.57 % ~=  0.5 ns <= free_one_page
> >   0.56 % ~=  0.5 ns <= swiotlb_unmap_page
> >   0.52 % ~=  0.4 ns <= _raw_spin_lock
> >   0.46 % ~=  0.4 ns <= __mod_zone_page_state
> >   0.36 % ~=  0.3 ns <= __rmqueue
> >   0.36 % ~=  0.3 ns <= net_rx_action
> >   0.34 % ~=  0.3 ns <= __alloc_pages_nodemask
> >   0.31 % ~=  0.3 ns <= __zone_watermark_ok
> >   0.27 % ~=  0.2 ns <= mlx5e_napi_poll
> >   0.24 % ~=  0.2 ns <= ___slab_alloc
> >   0.23 % ~=  0.2 ns <= __slab_free
> >   0.22 % ~=  0.2 ns <= __list_del_entry
> >   0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
> >   0.21 % ~=  0.2 ns <= next_zones_zonelist
> >   0.20 % ~=  0.2 ns <= __list_add
> >   0.17 % ~=  0.1 ns <= __do_softirq
> >   0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
> >   0.16 % ~=  0.1 ns <= __inc_zone_state
> >   0.12 % ~=  0.1 ns <= _raw_spin_unlock
> >   0.12 % ~=  0.1 ns <= zone_statistics
> >  (Percent limit(0.1%) stop at "mlx5e_poll_tx_cq")
> >  Sum: 99.45 % => calc: 82.3 ns (sum: 82.3 ns) => Total: 82.7 ns
> >
> > Group-report: related to pattern "eth_type_trans|mlx5|ixgbe|__iowrite64_copy" ::
> >  (Driver related)
> >   19.71 % ~= 16.3 ns <= mlx5e_poll_rx_cq
> >   7.16 % ~=  5.9 ns <= mlx5e_get_cqe
> >   6.37 % ~=  5.3 ns <= mlx5e_post_rx_wqes
> >   2.45 % ~=  2.0 ns <= eth_type_trans
> >   0.27 % ~=  0.2 ns <= mlx5e_napi_poll
> >   0.09 % ~=  0.1 ns <= mlx5e_poll_tx_cq
> >  Sum: 36.05 % => calc: 29.8 ns (sum: 29.8 ns) => Total: 82.7 ns
> >
> > Group-report: DMA functions ::
> >   2.36 % ~=  2.0 ns <= swiotlb_map_page
> >   0.59 % ~=  0.5 ns <= unmap_single
> >   0.59 % ~=  0.5 ns <= swiotlb_dma_mapping_error
> >   0.56 % ~=  0.5 ns <= swiotlb_unmap_page
> >  Sum:  4.10 % => calc: 3.4 ns (sum: 3.4 ns) => Total: 82.7 ns
> >
> > Group-report: page_frag_cache functions ::
> >   9.54 % ~=  7.9 ns <= __free_page_frag
> >   3.70 % ~=  3.1 ns <= __alloc_page_frag
> >   2.43 % ~=  2.0 ns <= get_page_from_freelist
> >   1.25 % ~=  1.0 ns <= free_pages_prepare
> >   0.77 % ~=  0.6 ns <= __free_pages_ok
> >   0.59 % ~=  0.5 ns <= get_pfnblock_flags_mask
> >   0.57 % ~=  0.5 ns <= free_one_page
> >   0.46 % ~=  0.4 ns <= __mod_zone_page_state
> >   0.36 % ~=  0.3 ns <= __rmqueue
> >   0.34 % ~=  0.3 ns <= __alloc_pages_nodemask
> >   0.31 % ~=  0.3 ns <= __zone_watermark_ok
> >   0.21 % ~=  0.2 ns <= next_zones_zonelist
> >   0.16 % ~=  0.1 ns <= __inc_zone_state
> >   0.12 % ~=  0.1 ns <= zone_statistics
> >   0.02 % ~=  0.0 ns <= mod_zone_page_state
> >  Sum: 20.83 % => calc: 17.2 ns (sum: 17.2 ns) => Total: 82.7 ns
> >
> > Group-report: kmem_cache/SLUB allocator functions ::
> >   4.99 % ~=  4.1 ns <= kmem_cache_alloc_bulk
> >   2.87 % ~=  2.4 ns <= kmem_cache_free_bulk
> >   0.24 % ~=  0.2 ns <= ___slab_alloc
> >   0.23 % ~=  0.2 ns <= __slab_free
> >   0.21 % ~=  0.2 ns <= __cmpxchg_double_slab.isra.54
> >   0.17 % ~=  0.1 ns <= cmpxchg_double_slab.isra.65
> >   0.07 % ~=  0.1 ns <= put_cpu_partial
> >   0.04 % ~=  0.0 ns <= unfreeze_partials.isra.71
> >   0.03 % ~=  0.0 ns <= get_partial_node.isra.72
> >  Sum:  8.85 % => calc: 7.3 ns (sum: 7.3 ns) => Total: 82.7 ns
> >
> >  Group-report: related to pattern "skb" ::
> >  17.92 % ~= 14.8 ns <= __napi_alloc_skb   <== 80% memset(0) / rep stos
> >   3.29 % ~=  2.7 ns <= skb_release_data
> >   2.20 % ~=  1.8 ns <= napi_consume_skb
> >   1.86 % ~=  1.5 ns <= skb_release_head_state
> >   1.20 % ~=  1.0 ns <= skb_put
> >   1.14 % ~=  0.9 ns <= skb_release_all
> >   0.02 % ~=  0.0 ns <= __kfree_skb_flush
> >  Sum: 27.63 % => calc: 22.9 ns (sum: 22.9 ns) => Total: 82.7 ns
> >
> > Group-report: Core network-stack functions ::
> >   0.36 % ~=  0.3 ns <= net_rx_action
> >   0.17 % ~=  0.1 ns <= __do_softirq
> >   0.02 % ~=  0.0 ns <= __raise_softirq_irqoff
> >   0.01 % ~=  0.0 ns <= run_ksoftirqd
> >   0.00 % ~=  0.0 ns <= run_timer_softirq
> >   0.00 % ~=  0.0 ns <= ksoftirqd_should_run
> >   0.00 % ~=  0.0 ns <= raise_softirq
> >  Sum:  0.56 % => calc: 0.5 ns (sum: 0.5 ns) => Total: 82.7 ns
> >
> > Group-report: GRO network-stack functions ::
> >  Sum:  0.00 % => calc: 0.0 ns (sum: 0.0 ns) => Total: 82.7 ns
> >
> > Group-report: related to pattern "spin_.*lock|mutex" ::
> >   0.58 % ~=  0.5 ns <= _raw_spin_lock_irqsave
> >   0.52 % ~=  0.4 ns <= _raw_spin_lock
> >   0.12 % ~=  0.1 ns <= _raw_spin_unlock
> >   0.01 % ~=  0.0 ns <= _raw_spin_unlock_irqrestore
> >   0.00 % ~=  0.0 ns <= __mutex_lock_slowpath
> >   0.00 % ~=  0.0 ns <= _raw_spin_lock_irq
> >  Sum:  1.23 % => calc: 1.0 ns (sum: 1.0 ns) => Total: 82.7 ns
> >
> >  Negative Report: functions NOT included in group reports::
> >   0.22 % ~=  0.2 ns <= __list_del_entry
> >   0.20 % ~=  0.2 ns <= __list_add
> >   0.07 % ~=  0.1 ns <= list_del
> >   0.05 % ~=  0.0 ns <= native_sched_clock
> >   0.04 % ~=  0.0 ns <= irqtime_account_irq
> >   0.02 % ~=  0.0 ns <= rcu_bh_qs
> >   0.01 % ~=  0.0 ns <= task_tick_fair
> >   0.01 % ~=  0.0 ns <= net_rps_action_and_irq_enable.isra.112
> >   0.01 % ~=  0.0 ns <= perf_event_task_tick
> >   0.01 % ~=  0.0 ns <= apic_timer_interrupt
> >   0.01 % ~=  0.0 ns <= lapic_next_deadline
> >   0.01 % ~=  0.0 ns <= rcu_check_callbacks
> >   0.01 % ~=  0.0 ns <= smpboot_thread_fn
> >   0.01 % ~=  0.0 ns <= irqtime_account_process_tick.isra.3
> >   0.00 % ~=  0.0 ns <= intel_bts_enable_local
> >   0.00 % ~=  0.0 ns <= kthread_should_park
> >   0.00 % ~=  0.0 ns <= native_apic_mem_write
> >   0.00 % ~=  0.0 ns <= hrtimer_forward
> >   0.00 % ~=  0.0 ns <= get_work_pool
> >   0.00 % ~=  0.0 ns <= cpu_startup_entry
> >   0.00 % ~=  0.0 ns <= acct_account_cputime
> >   0.00 % ~=  0.0 ns <= set_next_entity
> >   0.00 % ~=  0.0 ns <= worker_thread
> >   0.00 % ~=  0.0 ns <= dbs_timer_handler
> >   0.00 % ~=  0.0 ns <= delay_tsc
> >   0.00 % ~=  0.0 ns <= idle_cpu
> >   0.00 % ~=  0.0 ns <= timerqueue_add
> >   0.00 % ~=  0.0 ns <= hrtimer_interrupt
> >   0.00 % ~=  0.0 ns <= dbs_work_handler
> >   0.00 % ~=  0.0 ns <= dequeue_entity
> >   0.00 % ~=  0.0 ns <= update_cfs_shares
> >   0.00 % ~=  0.0 ns <= update_fast_timekeeper
> >   0.00 % ~=  0.0 ns <= smp_trace_apic_timer_interrupt
> >   0.00 % ~=  0.0 ns <= __update_cpu_load
> >   0.00 % ~=  0.0 ns <= cpu_needs_another_gp
> >   0.00 % ~=  0.0 ns <= ret_from_intr
> >   0.00 % ~=  0.0 ns <= __intel_pmu_enable_all
> >   0.00 % ~=  0.0 ns <= trigger_load_balance
> >   0.00 % ~=  0.0 ns <= __schedule
> >   0.00 % ~=  0.0 ns <= nsecs_to_jiffies64
> >   0.00 % ~=  0.0 ns <= account_entity_dequeue
> >   0.00 % ~=  0.0 ns <= worker_enter_idle
> >   0.00 % ~=  0.0 ns <= __hrtimer_get_next_event
> >   0.00 % ~=  0.0 ns <= rcu_irq_exit
> >   0.00 % ~=  0.0 ns <= rb_erase
> >   0.00 % ~=  0.0 ns <= __intel_pmu_disable_all
> >   0.00 % ~=  0.0 ns <= tick_sched_do_timer
> >   0.00 % ~=  0.0 ns <= cpuacct_account_field
> >   0.00 % ~=  0.0 ns <= update_wall_time
> >   0.00 % ~=  0.0 ns <= notifier_call_chain
> >   0.00 % ~=  0.0 ns <= timekeeping_update
> >   0.00 % ~=  0.0 ns <= ktime_get_update_offsets_now
> >   0.00 % ~=  0.0 ns <= rb_next
> >   0.00 % ~=  0.0 ns <= rcu_all_qs
> >   0.00 % ~=  0.0 ns <= x86_pmu_disable
> >   0.00 % ~=  0.0 ns <= _cond_resched
> >   0.00 % ~=  0.0 ns <= __rcu_read_lock
> >   0.00 % ~=  0.0 ns <= __local_bh_enable
> >   0.00 % ~=  0.0 ns <= update_cpu_load_active
> >   0.00 % ~=  0.0 ns <= x86_pmu_enable
> >   0.00 % ~=  0.0 ns <= insert_work
> >   0.00 % ~=  0.0 ns <= ktime_get
> >   0.00 % ~=  0.0 ns <= __usecs_to_jiffies
> >   0.00 % ~=  0.0 ns <= __acct_update_integrals
> >   0.00 % ~=  0.0 ns <= scheduler_tick
> >   0.00 % ~=  0.0 ns <= update_vsyscall
> >   0.00 % ~=  0.0 ns <= memcpy_erms
> >   0.00 % ~=  0.0 ns <= get_cpu_idle_time_us
> >   0.00 % ~=  0.0 ns <= sched_clock_cpu
> >   0.00 % ~=  0.0 ns <= tick_do_update_jiffies64
> >   0.00 % ~=  0.0 ns <= hrtimer_active
> >   0.00 % ~=  0.0 ns <= profile_tick
> >   0.00 % ~=  0.0 ns <= __hrtimer_run_queues
> >   0.00 % ~=  0.0 ns <= kthread_should_stop
> >   0.00 % ~=  0.0 ns <= run_posix_cpu_timers
> >   0.00 % ~=  0.0 ns <= read_tsc
> >   0.00 % ~=  0.0 ns <= __remove_hrtimer
> >   0.00 % ~=  0.0 ns <= calc_global_load_tick
> >   0.00 % ~=  0.0 ns <= hrtimer_run_queues
> >   0.00 % ~=  0.0 ns <= irq_work_tick
> >   0.00 % ~=  0.0 ns <= cpuacct_charge
> >   0.00 % ~=  0.0 ns <= clockevents_program_event
> >   0.00 % ~=  0.0 ns <= update_blocked_averages
> >  Sum:  0.68 % => calc: 0.6 ns (sum: 0.6 ns) => Total: 82.7 ns
> >
> >  

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-27 21:56                                 ` Alexei Starovoitov
@ 2016-01-28  9:52                                   ` Jesper Dangaard Brouer
  2016-01-28 12:54                                     ` Eric Dumazet
                                                       ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-28  9:52 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Tom Herbert, Michael S. Tsirkin, David Miller,
	Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Daniel Borkmann, Vladislav Yasevich, brouer

On Wed, 27 Jan 2016 13:56:03 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote:
> >  Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns
> > 
> > To get around the cache-miss in eth_type_trans(), I created a
> > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out",
> > before calling eth_type_trans(), reducing cost to 2.45%.
> > 
> > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API .  And
> > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger
> > slab-pages, thus bigger bulk opportunities.
> > 
> > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns).  
> 
> great stuff. I think such batching loop will reduce the cost of
> eth_type_trans() for all use cases.
> Only unfortunate that it would need to be implemented in every driver,
> but there is only a handful that people care about in high performance
> setups, so I think it's worth getting this patch in for mlx5 and
> the other drivers will catch up.

I'm still in flux/undecided about how long we should delay the first
touching of pkt-data, which happens when calling eth_type_trans().
Should it stay in the driver or not(?).

In the extreme case, when optimizing for RPS sending to remote CPUs, we
would delay calling eth_type_trans() as long as possible:

1. In driver only start prefetch data to L2/L3 cache
2. Stack calls get_rps_cpu() and assume skb_get_hash() have HW hash
3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue


On the other hand, if the HW desc can provide skb->proto, and we can
lazy eval skb->pkt_type, then it is okay to keep that responsibility in
the driver (as the call to eth_type_trans() basically disappears).
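
Roughly like this (pseudo-driver code, not real mlx5; the hw_desc struct
and its fields are invented, the point is that nothing dereferences
pkt-data here):

	static void rx_fill_skb_meta(struct sk_buff *skb,
				     const struct hw_desc *desc,
				     struct net_device *dev)
	{
		skb->dev = dev;

		/* protocol from the RX descriptor instead of eth_type_trans() */
		skb->protocol = desc->is_ipv6 ? htons(ETH_P_IPV6)
					      : htons(ETH_P_IP);

		/* assume unicast-to-us; pkt_type could be fixed up lazily if
		 * a consumer really needs broadcast/multicast/otherhost info */
		skb->pkt_type = PACKET_HOST;

		prefetch(skb->data);		/* step 1: only start the prefetch */
		__skb_pull(skb, ETH_HLEN);	/* what eth_type_trans() would do */
	}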

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28  9:25                                   ` Jesper Dangaard Brouer
@ 2016-01-28 12:45                                     ` Eric Dumazet
  2016-01-28 16:37                                       ` Tom Herbert
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2016-01-28 12:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, John Fastabend, Michael S. Tsirkin, David Miller,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich

On Thu, 2016-01-28 at 10:25 +0100, Jesper Dangaard Brouer wrote:

> Yes, that is exactly what I'm contemplating :-)  That is idea "(1)".
> 
> A natural extension to this work, which I expect Tom will love, is to
> also use the idea for RPS.  Once we have a SKB list in stack/GRO-layer,
> then we could build a local sk_buff_head list for each remote CPU, by
> calling get_rps_cpu().   And then enqueue_list_to_backlog, by a
> skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call.
> 
> This would amortize the cost of transferring packets to a remote CPU,
> which Eric AFAIK points out is costing approx ~133ns.
> 

Jesper, RPS and RFS already defer sending the IPI and submit batches to
remote cpus.

See commits 

e326bed2f47d0365da5a8faaf8ee93ed2d86325b ("rps: immediate send IPI in
process_backlog()")

88751275b8e867d756e4f86ae92afe0232de129f ("rps: shortcut
net_rps_action()")

And of course all the discussions we had to come up with
0a9627f2649a02bea165cfd529d7bcb625c2fcad ("rps: Receive Packet
Steering")

The current state :

net_rps_action_and_irq_enable() sends the IPI at the end of
net_rx_action(), once all NAPI handlers have been called and therefore
have accumulated packets and cooked the rps_ipi_list (via calls to
rps_ipi_queued() from enqueue_to_backlog()).


Adding another stage in the pipeline would not help.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28  9:52                                   ` Jesper Dangaard Brouer
@ 2016-01-28 12:54                                     ` Eric Dumazet
  2016-01-28 13:25                                     ` Eric Dumazet
  2016-01-28 16:43                                     ` Tom Herbert
  2 siblings, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-28 12:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, John Fastabend, Tom Herbert,
	Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Daniel Borkmann, Vladislav Yasevich

On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote:

> I'm still in flux/undecided how long we should delay the first touching
> of pkt-data, which happens when calling eth_type_trans().  Should it
> stay in the driver or not(?).
> 
> In the extreme case, when optimizing for RPS sending to remote CPUs, delay
> calling eth_type_trans() as long as possible.
> 
> 1. In driver only start prefetch data to L2/L3 cache
> 2. Stack calls get_rps_cpu() and assume skb_get_hash() have HW hash
> 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
> 4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue
> 
> 
> On the other hand, if the HW desc can provide skb->proto, and we can
> lazy eval skb->pkt_type, then it is okay to keep that responsibility in
> the driver (as the call to eth_type_trans() basically disappears).


Delaying means GRO won't be able to recycle its super hot skb (see
napi_get_frags())

You might optimize the reception of packets in the router case (poor GRO
aggregation rate), but you'll slow down GRO efficiency when receiving
nice GRO trains.

When we receive a train of 10 MSS, driver keeps using the same sk_buff,
very hot in its L1

(This was the original idea of build_skb(): to get nice cache locality
for the metadata, since it is 4 cache lines per sk_buff)

Now most drivers have no clue why it is important to allocate the skb
_after_ receiving the ethernet frame and not in advance.

(The lazy drivers allocate ~1024 skbs to prefill their ~1024 slot RX
ring)
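
For reference, the pattern build_skb() enables looks roughly like this
(sketch only; the rx_buf fields are invented):

	static struct sk_buff *rx_build_skb(struct rx_buffer *rx_buf,
					    unsigned int len)
	{
		/* The frame already sits in this buffer (DMA completed); only
		 * now do we allocate/initialize the skb metadata, so its ~4
		 * cache lines are written while the cpu is hot on this packet,
		 * and the skb_shared_info lives at the end of the same buffer.
		 */
		void *va = page_address(rx_buf->page) + rx_buf->page_offset;
		struct sk_buff *skb = build_skb(va, rx_buf->truesize);

		if (unlikely(!skb))
			return NULL;

		skb_reserve(skb, rx_buf->headroom);	/* e.g. NET_SKB_PAD */
		skb_put(skb, len);
		return skb;
	}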

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28  9:52                                   ` Jesper Dangaard Brouer
  2016-01-28 12:54                                     ` Eric Dumazet
@ 2016-01-28 13:25                                     ` Eric Dumazet
  2016-01-28 16:43                                     ` Tom Herbert
  2 siblings, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-28 13:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, John Fastabend, Tom Herbert,
	Michael S. Tsirkin, David Miller, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Daniel Borkmann, Vladislav Yasevich

On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote:

> I'm still in flux/undecided how long we should delay the first touching
> of pkt-data, which happens when calling eth_type_trans().  Should it
> stay in the driver or not(?).

Some cpus have limited prefetch capabilities.
Sometimes, prefetches need to be spaced, otherwise they are ignored.
A driver author might be tempted to 'optimize' their rx handler for a
few cpus.

Also, removing eth_type_trans() from the drivers would require quite
some work, but would be generic and certainly helpful.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28 12:45                                     ` Eric Dumazet
@ 2016-01-28 16:37                                       ` Tom Herbert
  2016-01-28 16:43                                         ` Eric Dumazet
  2016-01-28 17:04                                         ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 59+ messages in thread
From: Tom Herbert @ 2016-01-28 16:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, John Fastabend, Michael S. Tsirkin,
	David Miller, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich

On Thu, Jan 28, 2016 at 4:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2016-01-28 at 10:25 +0100, Jesper Dangaard Brouer wrote:
>
>> Yes, that is exactly what I'm contemplating :-)  That is idea "(1)".
>>
>> A natural extension to this work, which I expect Tom will love, is to
>> also use the idea for RPS.  Once we have a SKB list in stack/GRO-layer,
>> then we could build a local sk_buff_head list for each remote CPU, by
>> calling get_rps_cpu().   And then enqueue_list_to_backlog, by a
>> skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call.
>>
>> This would amortize the cost of transferring packets to a remote CPU,
>> which Eric AFAIK points out is costing approx ~133ns.
>>
>
> Jesper, RPS and RFS already defer sending the IPI and submit batches to
> remote cpus.
>
> See commits
>
> e326bed2f47d0365da5a8faaf8ee93ed2d86325b ("rps: immediate send IPI in
> process_backlog()")
>
> 88751275b8e867d756e4f86ae92afe0232de129f ("rps: shortcut
> net_rps_action()")
>
> And of course all the discussions we had to come up with
> 0a9627f2649a02bea165cfd529d7bcb625c2fcad ("rps: Receive Packet
> Steering")
>
> The current state :
>
> net_rps_action_and_irq_enable() sends the IPI at the end of
> net_rx_action() once all NAPI handlers have been called, and therefore
> have accumulated packets and cook rps_ipi_list (via calls to
> rps_ipi_queued() from enqueue_to_backlog())
>
>
> Adding another stage in the pipeline would not help.
>
skbs are enqueued on a CPU queue one at a time through
enqueue_to_backlog. It would be nice to do that as a batch of skbs.

>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28  9:52                                   ` Jesper Dangaard Brouer
  2016-01-28 12:54                                     ` Eric Dumazet
  2016-01-28 13:25                                     ` Eric Dumazet
@ 2016-01-28 16:43                                     ` Tom Herbert
  2 siblings, 0 replies; 59+ messages in thread
From: Tom Herbert @ 2016-01-28 16:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, John Fastabend, Michael S. Tsirkin,
	David Miller, Eric Dumazet, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai,
	Daniel Borkmann, Vladislav Yasevich

On Thu, Jan 28, 2016 at 1:52 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 27 Jan 2016 13:56:03 -0800
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Wed, Jan 27, 2016 at 09:47:50PM +0100, Jesper Dangaard Brouer wrote:
>> >  Sum: 18.75 % => calc: 30.0 ns (sum: 30.0 ns) => Total: 159.9 ns
>> >
>> > To get around the cache-miss in eth_type_trans(), I created a
>> > "icache-loop" in mlx5e_poll_rx_cq() and pull all RX-ring packets "out",
>> > before calling eth_type_trans(), reducing cost to 2.45%.
>> >
>> > To mitigate the SLUB slowpath, I used my slab + SKB-napi bulk API .  And
>> > also tuned SLUB (with slub_nomerge slub_min_objects=128) to get bigger
>> > slab-pages, thus bigger bulk opportunities.
>> >
>> > This helped a lot, I can now drop 12Mpps (12,088,767 => 82.7 ns).
>>
>> great stuff. I think such batching loop will reduce the cost of
>> eth_type_trans() for all use cases.
>> Only unfortunate that it would need to be implemented in every driver,
>> but there is only a handful that people care about in high performance
>> setups, so I think it's worth getting this patch in for mlx5 and
>> the other drivers will catch up.
>
> I'm still in flux/undecided how long we should delay the first touching
> of pkt-data, which happens when calling eth_type_trans().  Should it
> stay in the driver or not(?).
>
> In the extreme case, when optimizing for RPS sending to remote CPUs, delay
> calling eth_type_trans() as long as possible.
>
> 1. In driver only start prefetch data to L2/L3 cache
> 2. Stack calls get_rps_cpu() and assume skb_get_hash() have HW hash
> 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
> 4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue
>
There is also GRO to consider, which might still be better done before
packet steering. One thing that could be exploited is that we probably
don't need to look at packet data for GRO until we get a second packet
that matches the hash.

>
> On the other hand, if the HW desc can provide skb->proto, and we can
> lazy eval skb->pkt_type, then it is okay to keep that responsibility in
> the driver (as the call to eth_type_trans() basically disappears).
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28 16:37                                       ` Tom Herbert
@ 2016-01-28 16:43                                         ` Eric Dumazet
  2016-01-28 17:04                                         ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-01-28 16:43 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, John Fastabend, Michael S. Tsirkin,
	David Miller, Or Gerlitz, Eric Dumazet,
	Linux Kernel Network Developers, Alexander Duyck,
	Alexei Starovoitov, Daniel Borkmann, Marek Majkowski,
	Hannes Frederic Sowa, Florian Westphal, Paolo Abeni,
	John Fastabend, Amir Vadai, Daniel Borkmann, Vladislav Yasevich

On Thu, 2016-01-28 at 08:37 -0800, Tom Herbert wrote:

> skbs are enqueued on a CPU queue one at a time through
> enqueue_to_backlog. It would be nice to do that as a batch of skbs.

Adding yet another layer and cache misses.

This might be a win for stress situations, not for nominal traffic,
when very few packets are delivered per NAPI poll.

For stress situations, we do not rely on RPS/RFS at all, but prefer RSS
and appropriate number of RX queues, to have true silos.

For the router case, where Jesper wants 15 Mpps on a single core,
RPS/RFS is not used.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage)
  2016-01-28 16:37                                       ` Tom Herbert
  2016-01-28 16:43                                         ` Eric Dumazet
@ 2016-01-28 17:04                                         ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-28 17:04 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, John Fastabend, Michael S. Tsirkin, David Miller,
	Or Gerlitz, Eric Dumazet, Linux Kernel Network Developers,
	Alexander Duyck, Alexei Starovoitov, Daniel Borkmann,
	Marek Majkowski, Hannes Frederic Sowa, Florian Westphal,
	Paolo Abeni, John Fastabend, Amir Vadai, Daniel Borkmann,
	Vladislav Yasevich, brouer

On Thu, 28 Jan 2016 08:37:07 -0800
Tom Herbert <tom@herbertland.com> wrote:

> On Thu, Jan 28, 2016 at 4:45 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thu, 2016-01-28 at 10:25 +0100, Jesper Dangaard Brouer wrote:
> >  
> >> Yes, that is exactly what I'm contemplating :-)  That is idea "(1)".
> >>
> >> A natural extension to this work, which I expect Tom will love, is to
> >> also use the idea for RPS.  Once we have a SKB list in stack/GRO-layer,
> >> then we could build a local sk_buff_head list for each remote CPU, by
> >> calling get_rps_cpu().   And then enqueue_list_to_backlog, by a
> >> skb_queue_splice_tail(&cpu_list, &cpu->sd->input_pkt_queue) call.
> >>
> >> This would amortize the cost of transferring packets to a remote CPU,
> >> which Eric AFAIK points out is costing approx ~133ns.
> >>  
> >
> > Jesper, RPS and RFS already defer sending the IPI and submit batches to
> > remote cpus.
> >
> > See commits
> >
> > e326bed2f47d0365da5a8faaf8ee93ed2d86325b ("rps: immediate send IPI in
> > process_backlog()")
> >
> > 88751275b8e867d756e4f86ae92afe0232de129f ("rps: shortcut
> > net_rps_action()")
> >
> > And of course all the discussions we had to come up with
> > 0a9627f2649a02bea165cfd529d7bcb625c2fcad ("rps: Receive Packet
> > Steering")
> >
> > The current state :
> >
> > net_rps_action_and_irq_enable() sends the IPI at the end of
> > net_rx_action() once all NAPI handlers have been called, and therefore
> > have accumulated packets and cook rps_ipi_list (via calls to
> > rps_ipi_queued() from enqueue_to_backlog())

Yes, thanks for pointing this out. Then we already have amortized the
IPI call. Great.

> > Adding another stage in the pipeline would not help.
> >  
> skbs are enqueued on a CPU queue one at a time through
> enqueue_to_backlog. It would be nice to do that as a batch of skbs.

Yes, this is what I was looking at doing: a bulk enqueue to the backlog,
thus amortizing the lock.  And if some remote CPU is reading/using
input_pkt_queue, then we don't bounce that cache line.
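
Something along these lines (sketch, not compiled; qlen/drop accounting
and the rps_ipi_queued() path are left out):

	static void enqueue_list_to_backlog(struct sk_buff_head *list, int cpu)
	{
		struct softnet_data *sd = &per_cpu(softnet_data, cpu);
		unsigned long flags;

		local_irq_save(flags);
		rps_lock(sd);		/* one lock grab for the whole bundle */

		skb_queue_splice_tail_init(list, &sd->input_pkt_queue);

		/* schedule the backlog NAPI only once per bundle */
		if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state))
			____napi_schedule(sd, &sd->backlog);

		rps_unlock(sd);
		local_irq_restore(flags);
	}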

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-01-20 23:27           ` Tom Herbert
  2016-01-21 11:27             ` Jesper Dangaard Brouer
  2016-01-21 12:23             ` Jesper Dangaard Brouer
@ 2016-02-02 16:13             ` Or Gerlitz
  2016-02-02 16:37               ` Eric Dumazet
  2 siblings, 1 reply; 59+ messages in thread
From: Or Gerlitz @ 2016-02-02 16:13 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, David Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Thu, Jan 21, 2016 at 1:27 AM, Tom Herbert <tom@herbertland.com> wrote:

> Unfortunately, the hardware hash from devices hasn't really lived up
> to its potential. The original intent of getting the hash from device
> was to be able to do packet steering (RPS and RFS) without touching
> the header. But this never was implemented. eth_type_trans touches
> headers and GRO is best when done before steering. Given the
> weaknesses of Toeplitz we talked about recently and that fact that
> Jenkins is really fast to compute, I am starting to think maybe we
> should always do a software hash and not rely on HW for it...

Could you provide some details on the weaknesses of Toeplitz?
FYI, the admin is able to configure non-default keys for Toeplitz
through ethtool.

Or.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: Optimizing instruction-cache, more packets at each stage
  2016-02-02 16:13             ` Or Gerlitz
@ 2016-02-02 16:37               ` Eric Dumazet
  0 siblings, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-02-02 16:37 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Tom Herbert, David Miller, Eric Dumazet, Jesper Dangaard Brouer,
	Linux Netdev List, Alexander Duyck, Alexei Starovoitov,
	Daniel Borkmann, Marek Majkowski, Hannes Frederic Sowa,
	Florian Westphal, Paolo Abeni, John Fastabend, Amir Vadai

On Tue, 2016-02-02 at 18:13 +0200, Or Gerlitz wrote:

> Could you provide some details on the weaknesses of Toeplitz?
> FYI, the admin is able to configure non-default keys for Toeplitz
> through ethtool.

Well, the default Toeplitz key is no longer used anyway, I hope.

d682d2bdc306 bnx2x: byte swap rss_key to comply to Toeplitz specs
4671fc6d47e0 net/mlx4_en: really allow to change RSS key
c33d23c21501 enic: use netdev_rss_key_fill() helper
6bf79cdddd50 vmxnet3: use netdev_rss_key_fill() helper
7a20db379ce7 sfc: use netdev_rss_key_fill() helper
b9d1ab7eb42e mlx4: use netdev_rss_key_fill() helper
9913c61c4486 ixgbe: use netdev_rss_key_fill() helper
eb31f8493eee igb: use netdev_rss_key_fill() helper
22f258a1cc2f i40e: use netdev_rss_key_fill() helper
c41a4fba4a22 fm10k: use netdev_rss_key_fill() helper
5c8d19da9508 e100e: use netdev_rss_key_fill() helper
1dcf7b1c5f57 be2net:use netdev_rss_key_fill() helper
0fa6aa4ac4e0 bna: use netdev_rss_key_fill() helper
396483564409 tg3: use netdev_rss_key_fill() helper
e3ec69ca80a2 bnx2x: use netdev_rss_key_fill() helper
b23063034f11 amd-xgbe: use netdev_rss_key_fill() helper
960fb622f851 net: provide a per host RSS key generic infrastructure
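
The driver side of that conversion is basically a one-liner at init time
(sketch; 40 bytes is the usual Toeplitz key size, adjust to what the HW
expects):

	u8 rss_key[40];

	netdev_rss_key_fill(rss_key, sizeof(rss_key));
	/* ... then program rss_key into the NIC (register writes / FW command) */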

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-02-02 16:37 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-15 13:22 Optimizing instruction-cache, more packets at each stage Jesper Dangaard Brouer
2016-01-15 13:32 ` Hannes Frederic Sowa
2016-01-15 14:17   ` Jesper Dangaard Brouer
2016-01-15 13:36 ` David Laight
2016-01-15 14:00   ` Jesper Dangaard Brouer
2016-01-15 14:38     ` Felix Fietkau
2016-01-18 11:54       ` Jesper Dangaard Brouer
2016-01-18 17:01         ` Eric Dumazet
2016-01-25  0:08         ` Florian Fainelli
2016-01-15 20:47 ` David Miller
2016-01-18 10:27   ` Jesper Dangaard Brouer
2016-01-18 16:24     ` David Miller
2016-01-20 22:20       ` Or Gerlitz
2016-01-20 23:02         ` Eric Dumazet
2016-01-20 23:27           ` Tom Herbert
2016-01-21 11:27             ` Jesper Dangaard Brouer
2016-01-21 12:49               ` Or Gerlitz
2016-01-21 13:57                 ` Jesper Dangaard Brouer
2016-01-21 18:56                 ` David Miller
2016-01-21 22:45                   ` Or Gerlitz
2016-01-21 22:59                     ` David Miller
2016-01-21 16:38               ` Eric Dumazet
2016-01-21 18:54               ` David Miller
2016-01-24 14:28                 ` Jesper Dangaard Brouer
2016-01-24 14:44                   ` Michael S. Tsirkin
2016-01-24 17:28                     ` John Fastabend
2016-01-25 13:15                       ` Bypass at packet-page level (Was: Optimizing instruction-cache, more packets at each stage) Jesper Dangaard Brouer
2016-01-25 17:09                         ` Tom Herbert
2016-01-25 17:50                           ` John Fastabend
2016-01-25 21:32                             ` Tom Herbert
2016-01-25 21:58                               ` John Fastabend
2016-01-25 22:10                             ` Jesper Dangaard Brouer
2016-01-27 20:47                               ` Jesper Dangaard Brouer
2016-01-27 21:56                                 ` Alexei Starovoitov
2016-01-28  9:52                                   ` Jesper Dangaard Brouer
2016-01-28 12:54                                     ` Eric Dumazet
2016-01-28 13:25                                     ` Eric Dumazet
2016-01-28 16:43                                     ` Tom Herbert
2016-01-28  2:50                                 ` Tom Herbert
2016-01-28  9:25                                   ` Jesper Dangaard Brouer
2016-01-28 12:45                                     ` Eric Dumazet
2016-01-28 16:37                                       ` Tom Herbert
2016-01-28 16:43                                         ` Eric Dumazet
2016-01-28 17:04                                         ` Jesper Dangaard Brouer
2016-01-24 20:09                   ` Optimizing instruction-cache, more packets at each stage Tom Herbert
2016-01-24 21:41                     ` John Fastabend
2016-01-24 23:50                       ` Tom Herbert
2016-01-21 12:23             ` Jesper Dangaard Brouer
2016-01-21 16:38               ` Tom Herbert
2016-01-21 17:48                 ` Eric Dumazet
2016-01-22 12:33                   ` Jesper Dangaard Brouer
2016-01-22 14:33                     ` Eric Dumazet
2016-01-22 17:07                     ` Tom Herbert
2016-01-22 17:17                       ` Jesper Dangaard Brouer
2016-02-02 16:13             ` Or Gerlitz
2016-02-02 16:37               ` Eric Dumazet
2016-01-18 16:53     ` Eric Dumazet
2016-01-18 17:36     ` Tom Herbert
2016-01-18 17:49       ` Jesper Dangaard Brouer
